google-research-datasets / hiertext

The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.
Creative Commons Attribution Share Alike 4.0 International

Empty paragraphs #8

Closed Asafgendler closed 1 year ago

Asafgendler commented 1 year ago

Hello and thanks for the great dataset.

I have noticed that the data contains many empty paragraphs, that is, paragraphs that have no lines or words inside them but do have a defining polygon.

I would really appreciate it if you could explain what those paragraph objects mean.

Thanks,

Jyouhou commented 1 year ago

Can you point me to the image IDs?

Also, as noted in the README: "legible": true, // If false, the region defined by vertices are considered as do-not-care in paragraph level evaluation.

What are their "legible" fields?

Asafgendler commented 1 year ago

They do not have any legible field; their only field is the vertices field. I just wondered why those paragraphs exist.

Jyouhou commented 1 year ago

They represent text regions that are highly illegible and should be ignored in paragraph evaluation.
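
For readers who want to locate these paragraphs themselves, here is a minimal sketch rather than code from the repository. It assumes each image entry is a dict with an "image_id" and a "paragraphs" list, and that each paragraph has "vertices" plus optional "legible" and "lines" fields, as discussed above. The function names and the sample data are hypothetical.

```python
def is_do_not_care(paragraph):
    """Return True if a paragraph should be ignored in paragraph-level evaluation.

    Assumption: a paragraph with no "legible" field and no lines (only
    "vertices") is treated the same as one with "legible": false.
    """
    has_lines = bool(paragraph.get("lines"))
    return not paragraph.get("legible", has_lines)


def find_empty_paragraphs(annotations):
    """Return (image_id, paragraph_index) pairs for paragraphs without lines."""
    empty = []
    for image in annotations:
        for index, paragraph in enumerate(image.get("paragraphs", [])):
            if not paragraph.get("lines"):
                empty.append((image["image_id"], index))
    return empty


# Hypothetical sample in the assumed layout, not real dataset content.
sample_annotations = [
    {
        "image_id": "img_a",
        "paragraphs": [
            {
                "vertices": [[0, 0], [10, 0], [10, 5], [0, 5]],
                "legible": True,
                "lines": [{"vertices": [[0, 0], [10, 0], [10, 5], [0, 5]]}],
            },
            # An "empty" paragraph: only the polygon is present.
            {"vertices": [[20, 20], [30, 20], [30, 25], [20, 25]]},
        ],
    }
]

empty_paragraphs = find_empty_paragraphs(sample_annotations)
```

To run this on a real annotation file, load it with `json.load` and pass in the list of per-image entries; the exact top-level key depends on the file layout described in the README.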

Asafgendler commented 1 year ago

OK, thanks. Are all illegible entities also excluded from training?

Jyouhou commented 1 year ago

Lines/words are used. Illegible paragraphs are masked. Please refer to the code.
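
As a rough illustration of what masking could look like, the sketch below builds a per-pixel mask that is 0 inside illegible paragraph polygons and 1 elsewhere. This is not the repository's implementation, which the thread does not show. It is a plain-Python even-odd (ray casting) rasterizer written for clarity, and the function names are hypothetical. It assumes the same paragraph layout as above, with "vertices" given as [x, y] pairs.

```python
def point_in_polygon(x, y, vertices):
    """Even-odd ray casting test for a point against a polygon."""
    inside = False
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def build_ignore_mask(width, height, paragraphs):
    """Return a height x width grid: 0 inside illegible paragraphs, 1 elsewhere."""
    mask = [[1] * width for _ in range(height)]
    for paragraph in paragraphs:
        # Assumption: no "legible" field and no lines means do-not-care.
        legible = paragraph.get("legible", bool(paragraph.get("lines")))
        if legible:
            continue
        for row in range(height):
            for col in range(width):
                # Test the pixel center against the paragraph polygon.
                if point_in_polygon(col + 0.5, row + 0.5, paragraph["vertices"]):
                    mask[row][col] = 0
    return mask


# Hypothetical 4x4 image with one empty paragraph covering the top-left 2x2 area.
example_mask = build_ignore_mask(
    4, 4, [{"vertices": [[0, 0], [2, 0], [2, 2], [0, 2]]}]
)
```

The nested loops cost O(width × height × vertices) per paragraph, so a real pipeline would more likely use a library polygon-fill routine. A mask like this could then be multiplied into a per-pixel loss so that illegible regions contribute nothing.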

Asafgendler commented 1 year ago

Thanks