google-research-datasets / hiertext

The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.
Creative Commons Attribution Share Alike 4.0 International

Question about illegible paragraphs #14

Closed Asafgendler closed 1 year ago

Asafgendler commented 1 year ago

Hey, thanks for the great dataset.

I wanted to ask whether you use the illegible paragraphs when training the line grouping head (the layout branch), or whether those paragraphs are ignored.

Jyouhou commented 1 year ago

Yes. Check these:

Paragraph loss: https://github.com/tensorflow/models/blob/1879fa0c7f59b89f08abb89826e919d5ed4dd9ce/official/projects/unified_detector/modeling/universal_detector.py#L769

Paragraph loss mask: https://github.com/tensorflow/models/blob/1879fa0c7f59b89f08abb89826e919d5ed4dd9ce/official/projects/unified_detector/modeling/universal_detector.py#L826

Asafgendler commented 1 year ago

Can you explain, then, what the gt_affinity_mask is for? Does it make the loss ignore illegible lines, or does it make the loss ignore lines that belong to illegible paragraphs?

Jyouhou commented 1 year ago

The latter: it makes the loss ignore lines that belong to illegible paragraphs (or that do not have a paragraph label).
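To make the masking behaviour concrete, here is a minimal pure-Python sketch. It is not the code in `universal_detector.py`; the helper names `build_affinity_targets` and `masked_bce`, the input layout, and the choice to mask every pair that involves an ignored line are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def build_affinity_targets(
    paragraph_ids: List[Optional[int]],
    paragraph_legible: Dict[int, bool],
) -> Tuple[Matrix, Matrix]:
    """Hypothetical helper: build a line-to-line affinity target and its mask.

    paragraph_ids[i] is the paragraph id of line i, or None if the line
    has no paragraph label. paragraph_legible maps paragraph id -> legible.
    """
    n = len(paragraph_ids)
    # A line contributes to the loss only if it has a paragraph label
    # and that paragraph is legible (assumption based on the answer above).
    valid = [
        pid is not None and paragraph_legible.get(pid, False)
        for pid in paragraph_ids
    ]
    affinity = [[0.0] * n for _ in range(n)]
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if valid[i] and valid[j]:
                mask[i][j] = 1.0
                affinity[i][j] = float(paragraph_ids[i] == paragraph_ids[j])
    return affinity, mask


def masked_bce(pred: Matrix, target: Matrix, mask: Matrix, eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over unmasked entries only."""
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred[i])):
            if mask[i][j]:
                p = min(max(pred[i][j], eps), 1.0 - eps)
                t = target[i][j]
                total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
                count += 1
    return total / count if count else 0.0
```

With four lines where lines 0 and 1 share a legible paragraph, line 2 sits in an illegible paragraph, and line 3 has no label, only the four pairs among lines 0 and 1 remain in the mask, so whatever the model predicts for lines 2 and 3 does not affect the loss.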

Asafgendler commented 1 year ago

Thanks for the clarification. Two follow-up questions:

  1. How is it possible that a line won't have a paragraph label in this dataset?
  2. Why are we ignoring lines in illegible paragraphs? Can't we be sure that they do belong to the same group? I thought the illegible flag only meant that the paragraph vertices can't be trusted.

Jyouhou commented 1 year ago

A1: It's the annotator's decision; sometimes it's because the text is too dense and messy.

A2: An illegible paragraph means that the paragraph group might be a merge of multiple groups, i.e. the annotators did not annotate the paragraph labels for these lines. Paragraph vertices are not meant to be accurate: annotators annotate paragraph grouping by drawing polygons, and lines falling into the same polygon belong to the same group. So these vertices do not represent the paragraph shape; they are only used to group the lines.
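The grouping rule above can be sketched roughly as follows. This is only an illustration: the thread does not say how "falling into" a polygon is decided, so the sketch assumes a line belongs to the first paragraph polygon that contains the centroid of its vertices, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: List[Point]) -> bool:
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Only edges that straddle the horizontal line through y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def centroid(vertices: List[Point]) -> Point:
    """Mean of the vertices (an assumed proxy for the line's position)."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def group_lines(
    line_polygons: List[List[Point]],
    paragraph_polygons: List[List[Point]],
) -> List[Optional[int]]:
    """Return, for each line, the index of the paragraph polygon containing it.

    Lines that fall inside no paragraph polygon get None (no paragraph label).
    """
    groups: List[Optional[int]] = []
    for line in line_polygons:
        cx, cy = centroid(line)
        group = None
        for idx, paragraph in enumerate(paragraph_polygons):
            if point_in_polygon(cx, cy, paragraph):
                group = idx
                break
        groups.append(group)
    return groups


paragraphs = [
    [(0, 0), (10, 0), (10, 10), (0, 10)],
    [(20, 0), (30, 0), (30, 10), (20, 10)],
]
lines = [
    [(1, 1), (9, 1), (9, 3), (1, 3)],
    [(1, 5), (9, 5), (9, 7), (1, 7)],
    [(21, 1), (29, 1), (29, 3), (21, 3)],
    [(40, 40), (50, 40), (50, 42), (40, 42)],
]
print(group_lines(lines, paragraphs))  # [0, 0, 1, None]
```

Note that only membership matters here: the paragraph polygon could be drawn loosely around its lines and the grouping would come out the same, which is why its vertices say little about the paragraph's actual shape.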

Asafgendler commented 1 year ago

Thank you very much