qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Sort textlines with missing indices #37

Closed b2m closed 3 years ago

b2m commented 3 years ago

Python's built-in sorted function fails with a TypeError when the input mixes None and integers:

>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'

Therefore we are using float('inf') instead of None in case of missing textline indices.
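A minimal sketch of the idea, not the actual dinglehopper code: the helper name `sort_key` is made up for illustration. Mapping a missing index to `float('inf')` keeps the comparison numeric and pushes entries without an index to the end.

```python
def sort_key(index):
    """Map a missing index (None) to +inf so it sorts after every real index."""
    # Hypothetical helper, only meant to illustrate the float('inf') fallback.
    return float("inf") if index is None else index


indices = [None, 2, 1]
ordered = sorted(indices, key=sort_key)
print(ordered)  # [1, 2, None]
```

Because `sorted` is stable, several entries without an index keep their original document order relative to each other.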

mikegerber commented 3 years ago

The code looks good. I'm just not sure what a missing index value means in PAGE.

mikegerber commented 3 years ago

The difference between no index and index=1 seems to be undefined in the schema (https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd). Do you have any real data that shows this specific problem? Maybe that would illuminate it.

b2m commented 3 years ago

I had already planned to investigate this further.

Here is what I was experimenting with:

  1. OCR via ocrd-tesserocr-recognize and ocrd-calamari-recognize on word level
  2. ocrd-cis-align
  3. ocrd-cis-postcorrect
  4. Comparing the results from each OCR engine and the corrected one with Ground Truth via ocrd-dinglehopper.

Extract from ocrd-tesserocr-recognize:

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
    <pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
        <pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
        <pc:TextEquiv conf="0.909321594238281">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
    </pc:Word>
    <pc:TextEquiv conf="0.909321594238281">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

Extract from ocrd-calamari-recognize:

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
    <pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
        <pc:Coords points="751,632 927,630 927,666 751,668"/>
        <pc:TextEquiv>
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
    </pc:Word>
    <pc:TextEquiv conf="0.998966634273529">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

Extract from ocrd-cis-align:

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
    <pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
        <pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
        <pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-word-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
        <pc:TextEquiv index="2" conf="1." dataType="ocrd-cis-word-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
    </pc:Word>
    <pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000"/>
    <pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

Extract from ocrd-cis-postcorrect:

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <TextEquiv dataType="OCR-D-CIS-POST-CORRECTION" index="1">
        <Unicode>Truppenteil:</Unicode>
    </TextEquiv>
    <pc:Coords  points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
    <pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
        <pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
        <pc:TextEquiv conf="0.909321594238281" dataType="ocrd-cis-word-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000" index="1">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
        <pc:TextEquiv conf="1." dataType="ocrd-cis-word-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000" index="2">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
    </pc:Word>
    <pc:TextEquiv conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000" index="2">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000" index="3"/>
    <pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

Somehow ocrd-cis-align has written the results from ocrd-calamari-recognize to two different TextEquiv nodes: one contains the index and the other the text content. This produces the scenario described above, where nodes with an index and nodes without one occur side by side.
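To make the mixed case easier to spot, here is a rough sketch using only the standard library. It is not part of the PR: the helper `summarize_textequivs` is hypothetical, the XML is a reduced version of the ocrd-cis-align extract, and the namespace URI is assumed to match the 2019-07-15 PAGE schema version.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI of the 2019-07-15 PAGE schema version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Reduced version of the ocrd-cis-align output (Coords and Word removed).
XML = """
<pc:TextLine xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" id="line0000">
    <pc:TextEquiv index="1" conf="0.909321594238281">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="2"/>
    <pc:TextEquiv conf="0.998966634273529">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>
"""


def summarize_textequivs(textline):
    """Return (index, text) for each direct TextEquiv child of a TextLine."""
    summary = []
    for te in textline.findall("pc:TextEquiv", NS):
        index = te.get("index")
        text = te.findtext("pc:Unicode", default=None, namespaces=NS)
        summary.append((int(index) if index is not None else None, text))
    return summary


line = ET.fromstring(XML.strip())
equivs = summarize_textequivs(line)
# True if some nodes carry an index while others do not.
has_mixed_indices = len({idx is None for idx, _ in equivs}) == 2
print(equivs)
print(has_mixed_indices)
```

The second node carries an index but no text, and the third carries text but no index, which is exactly the combination that makes a plain `sorted` call on the indices fail.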

mikegerber commented 3 years ago

So the issue is that ocrd-cis-align is writing this output (Coords and Word removed for clarity):

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000"/>
    <pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

This looks like a bug in ocrd-cis-align because it seems to produce (after the first correct-looking TextEquiv) two TextEquivs, one with an index and without text, one without an index and with text.

If you agree, we should probably open an issue at https://github.com/cisocrgroup/ocrd_cis

b2m commented 3 years ago

> So the issue is that ocrd-cis-align is writing this output (Coords and Word removed for clarity):

We have two issues... the producer ocrd-cis-align producing unexpected content and the consumer dinglehopper crashing with a TypeError on this content. This pull request mitigates the problem on the side of dinglehopper.

> This looks like a bug in ocrd-cis-align because it seems to produce (after the first correct-looking TextEquiv) two TextEquivs, one with an index and without text, one without an index and with text.

Exactly.

> If you agree, we should probably open an issue at https://github.com/cisocrgroup/ocrd_cis

Yes, I was waiting for some time between meetings to produce a smaller, reproducible example before opening a bug (cisocrgroup/ocrd_cis#76) with ocrd_cis.

mikegerber commented 3 years ago

(I accidentally edited your comment, but I think I managed to restore it...)

> and the consumer dinglehopper crashing with a TypeError on this content. This pull request mitigates the problem on the side of dinglehopper.

Yes, this is a second issue, and that's why the PR is open :). I was just writing my comment on this:

I don't think it should consume this without a warning/error message. What I was thinking:

b2m commented 3 years ago

So I tried to convert your thinking into tests and code and it became more complex than the first solution. =)

I am not sure about the severity of the log message. I would suggest using info for confidence sorting and warn when we encounter mixed index/no-index nodes, because this mixed mode is something we can handle and the user still gets a reasonable result.
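Roughly, the behaviour discussed here could look like the sketch below. This is an illustration under assumptions, not the code in this PR: `Candidate` and `pick_text` are hypothetical names, and returning an empty string for "no text" is just one possible choice.

```python
import logging
from typing import NamedTuple, Optional

log = logging.getLogger("textequiv_sketch")  # hypothetical logger name


class Candidate(NamedTuple):
    """Simplified stand-in for one TextEquiv node."""
    index: Optional[int]
    conf: Optional[float]
    text: str


def pick_text(candidates):
    """Pick the text of the preferred candidate, or '' if there is none."""
    if not candidates:
        return ""
    indexed = [c for c in candidates if c.index is not None]
    if indexed:
        if len(indexed) < len(candidates):
            # Mixed mode: still usable, but worth a warning.
            log.warning("TextEquiv nodes with and without index found; "
                        "nodes without index are sorted last.")
        ordered = sorted(
            candidates,
            key=lambda c: float("inf") if c.index is None else c.index,
        )
        return ordered[0].text
    # No index at all: fall back to the highest confidence.
    log.info("No TextEquiv index found; sorting by confidence instead.")
    best = max(
        candidates,
        key=lambda c: float("-inf") if c.conf is None else c.conf,
    )
    return best.text


mixed = [
    Candidate(index=1, conf=0.909321594238281, text="Truppenteil:"),
    Candidate(index=2, conf=None, text=""),
    Candidate(index=None, conf=0.998966634273529, text="Truppenteil:"),
]
print(pick_text(mixed))
```

With the mixed input from ocrd-cis-align, the sketch logs a warning and returns the text of the node with the lowest index instead of raising a TypeError.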

mikegerber commented 3 years ago

Awesome! I'll merge it right away!

Minor nit-picks:

  1. There's some unrelated re-formatting in the PR I do not like much, but I think I can live with it.

  2. > Because this mixed mode is something we can handle and the user still gets a reasonable result.

I must say that I do not fully agree. IMHO this happens when the input is invalid and we can only guess what's right, so I think that an actual error is appropriate. But for now, the warning is there and it's a lot better than crashing :+1:

  3. Whether get_textequiv_unicode should return '' or None is something I will probably revisit later.

b2m commented 3 years ago

No "minor" nit-picks because I was thinking about the same things when implementing:

  1. Reformatting => mostly due to new .editorconfig => limit to line length of 90. So I thought this would be ok =)
  2. Yes, No... Maybe? =)
  3. None would be more explicit as to whether there really was an empty entry or no text. But as we are not using this information right now, I preferred a single return type.

mikegerber commented 3 years ago

> 1. Reformatting => mostly due to new .editorconfig => limit to line length of 90. So I thought this would be ok =)

There are 3 major unsolved problems in computer science:

mikegerber commented 3 years ago

Joking aside, I think I'll just use the black code formatter in the future: reasonable results and no more arguing about bike sheds... eh, code formatting.