Sort textlines with missing indices

b2m commented 3 years ago

Python's sorted function fails with a TypeError when called on a list that mixes None and integers:

>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'

Therefore we are using float('inf') instead of None in case of missing textline indices.
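A minimal sketch of the idea, not the exact code in this pull request: the helper name index_sort_key is hypothetical and only illustrates how float('inf') can stand in for a missing index so that such entries sort last.

```python
def index_sort_key(index):
    """Return a sortable key; a missing index (None) sorts after every integer."""
    return float("inf") if index is None else index


# Indices as they might be read from TextEquiv nodes; None means "no index attribute".
indices = [2, None, 1]
print(sorted(indices, key=index_sort_key))  # [1, 2, None]
```

Using a key function leaves the original values untouched, so callers can still tell a missing index apart from a real one after sorting.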

mikegerber commented 3 years ago

The code looks good, I'm just not sure what a missing index value means in PAGE?

mikegerber commented 3 years ago

The difference between no index and index=1 seems to be undefined in the schema (https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd). Do you have any real data that has this specific problem? That might illuminate it.

b2m commented 3 years ago

So I already planned to investigate this further.

Here is what I was experimenting with:

OCR via ocrd-tesserocr-recognize and ocrd-calamari-recognize on word level
ocrd-cis-align
ocrd-cis-postcorrect
Comparing the results from each OCR engine and the corrected one with Ground Truth via ocrd-dinglehopper.

Extract from ocrd-tesserocr-recognize:

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
    <pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
        <pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
        <pc:TextEquiv conf="0.909321594238281">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
    </pc:Word>
    <pc:TextEquiv conf="0.909321594238281">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

Extract from ocrd-calamari-recognize:

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
    <pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
        <pc:Coords points="751,632 927,630 927,666 751,668"/>
        <pc:TextEquiv>
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
    </pc:Word>
    <pc:TextEquiv conf="0.998966634273529">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

Extract from ocrd-cis-align:

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
    <pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
        <pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
        <pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-word-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
        <pc:TextEquiv index="2" conf="1." dataType="ocrd-cis-word-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
    </pc:Word>
    <pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000"/>
    <pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

Extract from ocrd-cis-postcorrect:

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <TextEquiv dataType="OCR-D-CIS-POST-CORRECTION" index="1">
        <Unicode>Truppenteil:</Unicode>
    </TextEquiv>
    <pc:Coords  points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
    <pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
        <pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
        <pc:TextEquiv conf="0.909321594238281" dataType="ocrd-cis-word-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000" index="1">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
        <pc:TextEquiv conf="1." dataType="ocrd-cis-word-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000" index="2">
            <pc:Unicode>Truppenteil:</pc:Unicode>
        </pc:TextEquiv>
    </pc:Word>
    <pc:TextEquiv conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000" index="2">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000" index="3"/>
    <pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

Somehow ocrd-cis-align has written the results from ocrd-calamari-recognize to two different TextEquiv nodes: one containing the index and the other the text content. This produces the described scenario of having both nodes with an index and nodes without.
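The mixed scenario can be checked with a short script. This is only a sketch using Python's standard xml.etree.ElementTree: the namespace URI, the shortened sample line, and the helper summarize_textequivs are assumptions for illustration, not part of dinglehopper or ocrd_cis.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the PAGE 2019-07-15 schema; adjust to match your files.
PAGE_URI = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_URI}

# Shortened version of the ocrd-cis-align line (Coords and Word removed).
SAMPLE = f"""
<pc:TextLine xmlns:pc="{PAGE_URI}" id="line0000">
    <pc:TextEquiv index="1" conf="0.909321594238281">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="2"/>
    <pc:TextEquiv conf="0.998966634273529">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>
""".strip()


def summarize_textequivs(line):
    """Return (index, text) pairs for the direct TextEquiv children of a line."""
    pairs = []
    for textequiv in line.findall("pc:TextEquiv", NS):
        index = textequiv.get("index")
        text = textequiv.findtext("pc:Unicode", default=None, namespaces=NS)
        pairs.append((int(index) if index is not None else None, text))
    return pairs


pairs = summarize_textequivs(ET.fromstring(SAMPLE))
# True when some nodes carry an index and others do not.
has_mixed_indices = len({index is None for index, _ in pairs}) == 2
print(pairs, has_mixed_indices)
```

Sorting the index values of such a line directly is what triggers the TypeError, because None ends up next to integers.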

mikegerber commented 3 years ago

So the issue is that ocrd-cis-align is writing this output (Coords and Word removed for clarity):

<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
    <pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000"/>
    <pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
        <pc:Unicode>Truppenteil:</pc:Unicode>
    </pc:TextEquiv>
</pc:TextLine>

This looks like a bug in ocrd-cis-align because it seems to produce (after the first correct-looking TextEquiv) two TextEquivs, one with an index and without text, one without an index and with text.

If you agree, we should probably open an issue at https://github.com/cisocrgroup/ocrd_cis

b2m commented 3 years ago

So the issue is that ocrd-cis-align is writing this output (Coords and Word removed for clarity):

We have two issues... the producer ocrd-cis-align producing unexpected content and the consumer dinglehopper crashing with a TypeError on this content. This pull request mitigates the problem on the side of dinglehopper.

This looks like a bug in ocrd-cis-align because it seems to produce (after the first correct-looking TextEquiv) two TextEquivs, one with an index and without text, one without an index and with text.

Exactly.

If you agree, we should probably open an issue at https://github.com/cisocrgroup/ocrd_cis

Yes, I was waiting on some time in between meetings to produce a smaller, reproducible example before opening a bug (cisocrgroup/ocrd_cis#76) with ocrd_cis.

mikegerber commented 3 years ago

(I accidentally edited your comment, but I think I managed to restore it...)

and the consumer dinglehopper crashing with a TypeError on this content. This pull request mitigates the problem on the side of dinglehopper.

Yes this is a second issue, that's why the PR is open :), I was just writing my comment on this:

I don't think it should consume this without a warning/error message. What I was thinking:

If there is only one TextEquiv, take it
If there are multiple TextEquiv without index but with confidence values, WARN and take the one with the highest confidence (it's debatable whether this should be done; I believe only index matters in PAGE)
If there are multiple TextEquiv with index, take the one with the lowest index, as per PAGE schema
If there are multiple TextEquiv, some without index and some with index, ERROR, throw away any with no index and take the one with the lowest index
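The rules above could be sketched roughly like this. It is an illustration only: the TextEquiv dataclass and choose_textequiv are hypothetical names, not dinglehopper's actual API, and the handling of candidates that have neither index nor confidence (the first one wins) is an assumption the list does not cover.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger("textequiv_sketch")


@dataclass
class TextEquiv:
    """Hypothetical stand-in for a parsed PAGE TextEquiv node."""
    text: str
    index: Optional[int] = None
    conf: Optional[float] = None


def choose_textequiv(candidates: List[TextEquiv]) -> Optional[TextEquiv]:
    """Pick one TextEquiv following the proposed rules."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    indexed = [c for c in candidates if c.index is not None]
    if indexed and len(indexed) < len(candidates):
        # Mixed case: discard the nodes without index, then fall through.
        log.error("TextEquivs with and without index found; ignoring those without")
    if indexed:
        return min(indexed, key=lambda c: c.index)

    # No index anywhere: fall back to confidence (assumption: missing conf ranks lowest).
    log.warning("Multiple TextEquivs without index; using highest confidence")
    return max(candidates, key=lambda c: c.conf if c.conf is not None else float("-inf"))


mixed = [TextEquiv("a", conf=0.99), TextEquiv("b", index=3), TextEquiv("c", index=2)]
print(choose_textequiv(mixed).text)  # c
```

Keeping the log calls next to each branch makes it easy to change the severity later without touching the selection logic.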

b2m commented 3 years ago

So I tried to convert your thinking into tests and code and it became more complex than the first solution. =)

I am not sure about the severity of the log level. I would suggest using info for confidence sorting and warn if we encounter mixed index/no-index nodes. Because this mixed mode is something we can handle and the user still gets a reasonable result.

mikegerber commented 3 years ago

Awesome! I'll merge it right away!

Minor nit-picks:

There's some unrelated re-formatting in the PR I do not like much, but I think I can live with it.
Because this mixed mode is something we can handle and the user still gets a reasonable result.

I must say that I do not agree fully. This happens when the input is invalid IMHO, we can just guess what's right, so I think that an actual error is appropriate. But for now, the warning is there and it's a lot better than crashing :+1:

Whether get_textequiv_unicode should return '' or None is something I will probably revisit later

b2m commented 3 years ago

No "minor" nit-picks because I was thinking about the same things when implementing:

Reformatting => mostly due to new .editorconfig => limit to line length of 90. So I thought this would be ok =)
Yes, No... Maybe? =)
None would be more explicit about whether there really was an empty entry or no text at all. But as we are not using this information right now, I preferred a single return type.

mikegerber commented 3 years ago

Reformatting => mostly due to new .editorconfig => limit to line length of 90. So I thought this would be ok =)

There are 3 major unsolved problems in computer science:

P = NP?
How to print a damn document without a paper jam
Let n software developers agree on a common code formatting style for all n>1

mikegerber commented 3 years ago

Joking aside, I think I'll just use the black code formatter in the future, reasonable results and no more arguing about bike sheds... eh code formatting.

qurator-spk / dinglehopper

Sort textlines with missing indices #37