Closed b2m closed 3 years ago
The code looks good; I'm just not sure what a missing index value means in PAGE. The difference between no index and index=1 seems to be undefined (https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd). Do you have any real data that has this specific problem? Maybe that illuminates the problem.
So I already planned to investigate this further.
Here is what I was experimenting with:
Extract from ocrd-tesserocr-recognize:
<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
<pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
<pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
<pc:TextEquiv conf="0.909321594238281">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:Word>
<pc:TextEquiv conf="0.909321594238281">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:TextLine>
Extract from ocrd-calamari-recognize:
<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
<pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
<pc:Coords points="751,632 927,630 927,666 751,668"/>
<pc:TextEquiv>
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:Word>
<pc:TextEquiv conf="0.998966634273529">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:TextLine>
Extract from ocrd-cis-align:
<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
<pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
<pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
<pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-word-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
<pc:TextEquiv index="2" conf="1." dataType="ocrd-cis-word-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:Word>
<pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
<pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000"/>
<pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:TextLine>
Extract from ocrd-cis-postcorrect:
<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<TextEquiv dataType="OCR-D-CIS-POST-CORRECTION" index="1">
<Unicode>Truppenteil:</Unicode>
</TextEquiv>
<pc:Coords points="757,632 737,640 739,660 745,663 771,663 795,666 842,662 914,661 967,665 972,660 972,639 967,634 914,631 912,631 804,635"/>
<pc:Word id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000">
<pc:Coords points="915,661 929,661 929,640 751,642 751,663 850,661 914,661"/>
<pc:TextEquiv conf="0.909321594238281" dataType="ocrd-cis-word-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000" index="1">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
<pc:TextEquiv conf="1." dataType="ocrd-cis-word-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000_word0000" index="2">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:Word>
<pc:TextEquiv conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000" index="2">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
<pc:TextEquiv dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000" index="3"/>
<pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:TextLine>
Somehow the results from ocrd-calamari-recognize have been written to two different TextEquiv nodes by ocrd-cis-align: one containing the index and the other the text content. This produces the described scenario of having both nodes with an index and nodes without.
So the issue is that ocrd-cis-align is writing this output (Coords and Word removed for clarity):
<pc:TextLine id="FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:TextEquiv index="1" conf="0.909321594238281" dataType="ocrd-cis-line-alignment-master-ocr" dataTypeDetails="OCR-D-OCR-TESS/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
<pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000"/>
<pc:TextEquiv conf="0.998966634273529" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-CAL-ANTIQUA/FILE_OCR-D-DESKEW-PAGE_0028_region0029_line0000">
<pc:Unicode>Truppenteil:</pc:Unicode>
</pc:TextEquiv>
</pc:TextLine>
This looks like a bug in ocrd-cis-align because it seems to produce (after the first correct-looking TextEquiv) two TextEquivs, one with an index and without text, one without an index and with text.
If you agree, we should probably open an issue at https://github.com/cisocrgroup/ocrd_cis
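To make the suspected pattern easy to spot in other files, here is a minimal sketch that counts the two suspicious kinds of TextEquiv on a text line. It is not code from dinglehopper or ocrd_cis: the function name `classify_textequivs` and the shortened sample are hypothetical, and the namespace URI is an assumption based on the 2019-07-15 schema linked above.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI of the PAGE 2019-07-15 schema.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Shortened, hypothetical sample modelled on the ocrd-cis-align extract.
SAMPLE = """<pc:TextLine
    xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
    id="line0000">
  <pc:TextEquiv index="1" conf="0.909321594238281">
    <pc:Unicode>Truppenteil:</pc:Unicode>
  </pc:TextEquiv>
  <pc:TextEquiv index="2"/>
  <pc:TextEquiv conf="0.998966634273529">
    <pc:Unicode>Truppenteil:</pc:Unicode>
  </pc:TextEquiv>
</pc:TextLine>"""


def classify_textequivs(line):
    """Count direct TextEquiv children that look suspicious.

    Returns a tuple (indexed_without_text, text_without_index).
    """
    indexed_without_text = 0
    text_without_index = 0
    for textequiv in line.findall("pc:TextEquiv", NS):
        has_index = textequiv.get("index") is not None
        unicode_el = textequiv.find("pc:Unicode", NS)
        has_text = unicode_el is not None and bool(unicode_el.text)
        if has_index and not has_text:
            indexed_without_text += 1
        elif has_text and not has_index:
            text_without_index += 1
    return indexed_without_text, text_without_index


line = ET.fromstring(SAMPLE)
print(classify_textequivs(line))  # (1, 1)
```

A result other than `(0, 0)` would suggest that the line mixes indexed and non-indexed TextEquiv nodes in the way described here.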
So the issue is that ocrd-cis-align is writing this output (Coords and Word removed for clarity):
We have two issues... the producer ocrd-cis-align producing unexpected content and the consumer dinglehopper crashing with a TypeError on this content. This pull request mitigates the problem on the side of dinglehopper.
This looks like a bug in ocrd-cis-align because it seems to produce (after the first correct-looking TextEquiv) two TextEquivs, one with an index and without text, one without an index and with text.
Exactly.
If you agree, we should probably open an issue at https://github.com/cisocrgroup/ocrd_cis
Yes, I was waiting for some time in between meetings to produce a smaller, reproducible example before opening a bug (cisocrgroup/ocrd_cis#76) with ocrd_cis.
(I accidentally edited your comment, but I think I managed to restore it...)
and the consumer dinglehopper crashing with a TypeError on this content. This pull request mitigates the problem on the side of dinglehopper.
Yes this is a second issue, that's why the PR is open :), I was just writing my comment on this:
I don't think it should consume this without a warning/error message. What I was thinking:
So I tried to convert your thinking into tests and code and it became more complex than the first solution. =)
I am not sure about the severity of the log message. I would suggest using info for confidence sorting and warn if we encounter mixed index/no-index nodes. Because this mixed mode is something we can handle and the user still gets a reasonable result.
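The proposed log levels could look roughly like the following sketch. It is not the code from this pull request: `pick_text`, the tuple layout and the logger name are hypothetical, and the fallback order (index first, then confidence) is an assumption drawn from the discussion.

```python
import logging

log = logging.getLogger("textequiv_sketch")  # hypothetical logger name


def pick_text(candidates):
    """Pick one text from (index, conf, text) tuples.

    index and conf may be None when the attribute is missing.
    """
    if not candidates:
        return ""
    indices = [index for index, _, _ in candidates]
    if any(index is not None for index in indices):
        if any(index is None for index in indices):
            # Mixed mode: still usable, but worth a warning.
            log.warning("TextEquiv nodes with and without index found, "
                        "using the lowest index.")
        # A missing index sorts last by mapping it to infinity.
        best = min(candidates,
                   key=lambda c: float("inf") if c[0] is None else c[0])
    else:
        log.info("No index attributes found, using the highest confidence.")
        # A missing confidence counts as lower than any real value.
        best = max(candidates,
                   key=lambda c: -1.0 if c[1] is None else c[1])
    return best[2]


print(pick_text([(1, 0.91, "Truppenteil:"), (None, 0.99, "Truppenteil!")]))
```

With the mixed input in the last line, the sketch logs a warning and returns the text of the node with index 1.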
Awesome! I'll merge it right away!
Minor nit-picks:
There's some unrelated re-formatting in the PR I do not like much, but I think I can live with it.
Because this mixed mode is something we can handle and the user still gets a reasonable result.
I must say that I do not fully agree. IMHO this happens when the input is invalid and we can only guess what's right, so I think that an actual error is appropriate. But for now, the warning is there and it's a lot better than crashing :+1:
Whether get_textequiv_unicode should return '' or None is something I will probably revisit later.
No "minor" nit-picks because I was thinking about the same things when implementing:
- Reformatting => mostly due to the new .editorconfig => line length limited to 90. So I thought this would be ok =)
There are 3 major unsolved problems in computer science:
Joking aside, I think I'll just use the black code formatter in the future, reasonable results and no more arguing about bike sheds... eh code formatting.
Python's sorted function will fail with a TypeError when called with a mix of None and integers. Therefore we are using float('inf') instead of None in case of missing textline indices.
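A small illustration of that behaviour, as a sketch rather than the code from this pull request (the helper name `index_key` is hypothetical):

```python
indices = [2, None, 1]

# Comparing None with an int raises a TypeError in Python 3.
try:
    sorted(indices)
except TypeError as exc:
    print("sorted failed:", exc)


def index_key(index):
    """Map a missing index to infinity so it sorts after every real index."""
    return float("inf") if index is None else index


ordered = sorted(indices, key=index_key)
print(ordered)  # [1, 2, None]
```

Using a key function keeps the original values in the result, so entries without an index end up at the end instead of breaking the sort.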