Closed mikegerber closed 3 years ago
In this example (thanks to @JKamlah!), OCR-D-GT_0008.xml contains corrections in the TextEquivs with the lowest index: larex-indexed-textequiv-jkamlah.zip
<TextLine id="l2">
<Coords points="301,270 1389,270 1389,306 301,306"/>
<TextEquiv index="0">
<Unicode>
sondere Schrift daraus zu machen. Locke scheint fort-
</Unicode>
</TextEquiv>
<TextEquiv index="1">
<Unicode>
gondere Schrift daraus zu machen. LDocke scheint fort—-
</Unicode>
</TextEquiv>
</TextLine>
@JKamlah I am going to implement it according to the PAGE specs, i.e. "take the TextEquiv with the lowest index (if there are multiple)". This also seems to be correct for your example. (Your code at https://github.com/JKamlah/dinglehopper selects by a user-specified index.) Do you see a problem with that approach that I might be missing?
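The rule quoted above ("take the TextEquiv with the lowest index") can be illustrated with a small standalone sketch. This is not dinglehopper's actual code: the helper names `local_name` and `preferred_text` are hypothetical, the sample is a trimmed version of the line from this issue, and treating a TextEquiv without an index as lowest priority is an assumption rather than something stated in the thread.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the TextLine in this issue; the two TextEquivs are
# deliberately listed out of order, and the uncorrected variant is shortened.
SAMPLE = """<TextLine id="l2">
  <Coords points="301,270 1389,270 1389,306 301,306"/>
  <TextEquiv index="1">
    <Unicode>gondere Schrift daraus zu machen. LDocke scheint fort</Unicode>
  </TextEquiv>
  <TextEquiv index="0">
    <Unicode>sondere Schrift daraus zu machen. Locke scheint fort-</Unicode>
  </TextEquiv>
</TextLine>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def preferred_text(line):
    """Return the text of the TextEquiv with the lowest index, or None."""
    equivs = [child for child in line if local_name(child.tag) == "TextEquiv"]
    if not equivs:
        return None

    def sort_key(equiv):
        # Assumption: a TextEquiv without an index sorts after indexed ones.
        index = equiv.get("index")
        return (index is None, int(index) if index is not None else 0)

    best = min(equivs, key=sort_key)
    for child in best:
        if local_name(child.tag) == "Unicode":
            return (child.text or "").strip()
    return None


line = ET.fromstring(SAMPLE)
print(preferred_text(line))
# prints: sondere Schrift daraus zu machen. Locke scheint fort-
```

Matching on the local tag name keeps the sketch independent of the PAGE namespace version a given file declares.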
Thank you @mikegerber for the quick response.
Do you see a problem with that approach that I might be missing?
No, not at all. It would fit our needs perfectly. The only reason to keep the index selection option would be to compare the corrected output with the original one. One use case: if you use ABBYY for OLR and keep its OCR text, you can easily compare it with the newly recognized text.
The only reason to keep the index selection option would be to compare the corrected output with the original one. One use case: if you use ABBYY for OLR and keep its OCR text, you can easily compare it with the newly recognized text.
I'd suggest keeping the ABBYY results and the manually corrected files in separate file groups and comparing those, e.g.
ocrd-dinglehopper -I OCR-ABBYY,OCR-ABBYY-CORRECTED -O OCR-ABBYY-CORRECTED-DIFF -P metrics false
This seems to make it a lot more explicit.
You are absolutely right, it is much more explicit. I suppose this is more of a fundamental question, isn't it? If I have multiple versions (indexes) in my file, I might need to compare them with each other, or to compare a specific index against another file. But how often will that happen, and should dinglehopper offer an option for these few cases?
To my knowledge, there is nothing in the PAGE specs that says the index is anything more than a preference order; it just happens that LAREX seems to produce files where we could select by index. Another tool might only add indexes where something changed. So I recommend copying the files to a named file group.
As for getting the correct TextEquivs, I have fixed this today and will merge!
PAGE specs: https://ocr-d.de/de/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextLineType.html#TextLineType_TextEquiv
@JKamlah wrote: (see https://github.com/qurator-spk/dinglehopper/issues/5#issuecomment-709986931)