qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Honor TextEquiv index #33

Closed: mikegerber closed this issue 3 years ago

mikegerber commented 3 years ago

https://ocr-d.de/de/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextLineType.html#TextLineType_TextEquiv

@JKamlah wrote:

This is due to the work with LAREX. We made some corrections with LAREX at line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and distinguished them by index: the original text got index 1 and the corrected text got index 0, while uncorrected lines got no index at all. I don't know whether that is a LAREX-specific procedure. Link to the LAREX example.

PAGE specs:

Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content.

(See https://github.com/qurator-spk/dinglehopper/issues/5#issuecomment-709986931)

mikegerber commented 3 years ago

In this example (thanks to @JKamlah!), OCR-D-GT_0008.xml contains corrections in the TextEquivs with the lowest index: larex-indexed-textequiv-jkamlah.zip

<TextLine id="l2">
  <Coords points="301,270 1389,270 1389,306 301,306"/>
  <TextEquiv index="0">
    <Unicode>
sondere Schrift daraus zu machen. Locke scheint fort-
    </Unicode>
  </TextEquiv>
  <TextEquiv index="1">
    <Unicode>
gondere Schrift daraus zu machen. LDocke scheint fort—-
    </Unicode>
  </TextEquiv>
</TextLine>

mikegerber commented 3 years ago

@JKamlah I am going to implement it according to the PAGE specs, i.e. "take the TextEquiv with the lowest index (if there are multiple)". This also seems to be correct for your example. (Your code at https://github.com/JKamlah/dinglehopper selects by a user-specified index.) Do you see a problem with this that I might be missing?
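For illustration, the "lowest index wins" rule could be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not the actual dinglehopper code. The function name `main_text` and the helper `local_name` are hypothetical, and two behaviours are assumptions that the PAGE specs quoted above do not settle: a TextEquiv without an index sorts after any indexed one, and surrounding whitespace is stripped from the result.

```python
import xml.etree.ElementTree as ET

# The TextLine from the example above (namespace omitted for brevity).
SAMPLE = """
<TextLine id="l2">
  <Coords points="301,270 1389,270 1389,306 301,306"/>
  <TextEquiv index="0">
    <Unicode>
sondere Schrift daraus zu machen. Locke scheint fort-
    </Unicode>
  </TextEquiv>
  <TextEquiv index="1">
    <Unicode>
gondere Schrift daraus zu machen. LDocke scheint fort—-
    </Unicode>
  </TextEquiv>
</TextLine>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def main_text(text_line):
    """Return the text of the TextEquiv with the lowest index (hypothetical helper)."""
    # Look at direct children only: nested Word/Glyph elements
    # carry their own TextEquivs.
    equivs = [child for child in text_line if local_name(child.tag) == "TextEquiv"]
    if not equivs:
        return ""

    def sort_key(text_equiv):
        index = text_equiv.get("index")
        # Assumption: a TextEquiv without an index sorts last.
        return (index is None, int(index) if index is not None else 0)

    best = min(equivs, key=sort_key)
    unicode_el = next((c for c in best if local_name(c.tag) == "Unicode"), None)
    if unicode_el is None or unicode_el.text is None:
        return ""
    # Assumption: whitespace around the text is not significant.
    return unicode_el.text.strip()


line = ET.fromstring(SAMPLE)
print(main_text(line))  # sondere Schrift daraus zu machen. Locke scheint fort-
```

With a single TextEquiv and no index (the uncorrected lines in the LAREX files), the same function simply returns that one text.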

JKamlah commented 3 years ago

Thank you @mikegerber for the quick response.

Do you see a problem with this that I might be missing?

No, not at all. It would fit our needs perfectly. The only reason to keep the index selection option would be to compare the corrected output with the original one. A use case would be this: if you use ABBYY for OLR and keep the OCR'd text, you can easily compare it with the newly recognized text.

mikegerber commented 3 years ago

The only reason to keep the index selection option would be to compare the corrected output with the original one. A use case would be this: if you use ABBYY for OLR and keep the OCR'd text, you can easily compare it with the newly recognized text.

I'd suggest keeping the ABBYY results and the manually corrected files in separate file groups and comparing those, e.g.

ocrd-dinglehopper -I OCR-ABBYY,OCR-ABBYY-CORRECTED -O OCR-ABBYY-CORRECTED-DIFF -P metrics false

This seems to make it a lot more explicit.

JKamlah commented 3 years ago

You are absolutely right, it is much more explicit. I mean, this is more of a fundamental question, isn't it? If I have multiple versions (indexes) in my file, I might need to compare them with each other, or to compare a specific index with another file. But how often will that happen, and should dinglehopper offer an option for these few cases?

mikegerber commented 3 years ago

There is, to my knowledge, nothing in the PAGE specs that says the index is anything more than a preference order; it just happens that LAREX seems to produce files where we could select by index. Another tool might just add indexes where something changed. So I'll recommend copying the files to a named file group.

As for getting the correct TextEquivs, I have fixed this today and will merge!