qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Warn if there is text missing in the ReadingOrder #59

Open mikegerber opened 3 years ago

mikegerber commented 3 years ago

For 00451941.gt.xml, dinglehopper-extract does not extract the header text "DE L'ESPRIT DE L'HOMME".

mikegerber commented 3 years ago

The header is in TextRegion r3, but the ReadingOrder only includes the main text in r1, so dinglehopper extracts only the main text. In other words, the file is buggy, not dinglehopper.

However, we can do better by emitting a warning whenever a region is not referenced in the ReadingOrder and is therefore missing from the extracted text.
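
A minimal sketch of what such a check could look like, not dinglehopper's actual code. It assumes a PAGE XML file where the ReadingOrder refers to regions through `regionRef` attributes. The function names and the sample document are hypothetical (the sample only mimics the r1/r3 situation described above, it is not 00451941.gt.xml), and only the standard library is used:

```python
import warnings
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def regions_missing_from_reading_order(page_xml):
    """Return ids of TextRegions that the ReadingOrder does not reference.

    Simplification (assumption): every TextRegion is checked by its own id,
    so a nested region is reported even if its parent region is referenced.
    """
    root = ET.fromstring(page_xml)
    text_region_ids = []
    referenced_ids = set()
    has_reading_order = False

    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "TextRegion" and elem.get("id"):
            text_region_ids.append(elem.get("id"))
        elif name == "ReadingOrder":
            has_reading_order = True
            # Collect every regionRef below ReadingOrder, whatever the group type.
            for ref in elem.iter():
                if ref.get("regionRef"):
                    referenced_ids.add(ref.get("regionRef"))

    if not has_reading_order:
        # Without a ReadingOrder there is nothing to compare against.
        return []
    return [rid for rid in text_region_ids if rid not in referenced_ids]


def warn_about_missing_regions(page_xml):
    """Emit one warning per TextRegion that is absent from the ReadingOrder."""
    missing = regions_missing_from_reading_order(page_xml)
    for rid in missing:
        warnings.warn(
            f"TextRegion {rid!r} is not in the ReadingOrder; "
            "its text will not be extracted"
        )
    return missing


# Hypothetical sample: r3 (header) exists but only r1 is in the ReadingOrder.
SAMPLE = """<?xml version="1.0"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
    <TextRegion id="r3"/>
  </Page>
</PcGts>
"""

if __name__ == "__main__":
    warn_about_missing_regions(SAMPLE)
```

Matching on local tag names rather than a fixed namespace keeps the sketch independent of the PAGE schema version. A real implementation would probably hook into the existing extraction code instead of re-parsing the file.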