qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Fix the extraction of text from Page with TableRegion #50

Closed: b2m closed this pull request 3 years ago

b2m commented 3 years ago

While experimenting with OCR-D workflows for tables, I noticed very high error rates reported by dinglehopper when using the find_table=true option of ocrd-tesserocr-recognize.

The cause was that dinglehopper did not handle OrderedGroupIndex entries inside the OrderedGroup element when extracting text regions. As a consequence, table regions were not considered for text extraction.

This pull request fixes that by recursively adding the text regions referenced inside an OrderedGroupIndex.
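
To illustrate the idea, here is a minimal sketch of a recursive reading-order walk. It is not the actual patch and does not use dinglehopper's internals: the helper names (`local_name`, `region_ids_in_order`, `extract_text`) and the sample document are hypothetical. It assumes the nested group appears in PAGE XML as an `OrderedGroupIndexed` element (the schema's name for what is called OrderedGroupIndex above) and that text lives in region-level `TextEquiv/Unicode`.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace version; real files may use a different one.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: one plain text region followed by a table with two cells.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <OrderedGroupIndexed index="1" id="g1" regionRef="t1">
          <RegionRefIndexed index="1" regionRef="c2"/>
          <RegionRefIndexed index="0" regionRef="c1"/>
        </OrderedGroupIndexed>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"><TextEquiv><Unicode>Heading</Unicode></TextEquiv></TextRegion>
    <TableRegion id="t1">
      <TextRegion id="c1"><TextEquiv><Unicode>Cell A</Unicode></TextEquiv></TextRegion>
      <TextRegion id="c2"><TextEquiv><Unicode>Cell B</Unicode></TextEquiv></TextRegion>
    </TableRegion>
  </Page>
</PcGts>"""


def local_name(elem):
    """Return the tag name without its '{namespace}' prefix."""
    return elem.tag.rsplit("}", 1)[-1]


def region_ids_in_order(group):
    """Collect regionRef ids from a reading-order group, descending into nested groups."""
    ids = []
    # Children are ordered by their 'index' attribute, not by document order.
    for child in sorted(group, key=lambda e: int(e.get("index", 0))):
        name = local_name(child)
        if name == "RegionRefIndexed":
            ids.append(child.get("regionRef"))
        elif name == "OrderedGroupIndexed":
            # The nested group (e.g. a table) contributes its own region refs.
            ids.extend(region_ids_in_order(child))
    return ids


def extract_text(xml_string):
    """Join the text of all text regions in reading order, one region per line."""
    root = ET.fromstring(xml_string)
    top = root.find(".//pc:ReadingOrder/pc:OrderedGroup", NS)
    # iter() also finds text regions nested inside a TableRegion.
    regions = {r.get("id"): r for r in root.iter("{%s}TextRegion" % NS["pc"])}
    texts = []
    for region_id in region_ids_in_order(top):
        region = regions.get(region_id)
        if region is None:
            continue  # the reference points at a non-text region
        unicode_el = region.find("pc:TextEquiv/pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            texts.append(unicode_el.text)
    return "\n".join(texts)


print(extract_text(SAMPLE))
```

Without the `OrderedGroupIndexed` branch, only the first region's text would be returned for this sample, which matches the symptom described above.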

mikegerber commented 3 years ago

This is on the agenda for 2021-01, sorry for not doing it earlier :) Merry Christmas!