ulb-sachsen-anhalt / ulb-groundtruth-eval-odem-ger

OCR Groundtruth ULB VD18 German Fraktur - OCR-D Phase III
https://ulb-sachsen-anhalt.github.io/ulb-groundtruth-eval-odem-ger/
Creative Commons Attribution Share Alike 4.0 International

non-printspace regions partially missing #8

Open bertsky opened 1 month ago

bertsky commented 1 month ago

I noticed that on some pages only segments within the print space are annotated, so there are no text regions for catch-words, page numbers, headers etc. There is only a Border annotation and no PrintSpace element, so this seems somewhat inconsistent. Also, it affects only some pages.

This is a problem if the data is used as structural GT to train segmentation models.

I could run an incremental segmentation to automatically "find" these segments and make a PR or visual comparison if you want.

M3ssman commented 1 month ago

@bertsky I encourage any kind of improvement to enhance data usability, but can you point me to an example? I'm not sure whether I got the issue right and how something within the PrintSpace can be marked without being a TextRegion.

bertsky commented 1 month ago

phys1278993

Here, in the footer of the page, the signature mark and page number are not annotated.

phys1290695

In this example, the running title in the header and the catch-word in the footer are not annotated.

In both cases, there is a Border element (more or less precisely) around the physical page (as it should be), but no PrintSpace element. The latter is only required at GT level 3, but in practice, having neither a PrintSpace element nor any segments outside the print space (headers/footers) makes the data difficult to use as layout training data.
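
To find candidate pages like the two above, something along these lines could work. This is only a rough sketch, not a tool from this repository: the function name `audit_page` and the set `MARGIN_TYPES` are made up for illustration, and it assumes that header/footer regions, where annotated, carry one of the standard PAGE `TextRegion/@type` values. A page flagged here may still be fine (not every page has a running title or catch-word), so the result is a list to review, not a verdict.

```python
import xml.etree.ElementTree as ET

# Assumption: these PAGE TextRegion types are the ones that usually
# sit outside the print space (headers/footers).
MARGIN_TYPES = {"header", "footer", "page-number", "catch-word", "signature-mark"}


def local(tag):
    """Strip the XML namespace so any PAGE schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def audit_page(xml_text):
    """Report Border/PrintSpace presence and header/footer regions of one PAGE-XML document."""
    root = ET.fromstring(xml_text)
    page = next((el for el in root.iter() if local(el.tag) == "Page"), None)
    if page is None:
        raise ValueError("no Page element found")
    # Border and PrintSpace are direct children of Page.
    children = {local(child.tag) for child in page}
    # Collect the type attribute of all text regions, including nested ones.
    region_types = [
        el.get("type", "") for el in page.iter() if local(el.tag) == "TextRegion"
    ]
    has_margin = any(t in MARGIN_TYPES for t in region_types)
    return {
        "has_border": "Border" in children,
        "has_printspace": "PrintSpace" in children,
        "has_margin_regions": has_margin,
        # Candidate for review: no PrintSpace and nothing typed as header/footer.
        "suspicious": "PrintSpace" not in children and not has_margin,
    }
```

Running `audit_page` over the text of every PAGE file and keeping those where `suspicious` is true would give a list of pages to check visually.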