Fix OCR-D - Githubissues

bertsky commented 1 month ago

This addresses two issues:

1. The @imageFilename references cannot be handled by OCR-D (core, browser, processors), because they are available neither in the filesystem nor (as URLs) in the METS. They have obviously been "inspired" by the @CONTENTIDS in the METS, but are not string-identical (and we have no tool support for that kind of correspondence anyway).
   - The fix is to write the corresponding URL ref from OCR-D-IMG into the PAGE file.
2. A few coordinate invalidities and inconsistencies remain, which may cause problems, depending on the use case.
   - The fix uses ocrd-segment-repair to run the PageValidator and attempt direct repairs wherever possible.
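To illustrate the first fix (copying the image URL from the OCR-D-IMG file group into the PAGE file), here is a minimal sketch using only Python's standard library. It is not necessarily how the PR did it: the helper names `image_url_for_page` and `set_image_filename`, the sample IDs, and the `example.org` URL are hypothetical, and the sketch assumes each physical page div has an `fptr` into the image file group.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def image_url_for_page(mets_root, page_id, file_grp="OCR-D-IMG"):
    """Return the xlink:href of the image that METS assigns to a physical page."""
    ns = {"mets": METS_NS}
    # Map file IDs in the image file group to their location URLs.
    urls = {}
    for grp in mets_root.iterfind(".//mets:fileGrp", ns):
        if grp.get("USE") != file_grp:
            continue
        for f in grp.iterfind("mets:file", ns):
            flocat = f.find("mets:FLocat", ns)
            if flocat is not None:
                urls[f.get("ID")] = flocat.get("{%s}href" % XLINK_NS)
    # Find the physical page div and take its first fptr into that group.
    path = ".//mets:structMap[@TYPE='PHYSICAL']//mets:div"
    for div in mets_root.iterfind(path, ns):
        if div.get("ID") != page_id:
            continue
        for fptr in div.iterfind("mets:fptr", ns):
            if fptr.get("FILEID") in urls:
                return urls[fptr.get("FILEID")]
    return None


def set_image_filename(pcgts_root, url):
    """Overwrite Page/@imageFilename, whichever PAGE namespace version is used."""
    ns_uri = pcgts_root.tag[1:].split("}")[0]  # namespace of the PcGts root
    page = pcgts_root.find("{%s}Page" % ns_uri)
    page.set("imageFilename", url)
    return pcgts_root


# Hypothetical sample data, only for demonstration.
METS_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL"
            xlink:href="https://example.org/images/00000001.jpg"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page">
        <mets:fptr FILEID="IMG_0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

PAGE_SAMPLE = """<PcGts
    xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="00000001" imageWidth="100" imageHeight="200"/>
</PcGts>"""

mets = ET.fromstring(METS_SAMPLE)
pcgts = ET.fromstring(PAGE_SAMPLE)
set_image_filename(pcgts, image_url_for_page(mets, "PHYS_0001"))
```

Note that re-serialising with ElementTree would change namespace prefixes and indentation, which is the kind of cosmetic difference a real changeset would want to keep small.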

My PR tries to minimise the changeset (by reproducing the same NS prefixes and indentations), but due to the nature of schema-generated serialised XML, some cosmetic differences remain.

If you wish, I can do the same with the other repos.

bertsky commented 1 month ago

Here is the report.zip of the PageValidator (ocrd workspace -d data/ger validate -s imagefilename -s pixel_density -s mets_fileid_page_pcgtsid --page-textequiv-consistency strict --page-strictness strict --page-coordinate-consistency both > report.xml).

M3ssman commented 1 month ago

Thank you for contributing! Before applying it, I will take a closer look at the report data and do some comparisons with a recent digital-eval to see whether it has measurable effects, since consistent QA is my main concern.

At first sight I noticed several ID errors, which usually indicate duplicates within the METS file, which of course shouldn't happen. Maybe there's an even more severe problem in the way the METS got assembled. As for the inconsistencies concerning image heights, it's as @stweil noted: they come from the image's footer, which gets ignored during OCR. I'm not sure whether this is a real problem.

bertsky commented 1 month ago

> At first sight I noticed several ID errors, which usually indicate duplicates within the METS file, which of course shouldn't happen. Maybe there's an even more severe problem in the way the METS got assembled.

It's only affecting the logical divs, but yes, these do clash (same logical div used for successive physical pages).

The other errors (see report here or repair log in the other repo) mostly concern coordinate consistency (words not properly contained in their line, or lines not contained in their region), invalid polygon paths, and textual consistency (the region level does not seem to have been projected from the line level after the last changes).
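To make the coordinate errors more concrete, here is a rough sketch of the two kinds of check, written against the PAGE `points` string format ("x1,y1 x2,y2 ..."). It is a deliberate simplification and not the PageValidator's implementation: the function names are made up for illustration, the path check only counts points and rejects negative values, and the containment check compares bounding boxes only. Bounding-box containment is a necessary condition for polygon containment, so a failure here indicates a real inconsistency, while a pass proves nothing.

```python
def parse_points(points):
    """Parse a PAGE 'points' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def is_plausible_polygon(pts):
    """Minimal path check: at least 3 points and no negative coordinates."""
    return len(pts) >= 3 and all(x >= 0 and y >= 0 for x, y in pts)


def bbox(pts):
    """Axis-aligned bounding box as (x_min, y_min, x_max, y_max)."""
    xs, ys = zip(*pts)
    return min(xs), min(ys), max(xs), max(ys)


def bbox_contains(outer, inner):
    """True if the bounding box of `inner` lies within that of `outer`."""
    ox0, oy0, ox1, oy1 = bbox(outer)
    ix0, iy0, ix1, iy1 = bbox(inner)
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


# Hypothetical coordinates: one text line and two words.
line = parse_points("10,10 200,10 200,50 10,50")
word_inside = parse_points("20,15 80,15 80,45 20,45")
word_sticking_out = parse_points("150,15 220,15 220,45 150,45")

print(bbox_contains(line, word_inside))        # True
print(bbox_contains(line, word_sticking_out))  # False
```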

> As for the inconsistencies concerning image heights, it's as @stweil noted: they come from the image's footer, which gets ignored during OCR. I'm not sure whether this is a real problem.

It's not, fortunately. Earlier we decided not to "look" at these attributes in core's image/coordinate functions, but always determine the actual image's size. So these only worry the validator.

ulb-sachsen-anhalt / ulb-groundtruth-eval-odem-ger

Fix OCR-D #7