Open bertsky opened 1 month ago
Here is the report.zip of the PageValidator (ocrd workspace -d data/ger validate -s imagefilename -s pixel_density -s mets_fileid_page_pcgtsid --page-textequiv-consistency strict --page-strictness strict --page-coordinate-consistency both > report.xml).
Thank you for contributing! Before applying, I will take a closer look at the report data and do some comparisons with recent digital-eval to see whether it has measurable effects, since consistent QA is my main concern.
At first sight I noticed several ID errors, which usually indicate duplicates within the METS file, which shouldn't happen of course. Maybe there's an even more severe problem in the way the METS got assembled. For the inconsistencies concerning image heights, it's as @stweil noted: they come from the image's footer, which gets ignored during OCR. I'm not sure whether this is a real problem.
> At first sight I noticed several ID errors, which usually indicate duplicates within the METS file, which shouldn't happen of course. Maybe there's an even more severe problem in the way the METS got assembled.
It's only affecting the logical divs, but yes, these do clash (same logical div used for successive physical pages).
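For anyone who wants to reproduce the ID clash outside the validator, here is a minimal sketch using only the Python standard library. It just counts repeated `ID` attribute values in a METS document; the sample document and its IDs (`log_1`, `log_2`) are made up for illustration, and this is not how the OCR-D validator itself is implemented.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(mets_xml):
    """Return each ID attribute value that occurs more than once, with its count."""
    root = ET.fromstring(mets_xml)
    counts = Counter(
        el.get("ID") for el in root.iter() if el.get("ID") is not None
    )
    return {id_: n for id_, n in counts.items() if n > 1}


# Hypothetical METS fragment: the same logical div ID is emitted twice.
sample = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="log_1" TYPE="issue"/>
    <mets:div ID="log_1" TYPE="issue"/>
    <mets:div ID="log_2" TYPE="article"/>
  </mets:structMap>
</mets:mets>"""

print(duplicate_ids(sample))
```

On a real workspace one would read the METS file from disk and pass its content to `duplicate_ids`.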
The other errors (see report here or repair log in the other repo) mostly concern coordinate consistency (words not properly contained in their line, or lines not contained in their region), invalid polygon paths, and textual consistency (the region level does not seem to have been projected from the line level after the last changes).
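To illustrate what "not properly contained" means, here is a rough sketch of a necessary-condition check on PAGE `Coords/@points` strings. It compares bounding boxes only, which is a deliberate simplification: a word whose bounding box leaves the line's bounding box cannot be contained in the line polygon, but passing this check does not prove true polygon containment. The function names and coordinates are made up for illustration and are not the validator's API.

```python
def parse_points(points):
    """Parse a PAGE points string such as '1,2 3,4 5,6' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bbox(points):
    """Axis-aligned bounding box as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def bbox_inside(inner, outer):
    """True if the bounding box of `inner` lies within that of `outer`."""
    ix0, iy0, ix1, iy1 = bbox(inner)
    ox0, oy0, ox1, oy1 = bbox(outer)
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


# Hypothetical coordinates: one word inside the line, one sticking out to the right.
line = parse_points("100,100 500,100 500,150 100,150")
word_ok = parse_points("110,105 200,105 200,145 110,145")
word_bad = parse_points("450,105 520,105 520,145 450,145")

print(bbox_inside(word_ok, line))   # True
print(bbox_inside(word_bad, line))  # False
```

A full check would compare the actual polygons (and reject paths with fewer than three points), for example with a geometry library.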
> For the inconsistencies concerning image heights, it's as @stweil noted: they come from the image's footer, which gets ignored during OCR. I'm not sure whether this is a real problem.
It's not, fortunately. Earlier we decided not to "look" at these attributes in core's image/coordinate functions, but always determine the actual image's size. So these only worry the validator.
This addresses two issues:
@imageFilename references cannot be handled by OCR-D (core, browser, processors), because they are neither available in the filesystem nor (as URLs) in the METS. They have obviously been "inspired" by the @CONTENTIDS in the METS, but are not string-identical (and we have no tool support for that kind of correspondence anyway).
My PR tries to minimise the changeset (by reproducing the same NS prefixes and indentations), but due to the nature of schema-generated serialised XML, some cosmetic differences remain.
If you wish, I can do the same with the other repos.