impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0
7 stars 2 forks source link

make semantics of coordinates more consistent #20

Closed mromanello closed 4 months ago

mromanello commented 5 years ago

As pointed out by @simon-clematide the meaning of the coordinates field (c in the Page JSON schema) it's not very consistent at the moment, due to the issues we had when converting coordinates in the Olive OCR format.

When refactoring, we need to make sure that:

The cc (converted coordinates) flag at the page level will still be required as it's needed by the IMP to decide whether for a given page is possible or not to highlight page regions (e.g. article segments).

simon-clematide commented 10 months ago

This should be fixed to keep the semantics of our schema clean and avoid pollution of the schema by some arbitrary data inconsistencies (what will be in c for unconverted coordinates is highly unclear and source dependent). It seems trivial to fix the situation. So, I still support my suggestion of not providing non-iiif compliant coordinates in the c property.

A new legacy oc property for original coordinates in cases where we suspect an issue with the coordinates keeps the semantics of c atomic and independent if they exist. The changes seem minimal.

If we decide to do so a change in the schema should be made. BTW, currently we define the cc boolean property twice instead of using a ref: https://github.com/impresso/impresso-schemas/blob/7570cafd4869b798fac8d2542421c1bc8744efeb/json/newspaper/page.schema.json#L16 https://github.com/impresso/impresso-schemas/blob/7570cafd4869b798fac8d2542421c1bc8744efeb/json/newspaper/contentitem.schema.json#L20

piconti commented 10 months ago

New coordinate error I just found while creating the IIIF phonebook documentation. Putting it here for when I do my deep dive in the coordinates issues.

I found that there is a coordinate error in BNL's code and in the canonical data (s3://canonical-data): width and height are exchanged. This is the case in the code, where height and width are exchanged when converting the coordinates. In the example I used for the IIIF issue (taken from s3://canonical-data): https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2ft8mg9c%2fpages%2f3/111,811,1146,1787/full/0/default.jpg, which should be: https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2ft8mg9c%2fpages%2f3/111,811,1787,1146/full/0/default.jpg.

I also checked on the App, and all seems fine in both prod and dev. I also checked the canonical bucket (s3://original-canonical-release), and rebuilt bucket (s3://canonical-rebuilt-release) that should have been used for the current release (jan 2020), as well as (s3://rebuilt-data) and the error is also in all 3.

I don't know how the error doesn't show on the app, probably a fix at some later stage or in the DB, but this means all the coordinates in BNL also need fixing. Note: The BNL links were already incorrect due to a change in the IIIF on BNL's side, so this change can be either be included in the patch, or we can re-ingest all once we have the new OCR (but it's undefined when it will arrive)