Closed mromanello closed 4 months ago
This should be fixed to keep the semantics of our schema clean and avoid pollution of the schema by some arbitrary data inconsistencies (what will be in c for unconverted coordinates is highly unclear and source dependent). It seems trivial to fix the situation. So, I still support my suggestion of not providing non-iiif compliant coordinates in the c property.
A new legacy oc
property for original coordinates in cases where we suspect an issue with the coordinates keeps the semantics of c
atomic and independent if they exist. The changes seem minimal.
If we decide to do so a change in the schema should be made. BTW, currently we define the cc boolean property twice instead of using a ref: https://github.com/impresso/impresso-schemas/blob/7570cafd4869b798fac8d2542421c1bc8744efeb/json/newspaper/page.schema.json#L16 https://github.com/impresso/impresso-schemas/blob/7570cafd4869b798fac8d2542421c1bc8744efeb/json/newspaper/contentitem.schema.json#L20
New coordinate error I just found while creating the IIIF phonebook documentation. Putting it here for when I do my deep dive in the coordinates issues.
I found that there is a coordinate error in BNL's code and in the canonical data (s3://canonical-data): width and height are exchanged. This is the case in the code, where height and width are exchanged when converting the coordinates. In the example I used for the IIIF issue (taken from s3://canonical-data): https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2ft8mg9c%2fpages%2f3/111,811,1146,1787/full/0/default.jpg, which should be: https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2ft8mg9c%2fpages%2f3/111,811,1787,1146/full/0/default.jpg.
I also checked on the App, and all seems fine in both prod and dev. I also checked the canonical bucket (s3://original-canonical-release), and rebuilt bucket (s3://canonical-rebuilt-release) that should have been used for the current release (jan 2020), as well as (s3://rebuilt-data) and the error is also in all 3.
I don't know how the error doesn't show on the app, probably a fix at some later stage or in the DB, but this means all the coordinates in BNL also need fixing. Note: The BNL links were already incorrect due to a change in the IIIF on BNL's side, so this change can be either be included in the patch, or we can re-ingest all once we have the new OCR (but it's undefined when it will arrive)
As pointed out by @simon-clematide the meaning of the coordinates field (
c
in the Page JSON schema) it's not very consistent at the moment, due to the issues we had when converting coordinates in the Olive OCR format.When refactoring, we need to make sure that:
c
are always consistent with the IIIF syntax (x,y,w,h
)oc
for 'original coordinates')The
cc
(converted coordinates) flag at the page level will still be required as it's needed by the IMP to decide whether for a given page is possible or not to highlight page regions (e.g. article segments).