impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

change in BNL's ARK-based URLs #103

Closed mromanello closed 5 months ago

mromanello commented 4 years ago

(to be fixed by August 31, 2020)

mromanello commented 4 years ago

Update: this has been hot-fixed in the dev and prod DBs (see code in this notebook), but it still needs to be fixed in the BNL importer as well as in the canonical data.

piconti commented 11 months ago

After further checking of the problem with @e-maud, we found that the wrong links remain in Solr, but that the correct ones are fetched from the DB for the interface. As a result, we will not re-ingest the BNL data to fix this issue for the end-of-2023 baby release. Instead, the canonical data will be patched to correct the URLs, which avoids an unnecessarily large re-processing.

In parallel, the BNL importer code was corrected (commit), and a specific test for the IIIF links was added to ensure the issue does not recur in future BNL canonical data.
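
For illustration, the kind of patch and check described above could look roughly like the sketch below. This is an assumption-laden example, not the actual importer or patch code: the `iiif` field name, both URL prefixes, and all function names are hypothetical placeholders, since the real old and new BNL URL forms are not spelled out in this thread.

```python
import copy

# Hypothetical placeholders: the real old and new BNL prefixes are not given here.
OLD_PREFIX = "https://iiif.old.example.org/ark:/"
NEW_PREFIX = "https://iiif.new.example.org/ark:/"


def patch_iiif_url(url: str) -> str:
    """Return the URL with the outdated prefix replaced, or unchanged otherwise."""
    if url.startswith(OLD_PREFIX):
        return NEW_PREFIX + url[len(OLD_PREFIX):]
    return url


def patch_page(page: dict, key: str = "iiif") -> dict:
    """Return a copy of a page record whose link under `key` (if any) is patched."""
    patched = copy.deepcopy(page)
    if isinstance(patched.get(key), str):
        patched[key] = patch_iiif_url(patched[key])
    return patched


def has_outdated_link(page: dict, key: str = "iiif") -> bool:
    """True if the page record still carries a link with the outdated prefix."""
    value = page.get(key)
    return isinstance(value, str) and value.startswith(OLD_PREFIX)
```

A test in this spirit would run `has_outdated_link` over freshly imported page records and fail if any of them returns `True`. Working on a copy in `patch_page` keeps the original record available for comparison.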

piconti commented 11 months ago

Update: Following the data sharing meeting with BNL, we now know that they have new OCR for their entire collection, including the data that we already had during Impresso 1. Since the issue has been patched in the DBs, the current setup works, and this patch/update would not affect the enrichment processing tasks (only the IIIF links would be modified), it was decided that the current canonical data will not be changed before the Dec. 2023 release. Instead, the BNL data will be fully re-ingested once the updated original data from BNL is available.

However, the number of CIs in the current data will still be checked to verify that it is consistent with the statistics computed by Matteo Romanello in Sept. 2020.