impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

Bugfix invalid ci metadata #127

Closed piconti closed 5 months ago

piconti commented 5 months ago

Overall Description

This pull request represents a substantial number of patches and small fixes, as well as some additions, which were implemented between October 2023 and March 2024.

In particular, the modifications brought by this branch aim to solve long-standing issues with the IIIF links of image content-items in Issues, which had been identified as part of issues #103, #104 and #105. Upon closer inspection, it was found that the situation varied significantly from importer to importer, as described in issue #117, which aims at addressing all these issues together. In addition, issue #74 about the wrong ordering of content-items was also targeted during this patching.

As part of the upcoming release (correcting the old data and adding newly obtained data), all the necessary patches and corrections were aggregated in this Google Sheets document, and some additional issues (#20, #126) ended up being opened or addressed through this patching. When data was simply patched instead of being re-ingested, the corresponding scripts or notebooks were also added for traceability.

Finally, since a data versioning approach was implemented before the updated data was generated, it was integrated into the text-importer core logic. This allowed us to track statistics on the generated canonical data and to identify potential problems with the data during re-ingestion (#116). Note that all the data generated or patched for the upcoming release was produced using this branch and follows the updated JSON schemas as described in this pull request. As a result, the changes to be merged here reflect the updates made to the data.

Precise changes and patches

Now for a more precise and exhaustive list of changes and patches:

Changes made to the code of the importers

  1. BNF, BNF-EN, RERO: Correction of the placement of iiif_link (inside m) and c (outside m) for content-items of type image in issues
    • #117, #104, #105

  2. BNF, BNF-EN, RERO, BNL: Correction of the value of iiif_link or c for content-items of type image in issues, by one of:
    • Changing to the image information IIIF URI
    • Correcting swapped height and width coordinates
    • Updating the ARK-based URIs following changes in the corresponding API
    • #117, #103, #104, #105, #20

  3. SWA, BNF, BNF-EN, FedGaz: Addition of the newly defined iiif_manifest_uri property to issues when the institution provides a IIIF presentation API
    • #117

  4. All importers: Replacement of the iiif property by iiif_img_base_uri in pages, and adaptation of its values to contain only the base URI (excluding info.json or [...]/default.jpg suffixes)
    • #117

  5. BNF, BNF-EN, BNL (Lux), RERO (2 & 3): Addition of the ro (reading order) property to the metadata (m) of the content-items in issues to improve the Table of Contents display on the interface.
    • #74

  6. generic_importer.py and core.py: Integration of the manifest instantiation and computation. Now, whenever data is ingested with the text-importer, a corresponding data manifest is generated along with it and uploaded to the corresponding S3 bucket.
    • #116, pycommons issues #81 and #83
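
To illustrate change 4 above, here is a minimal sketch of how a page's old iiif value could be reduced to a base URI. It is not the code from this branch: the helper names (to_img_base_uri, migrate_page) and the example.org URLs are hypothetical, and it assumes the values follow the IIIF Image API URI layout, where an image request ends in /{region}/{size}/{rotation}/{quality}.{format}.

```python
import re

# Trailing segments of a IIIF Image API image request:
# /{region}/{size}/{rotation}/{quality}.{format}
_IMAGE_REQUEST_SUFFIX = re.compile(
    r"/[^/]+/[^/]+/[^/]+/(?:default|color|gray|bitonal)\.\w+$"
)


def to_img_base_uri(uri: str) -> str:
    """Strip an info.json or image-request suffix, leaving the base URI."""
    uri = uri.rstrip("/")
    if uri.endswith("/info.json"):
        return uri[: -len("/info.json")]
    # Leaves the URI unchanged when it is already a base URI.
    return _IMAGE_REQUEST_SUFFIX.sub("", uri)


def migrate_page(page: dict) -> dict:
    """Hypothetical helper: rename `iiif` to `iiif_img_base_uri` in a page dict."""
    migrated = {key: value for key, value in page.items() if key != "iiif"}
    if "iiif" in page:
        migrated["iiif_img_base_uri"] = to_img_base_uri(page["iiif"])
    return migrated


if __name__ == "__main__":
    old_page = {"id": "page-1", "iiif": "https://example.org/iiif/abc/info.json"}
    print(migrate_page(old_page))
```

Keeping only the base URI lets consumers derive both the info.json document and any image request from a single stored value.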

Patches implemented as scripts or added to correct existing problems

  1. RERO1 - Olive: Correction of the coordinates by means of a rescaling, as described here
    • #126 (Note that not all titles with problems were fixed. In particular LES and EXP faced other more complex problems interacting with the wrong coordinates, which will be fixed at another time)
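
The exact rescaling procedure is documented in the linked description and is not reproduced here. As a rough illustration of the general idea only, the sketch below rescales a bounding box from one image resolution to another. It assumes boxes are stored as [x, y, width, height] and that both the source and target image sizes are known; the function name and parameters are hypothetical.

```python
def rescale_box(box, src_size, dst_size):
    """Rescale an [x, y, w, h] box from src_size to dst_size.

    src_size and dst_size are (width, height) tuples in pixels.
    """
    x, y, w, h = box
    scale_x = dst_size[0] / src_size[0]  # horizontal scaling factor
    scale_y = dst_size[1] / src_size[1]  # vertical scaling factor
    return [
        round(x * scale_x),
        round(y * scale_y),
        round(w * scale_x),
        round(h * scale_y),
    ]


if __name__ == "__main__":
    # A box measured on a 1000x2000 image, mapped onto a 2000x4000 image.
    print(rescale_box([100, 200, 50, 80], (1000, 2000), (2000, 4000)))
```

Horizontal and vertical factors are computed separately so that the sketch also covers images whose aspect ratio differs between the two resolutions.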

Additionally, note that patch 4 (content-item - article matching for BNL) could not be handled as part of this PR as it would probably require a substantial rethinking of the importer's logic. This will be tackled in a future PR once the BNL data has arrived.

Based on all these changes, the version was updated to 1.1.0. This PR closes issues: #103, #104, #105, #117, #74, #20, #116

piconti commented 5 months ago

To build the docs correctly with autodoc, the data versioning branch of impresso-commons had to be specified explicitly in requirements.txt. This should be reverted as soon as the branch is merged into master in the impresso-pycommons repository.