Closed piconti closed 5 months ago
To build the docs correctly with autodoc, the data versioning branch of impresso-commons had to be specified explicitly in requirements.txt. This should be removed and reverted as soon as the branch is merged into master in the impresso-pycommons repository.
Overall Description
This pull request comprises a substantial number of patches and small fixes, as well as some additions, implemented between October 2023 and March 2024.
In particular, the modifications brought by this branch aim to solve long-standing issues with the IIIF links of image content-items in Issues, which had been identified as part of issues #103, #104 and #105. On closer inspection, the situation was found to vary significantly from importer to importer, as described in issue #117, which aims at addressing all these issues together. In addition, issue #74 about the wrong ordering of content-items was also targeted during this patching.
As part of the upcoming release (correcting the old data and adding newly obtained data), all the necessary patches and corrections were aggregated in this Google Sheets document, and some additional issues (#20, #126) ended up being opened or addressed through this patching. When data was simply patched instead of being re-ingested, the corresponding scripts or notebooks were also added for traceability.
Finally, since a data versioning approach was implemented before the generation of the updated data, it was integrated into the text-importer core logic. This allowed us to track statistics on the generated canonical data and identify potential problems with the data during re-ingestion (#116). Note that all the data generated or patched for the upcoming release was produced using this branch and follows the updated JSON schemas as described in this pull request. As a result, the changes to be merged here reflect the updates made to the data.
Precise changes and patches
Now for a more precise and exhaustive list of changes and patches:
Changes made to the code of the importers
iiif_link (inside m) and c (outside m) for content-items of type image in issues (#117, #104, #105)
iiif_link or c for content-items of type image in issues, one of: #117, #103, #104, #105, #20
Adding the iiif_manifest_uri property to issues when the institution provides a IIIF presentation API (#117)
Replacing the iiif property by iiif_img_base_uri in pages, and adapting their values to only be the URI base (excluding info.json or [...]/default.jpg suffixes) (#117)
Adding the ro (reading order) property to the metadata (m) of the content-items in issues to improve the Table of Contents display on the interface (#74)
generic_importer.py and core.py: Integration of the manifest instantiation and computation. Now, whenever data is ingested with the text-importer, a corresponding data manifest is generated along with it and uploaded to the corresponding S3 bucket (#116, pycommons issues #81 and #83)
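To illustrate the iiif_img_base_uri and ro changes listed above, here is a minimal Python sketch. It is not the importers' actual code: the helper names, the regular expression and the example URIs are assumptions made for illustration, and it only assumes that a suffix is either info.json or an image request of the form [...]/{region}/{size}/{rotation}/default.jpg, and that ro is an integer stored under m.

```python
import re

# Assumed suffix shapes (illustrative, not taken from the importers):
# an Image API "info.json" document, or an image request ending in
# "/{region}/{size}/{rotation}/default.jpg".
_IIIF_SUFFIX = re.compile(r"/(?:info\.json|[^/]+/[^/]+/[^/]+/default\.jpg)$")


def to_iiif_img_base_uri(uri: str) -> str:
    """Hypothetical helper: strip info.json or [...]/default.jpg suffixes."""
    return _IIIF_SUFFIX.sub("", uri.rstrip("/"))


def sort_by_reading_order(content_items: list) -> list:
    """Hypothetical helper: order content-items by m.ro.

    Items without an ro value are placed last, keeping their relative order.
    """
    return sorted(
        content_items,
        key=lambda ci: ci.get("m", {}).get("ro", float("inf")),
    )


if __name__ == "__main__":
    # example.org URIs are placeholders, not real IIIF endpoints.
    print(to_iiif_img_base_uri("https://example.org/iiif/page-1/info.json"))
    items = [{"id": "a", "m": {"ro": 2}}, {"id": "b", "m": {"ro": 1}}]
    print([ci["id"] for ci in sort_by_reading_order(items)])
```

Storing only the URI base lets a consumer build either the info.json URL or a cropped image request from the same value, by appending the suffix it needs.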
Patches implemented as scripts or added to correct existing problems
#126 (Note that not all titles with problems were fixed. In particular, LES and EXP faced other more complex problems interacting with the wrong coordinates, which will be fixed at another time.)
Additionally, note that patch 4 (content-item - article matching for BNL) could not be handled as part of this PR as it would probably require a substantial rethinking of the importer's logic. This will be tackled in a future PR once the BNL data has arrived.
Based on all these changes, the version was updated to 1.1.0. This PR closes issues: #103, #104, #105, #117, #74, #20, #116