impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

[Various Importers] Inconsistencies in iiif links and coordinates in code and JSONs #117

Closed piconti closed 4 months ago

piconti commented 11 months ago

This issue is directly linked to issues #104 and #105, but a first investigation of the current situation shows that multiple importers are inconsistent in where and how they create the iiif links and coordinates. A general issue covering all importers therefore seemed more appropriate. Sorry in advance, it is very long, but I felt that having the detailed situation written down somewhere would help give a broader view.

Context

The two previous issues #104 and #105 highlighted that some importers generated invalid Issue JSONs for image content items. In particular the iiif-links were not correct, and depending on the importer, iiif-links or coordinates were misplaced in the content-item object in the output JSON.

Current state of the importer code and JSON files

1. BNF

Summary

| Importer | CI iiif links to | iiif_link placement | c placement | Page iiif links to |
|---|---|---|---|---|
| BNF | JPG image of specific CI | inside metadata | None, missing | a "full" gallica manifest |
| BNF-EN | JPG image of specific CI | outside metadata | inside metadata | JPG image of the full page |
| RERO | JPG image of specific CI (manifest in the s3) | outside metadata (inside in the s3) | inside metadata (outside in the s3) | Page manifest |
| Lux/BNL | JSON page manifest (but faulty) | inside metadata | outside metadata | JSON manifest (but faulty) |
| SWA | N/A | N/A | N/A | unknown, is faulty |
| Olive | N/A | N/A | outside metadata | Page manifest |
| Tetml | N/A | N/A | N/A | Page manifest |

What is expected

Proposed Approach

(feedback welcome)

piconti commented 10 months ago

Small update regarding SWA:

Notes and discussions from Impresso I were found and contain the following information: (NB: it's not assured this information is still correct).

Unfortunately, the available information does not mention that the links do not work.

We can reach another JSON manifest by adding ".tif" to the end of the current links. Adding ".jp2" or ".png" yields 403 errors again, so it is likely that only authenticated users can access them. It is still unclear whether the links in the canonical data were left as-is on purpose, or were supposed to be modified to map to the JPEG 2000 files.

simon-clematide commented 10 months ago

Thanks a lot for the thorough review of this confusing situation. Here are a few comments and suggestions:

piconti commented 10 months ago

Thank you for the feedback and information:

simon-clematide commented 10 months ago

News from SWA (translated from German): They actually adopted the newest standards... the contact person is martin.reisacher@unibas.ch. I think that, from our side, nothing speaks against ark persistent identifiers for the image pages. As long as we can map them from the old ones, it seems a trivial one-time change.

Dear Simon, dear Elias,

I wanted to earn the badge (at least halfway) :-) The redirection to info.json doesn't work correctly without the image end. I would have to have a look at this, but it could take a while; I'm not sure whether sipi or our resolver is buggy here. With info.json or .jpx it works, and the direct image links also work. This was probably not noticed because it worked / was tested in the universal viewer at the time.

info.json without image link (refers to the jpx version): https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746/info.json
Image: https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746/full/max/0/default.jpg
With jpx extension: https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746.jpx

The references for the XML files were missing in the resolver. These should now work, but I have only made random samples:
https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000059110_1952_1121.xml
https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000059110_1908_0033.xml
https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746.xml

In general, the Impresso project is only on my radar at the moment. We are currently rebuilding our IIIF infrastructure and would also like to introduce persistent identifiers (arks) to ensure better persistence. We are also constantly considering whether there are ways to publish the data differently (as there are always

piconti commented 10 months ago

Thank you Simon for this update and Martin's contact.

I also agree there is no issue in changing to ark identifiers, especially before SWA data is on the interface. I can contact Martin directly to find out when this change to ark identifiers would take place, so that we can decide on the best approach.

It is also good to know which suffix is necessary to make the current links work (in particular that SWA needs /full/max/0... instead of /full/full/0...). This means the rebuilder needs to take into account which title it is rebuilding when constructing the final link.

simon-clematide commented 10 months ago

If the iiif servers have different URLs, we should maybe add a property to the data that specifies how to access info or images. There are probably only a few ways.

piconti commented 10 months ago

Yes there are a limited number of ways, but all the necessary information is already present in the canonical data, so it should not be necessary to add a new property.

I just updated the rebuilder's approach to constructing the image iiif links so that small inconsistencies don't break it, and to change the suffix used if necessary.
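A minimal sketch of what such suffix handling could look like. This is not the rebuilder's actual code: the function name, the `SIZE_BY_PROVIDER` mapping and the default are hypothetical, and only the SWA entry (`max` instead of `full`) comes from the discussion above.

```python
# Hypothetical mapping from provider to the IIIF "size" segment it accepts.
# Only the SWA entry is taken from this thread; everything else is assumed.
SIZE_BY_PROVIDER = {"SWA": "max"}
DEFAULT_SIZE = "full"


def build_full_image_url(base_uri: str, provider: str) -> str:
    """Build a full-page image URL from a stored page IIIF URI."""
    base = base_uri.rstrip("/")
    # Tolerate a small inconsistency: URIs stored up to ".../info.json".
    if base.endswith("/info.json"):
        base = base[: -len("/info.json")]
    size = SIZE_BY_PROVIDER.get(provider, DEFAULT_SIZE)
    return f"{base}/full/{size}/0/default.jpg"


print(build_full_image_url(
    "https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746", "SWA"
))
```

Keeping the provider-specific part in one lookup table means a new title only requires a new entry, which fits the remark that there are only a limited number of ways to access the images.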

e-maud commented 10 months ago

Many thanks for this very detailed and useful investigation :)

It's good to have a fresh set of eyes on it, and a good time to refactor some of the code that has evolved organically; back in the day, no institution had a IIIF endpoint, and then it happened more and more, leading to code changes and a not-so-clean situation that now needs to be updated.

About the iiif_link property

My first thought is that we could specify the iiif_link property further (independently of where it is placed). It seems to have gone a bit under the radar, but manifest and info files are 2 different things in the IIIF APIs.
Just to be aligned (sorry if already known):

The fact that both info.json and manifest information end up in the same JSON property should be corrected. A suggestion:

Institutions may differ w.r.t how they expose and structure their newspaper collections via the Presentation API (i.e. what is a collection for them), but this can be found.

What we use (or not) and where

Presently there are two main usages of IIIF image links:

1. For the interface viewer component.

(a) For images served via the impresso image viewer, the URL construction is always the same, and Daniele knows how to build the {identifier} part of the URL given a set of information retrieved from mysql and solr. Depending on what is displayed (an article or a page), the URL is built. I think that the info.json is used by the viewer to get resolution information and adapt. In any case, the base image URL is the same for the info.json or the image itself (one of the nice things about the Image API).

To build the {identifier}, the newspaper structure (title>issues>pages>article) is found in MySQL. To know which part of an image to show, the coordinates of token, lines, regions, and articles are found in SOLR main index.

(b) For images served via other IIIF servers, where the URL is built from (unknown) arks, Daniele cannot guess the base URL, so it is stored alongside each page in MySQL. Concretely speaking, the full URL up to .../info.json is stored (comment on this below).

Three remarks here:

2. For image (=pictures on the scans) processing.

IIIF image links of image objects are stored in SOLR in the image collection. A picture is better:

image

Recap

What is the Page iiif link expected to link to?

To the image URL of the Image API. I think the base part suffices: {scheme}://{server}/{prefix}/{identifier}/.
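To illustrate why the base part is enough, here is a small sketch that derives both the info.json URL and a cropped-region image URL from it, following the IIIF Image API pattern {region}/{size}/{rotation}/{quality}.{format}. The function names and the example base URI are hypothetical, and the coordinates are assumed to be [x, y, width, height] in pixels; whether the size segment should be "max" or "full" depends on the server.

```python
from typing import Sequence


def info_json_url(base_uri: str) -> str:
    """URL of the image information document for a base image URI."""
    return f"{base_uri.rstrip('/')}/info.json"


def region_image_url(base_uri: str, coords: Sequence[int], size: str = "max") -> str:
    """URL of a cropped region; coords assumed to be [x, y, width, height]."""
    x, y, w, h = coords
    return f"{base_uri.rstrip('/')}/{x},{y},{w},{h}/{size}/0/default.jpg"


# Hypothetical base URI, for illustration only.
base = "https://example.org/iiif/page-0001/"
print(info_json_url(base))
print(region_image_url(base, [10, 20, 300, 400]))
```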

Was a patch made for the canonical RERO data to modify it, or was it re-ingested (if so with which version of the code, I have found none that fits the current format in s3://canonical-data)?

Joker for now :)

Where can I find information about the iiif page links for SWA data?

I think Simon answered, but you can also contact Martin.

For BNF data, should the "manifest.json" or "info.json" be preferred and kept?

The info.json. I am not sure why the manifest was stored, I suppose because of terminology confusion.

And from slack:

The issues' iiif manifest urls are used in the Solr, anywhere else? What information is expected to be in the manifest?

Presently we do not use issue manifest information. But I'd say it would be good to store them in canonical. What is expected to be there is information about the digital object (of type metadata).

Where are the pages' iiif links used? Should they map directly to the image of the page or also to a manifest?

Same response as above + no, no link to a manifest.

Thank you for reaching the end of this comment :sweat_smile: - happy to discuss this orally to really be sure.

simon-clematide commented 10 months ago

My last comment is a question: would adding an IIIF major version to our data allow us to predict the relevant URL for retrieving image information? I think the canonical format should be self-contained (meaning not relying on external DBs), given that the interface is only one way of presenting our data. APIs are another.

piconti commented 10 months ago

Following our meeting discussing this, here are the decisions that were taken regarding iiif URIs:

In the Issue Canonical Schema:

In the Page Canonical Schema:

Documentation: Create a "IIIF Phonebook" documenting all IIIF endpoints/links etc for the various providers

Side-Note: Since multiple additions and modifications will be made to almost all existing canonical data, this may be a good opportunity to actually place the c property of image content items correctly (i.e. inside the metadata, with the iiif_link). Initial discussions/decisions leaned towards keeping it outside, since that is how downstream code expected it and all the already-ingested data concerned (RERO, Olive, BNL) already had it outside of the metadata. Given the conclusions of the current discussion, I think it makes sense to reconsider this. Pros:

Cons:

e-maud commented 10 months ago

Some conclusions regarding IIIF information:

Issue level:
Addition of the manifest URI for each issue, when the institution exposes the IIIF presentation API (Presentation API 3.0 manifest documentation).

Page level:
As before, storage of the image URI (from the Image API), but in a more consistent way.

Content item level

Collection level:
No link toward the collection representation, since it can be retrieved from the manifest URI (within).

Version of the IIIF API:
No need to store it; it can be known by the client via the @context key, which should be present in the response.
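As an illustration of the last point, a client could read the API and major version from the @context value of an already-fetched info.json or manifest. This is only a sketch: the function name is hypothetical, and it assumes the standard IIIF context URIs of the form http://iiif.io/api/{image|presentation}/{version}/context.json, where @context may be either a string or a list.

```python
import re
from typing import Optional, Tuple

# Matches the standard IIIF context URIs for the Image and Presentation APIs.
_IIIF_CONTEXT = re.compile(r"iiif\.io/api/(image|presentation)/(\d+)/context\.json")


def iiif_api_version(document: dict) -> Optional[Tuple[str, int]]:
    """Return (api, major_version) from a parsed IIIF JSON response, or None."""
    context = document.get("@context", [])
    if isinstance(context, str):
        context = [context]
    for entry in context:
        if isinstance(entry, str):
            match = _IIIF_CONTEXT.search(entry)
            if match:
                return match.group(1), int(match.group(2))
    return None


print(iiif_api_version({"@context": "http://iiif.io/api/image/2/context.json"}))
```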

Some information on the frontend

e-maud commented 10 months ago

Really sorry @piconti, I did not refresh my page and did not see your summary before commenting, so now mine kind of duplicates yours. But better more than not enough information (!)

piconti commented 10 months ago

No problem! Having both our summaries highlights that we were not 100% clear on the iiif link for the image content items in the Issue Canonical data: do we construct them directly, all the way up to the default.jpg?

simon-clematide commented 10 months ago

To add my 5 cents: I would rather go for the image_base_uri consistently.

piconti commented 8 months ago

A google sheet document was created summarizing all problems that are currently in the data, including regarding this issue, as well as the fixing approach for each.

After discussion, it was decided that the mentioned properties iiif_manifest_uri and iiif_img_base_uri would only be added when necessary to the data. As a result, the following modifications will be done to the canonical data (and rebuilt data):

Upcoming new data ingestions (BCUL, BL, ONB, KB) will have both properties.

simon-clematide commented 8 months ago

@piconti Just to be sure: Will every JSON of a page have a valid iiif URI in the end (meaning that, e.g. for re-OCRing, we only need access to the rebuilt pages, and do not need to look things up in MySQL)?

piconti commented 8 months ago

Currently, no IIIF links are in the rebuilt data at all (I should have specified that the rebuilt data will be modified when applicable, here in the cases of reingestion).

However, yes, all pages (in canonical format) will have a valid IIIF URI, either in the original iiif property or in the new iiif_img_base_uri (which should be preferred to iiif whenever available).

Is there an existing approach to re-OCRing, or other uses, for which you would need the page iiif URI? If yes, the modifications to your current approach should be minimal if you fetch it from the canonical data (conditionally change the property you fetch the URI from based on the presence of iiif_img_base_uri), and none otherwise. If not, and you would need the page's iiif link to be present in the rebuilt data, we could discuss this specifically. (We could for instance add the iiif_img_base_uri to the ppreb property of the rebuilt schema, but it would mean substantial patching, as this is not present anywhere at the moment.)
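A minimal sketch of that conditional lookup on a canonical page JSON, assuming the page has already been loaded into a dict. The helper name and the example values are hypothetical; the two property names are the ones discussed above.

```python
from typing import Optional


def page_iiif_uri(page: dict) -> Optional[str]:
    """Prefer iiif_img_base_uri when present, otherwise fall back to iiif."""
    return page.get("iiif_img_base_uri") or page.get("iiif")


# Hypothetical page objects, for illustration only.
old_page = {"iiif": "https://example.org/iiif/old-page"}
new_page = {
    "iiif": "https://example.org/iiif/old-page",
    "iiif_img_base_uri": "https://example.org/iiif/new-page",
}
print(page_iiif_uri(old_page))
print(page_iiif_uri(new_page))
```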

simon-clematide commented 8 months ago

Oops, I meant canonical pages, sorry for the confusion.