[Various Importers] Inconsistencies in iiif links and coordinates in code and JSONs

This issue is directly linked to issues #104 and #105, but a first investigation of the current situation shows that multiple importers are inconsistent in where and how they create the iiif links and coordinates. A general issue covering all importers therefore seemed more appropriate. Sorry in advance, it is very long, but I felt that having the detailed situation written down somewhere would help to get a broader view.

Context

The two previous issues #104 and #105 highlighted that some importers generated invalid Issue JSONs for image content items. In particular the iiif-links were not correct, and depending on the importer, iiif-links or coordinates were misplaced in the content-item object in the output JSON.

Current state of the importer code and JSON files

1. BNF

Importer Code - Issue
- iiif_link contains the link to the JPG image (with suffix={coords}/full/0/default.jpg instead of info.json)
- iiif_link is placed inside the content item's metadata (ci['m']['iiif_link'])
- The coordinates c are not present at all in the content item
- Eg. taken from s3://canonical-data/excelsior/issues/excelsior-1910-issues.jsonl.bz2:
```
{
"m": {
   "id": "excelsior-1910-11-16-a-i0161",
   "tp": "image",
   "pp": [12],
   "t": "Publicité",
   "iiif_link": "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f12/2753,4963,3010,3273/full/0/default.jpg"
}}
```
Importer Code - Page
- The page's iiif maps to a "full" manifest, using IIIF_MANIFEST_SUFFIX = "full/full/0/manifest.json" (which yields different results compared to using IIIF_MANIFEST_SUFFIX = "info.json").
- Eg. taken from s3://canonical-data/excelsior/pages/excelsior-1910/excelsior-1910-11-16-a-pages.jsonl.bz2: "iiif": "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/full/full/0/manifest.json" (compared to "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/info.json".)
2. BNF-EN
Importer Code - Issue
- iiif_link contains the link to the JPG image (with suffix={coords}/full/0/default.jpg instead of info.json)
- iiif_link is placed outside the content item's metadata (ci['iiif_link'])
- c is placed inside the content item's metadata (ci['m']['c'])
- Eg. taken from s3://canonical-data/jdpl/issues/jdpl-1814-issues.jsonl.bz2:
```
{
"m": {
      "id": "jdpl-1814-05-27-a-i0014", 
      "tp": "table", "pp": [4], "t": "Untitled", 
      "c": [155, 1515, 1036, 182]
}, 
"l": { [legacy content, irrelevant here]}, 
"iiif_link": "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/155,1515,1036,182/full/0/default.jpg"
}
```
Importer Code - Page
- The page's iiif maps to the JPG image of the full page, using IIIF_SUFFIX = "full/full/0/default.jpg".
- Eg. taken from s3://canonical-data/jdpl/pages/jdpl-1814/jdpl-1814-05-27-a-pages.jsonl.bz2 for the page corresponding to above example: "iiif": "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/full/full/0/default.jpg" (compared to "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/info.json".)
3. RERO
Importer Code - Issue
- iiif_link contains the link to the JPG image (with suffix={coords}/full/0/default.jpg instead of info.json)
- iiif_link is placed outside the content item's metadata (ci['iiif_link'])
- c is placed inside the content item's metadata (ci['m']['c'])
- Eg. taken from s3://playground-pauline/VHT/issues/VHT-1939-issues.jsonl.bz2 (recently generated canonical data using master branch):
```
{
"m": {
      "id": "VHT-1939-01-06-a-i0010",
      "tp": "image", "pp": [4], "t": "Untitled",
      "c": [172, 1618, 201, 226]
}, 
"l": { [legacy content, irrelevant here]}, 
"iiif_link": "https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0004/172,1618,201,226/full/0/default.jpg"
}
```
- Note that, while the code matches the description given in issue #105, the contents of some JSON files observed as examples in s3://canonical-data don't have the same object structure, and no code performing a patch or correcting this issue was found. The differences are the following:
  - iiif_link contains the link to the manifest (with suffix=info.json), and is placed inside the content item's metadata (ci['m']['iiif_link'])
  - c is placed outside the content item's metadata (ci['c'])
  - Eg. taken from s3://canonical-data/VHT/issues/VHT-1939-issues.jsonl.bz2:
```
{
"m": {
  "id": "VHT-1939-01-06-a-i0010", 
  "tp": "image", "pp": [4], "t": "Untitled", 
  "iiif_link": "https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0004/info.json"}, 
"l": {[legacy content, irrelevant here]}, 
"c": [172, 1618, 201, 226]
}
```
Importer Code - Page
- The page's iiif maps to a manifest, using the page's id as suffix to the impresso URL endpoint.
- Eg. taken from s3://canonical-data/VHT/pages/VHT-1939/VHT-1939-01-06-a-pages.jsonl.bz2 for the page corresponding to above example: "iiif": "https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0001" (this is unchanged in the recently generated data.)
4. BNL/Lux
Importer Code - Issue
- iiif_link contains the link to the JSON manifest (with suffix=info.json)
  - Note that from issue #103, the current iiif links for the BNL data are incorrect, and the intent is to correct them directly, without reingesting all the data.
- iiif_link is placed inside the content item's metadata (ci['m']['iiif_link'])
- c is placed outside the content item's metadata (ci['c'])
- Eg. taken from s3://canonical-data/tageblatt/issues/tageblatt-1913-issues.jsonl.bz2:
```
{
"m": {
      "id": "tageblatt-1913-08-09-a-i0033",
      "pp": [3], "tp": "image", "t": "rers Adam. Rechts im Hintergrund das Cham- pagnerzelt der Firma E. Mercier.",
      "iiif_link": "https://iiif.eluxemburgensia.lu/iiif/2/ark:%2f70795%2ft8mg9c%2fpages%2f3/info.json"
}, 
"l": { [legacy content, irrelevant here]}, 
"c": [111, 811, 1146, 1787]
}
```
Importer Code - Page
- The page's iiif maps to the page's JSON manifest. It should also be corrected (issue #103).
- Eg. taken from s3://canonical-data/VHT/pages/VHT-1939/VHT-1939-01-06-a-pages.jsonl.bz2 for the page corresponding to above example: "iiif": "https://iiif.eluxemburgensia.lu/iiif/2/ark:%2f70795%2ft8mg9c%2fpages%2f3/info.json" (corrected version: "https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2ft8mg9c%2fpages%2f3/info.json")
5. SWA
Importer Code - Issue
- N/A, No iiif links or coordinates are added to the issue JSON
Importer Code - Page
- The page's iiif uses the filename as suffix, using IIIF_ENDPOINT_URL = "https://ub-sipi.ub.unibas.ch/impresso", but it does not seem to work.
- Eg. taken from s3://canonical-data/arbeigeber/pages/arbeitgeber-1907-01-05-a-pages.jsonl.bz2 : "iiif": "https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000059110_1907_0001".
6. Olive
Importer Code - Issue
- No iiif links are added to the issue JSON
- c is placed outside the content item's metadata (ci['c'])
- Ideally, the Olive importer is not modified and the data is not re-ingested.
Importer Code - Page
- The page's iiif maps to a manifest, using the page's id as suffix to the impresso URL endpoint. Same as for RERO.
7. TETML (FedGaz)
Importer Code - Issue
- N/A, No iiif links or coordinates are added to the issue JSON
Importer Code - Page
- The page's iiif maps to a manifest, using the page's id as suffix to the impresso URL endpoint. Same as for RERO.

Summary

| Importer | CI iiif links to | `iiif_link` placement | `c` placement | Page iiif links to |
|---|---|---|---|---|
| BNF | JPG image of specific CI | inside metadata | None, missing | a "full" gallica manifest |
| BNF-EN | JPG image of specific CI | outside metadata | inside metadata | JPG image of the full page |
| RERO | JPG image of specific CI (manifest in the s3) | outside metadata (inside in the s3) | inside metadata (outside in the s3) | Page manifest |
| Lux/BNL | JSON page manifest (but faulty) | inside metadata | outside metadata | JSON manifest (but faulty) |
| SWA | N/A | N/A | N/A | unknown, is faulty |
| Olive | N/A | N/A | outside metadata | Page manifest |
| Tetml | N/A | N/A | N/A | Page manifest |
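
A note on the BNF row: although `c` is missing there, the region is still embedded in the stored JPG link, so the coordinates could in principle be recovered from it. A minimal sketch, assuming the link ends with the Image API tail {region}/{size}/{rotation}/{quality}.{format} and that the region is given in pixels; `coords_from_iiif_link` is a hypothetical helper, not existing importer code:

```python
import re
from typing import List, Optional

# Matches the {region}/{size}/{rotation}/{quality}.{format} tail of an
# IIIF image URL when the region is given as x,y,w,h in pixels.
_REGION_RE = re.compile(r"/(\d+),(\d+),(\d+),(\d+)/[^/]+/[^/]+/[^/]+\.\w+$")


def coords_from_iiif_link(iiif_link: str) -> Optional[List[int]]:
    """Return [x, y, w, h] if the link ends with a pixel region, else None."""
    match = _REGION_RE.search(iiif_link)
    if match is None:
        return None
    return [int(value) for value in match.groups()]


# Link taken from the BNF example above.
bnf_link = (
    "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f12/"
    "2753,4963,3010,3273/full/0/default.jpg"
)
print(coords_from_iiif_link(bnf_link))
```

Links that point to info.json carry no region, so the helper returns None for them.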

What is expected

The expected output according to issues #104 and #105 is

  {
    "m": {
      "id": {val},
      "pp": [{val}], "tp": "image", "t": {val},
      "iiif_link": {val},
      "c": {val}
    }
  }

However, the code of the rebuilder and of impresso-images suggests that the coordinates are expected to be outside of the content item's metadata, i.e.:

  {
    "m": {
      "id": {val},
      "pp": [{val}], "tp": "image", "t": {val},
      "iiif_link": {val}
    },
    "c": {val}
  }

In addition, this issue suggests the coordinates should be inside the metadata, and a test was created to enforce it for RERO.
As for the Page iiif links, I have not found which module uses them, so I don't know whether they should map to the manifest or to the image of the full page.
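
Until the placement is settled, downstream code could read both layouts shown above. A minimal sketch, assuming content items are plain dicts as in the examples; `read_image_ci` is a hypothetical name and the lookup order (metadata first) is an arbitrary choice:

```python
from typing import Any, Dict, List, Optional, Tuple


def read_image_ci(ci: Dict[str, Any]) -> Tuple[Optional[str], Optional[List[int]]]:
    """Return (iiif_link, coords) whichever placement the importer used."""
    meta = ci.get("m", {})
    # Look inside the metadata first, then fall back to the top level.
    iiif_link = meta.get("iiif_link", ci.get("iiif_link"))
    coords = meta.get("c", ci.get("c"))
    return iiif_link, coords


# Layout found in s3://canonical-data for RERO (link inside "m", c outside).
stored = {
    "m": {
        "id": "VHT-1939-01-06-a-i0010",
        "iiif_link": "https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0004/info.json",
    },
    "c": [172, 1618, 201, 226],
}
print(read_image_ci(stored))
```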

Proposed Approach

(feedback welcome)

Ideally, no canonical data that is working without issues is re-ingested, unless really necessary.
The code of all relevant importers is aligned with the BNL situation, as it is the one expected by the downstream tasks. Additionally, the RERO canonical data in the bucket s3://canonical-data already matches it, which avoids substantial re-running.
- Note: It would be necessary to conduct a more in-depth check that all the RERO data in s3://canonical-data follows the same structure.
A function can be implemented in the rebuilder to accommodate or check for some of the cases here, as to prevent errors.
Only BNF and BNF-EN are re-ingested to correct the various issues (in particular with the page iiif link) and unify the importers. Since they are new to this release and represent much less data, this should be more feasible.
SWA is patched to fix the page links.
A few questions remain: @mromanello @e-maud
- What is the Page iiif link expected to link to?
- Was a patch made for the canonical RERO data to modify it, or was it re-ingested (if so with which version of the code, I have found none that fits the current format in s3://canonical-data)?
- Where can I find information about the iiif page links for SWA data?
- For BNF data, should the "manifest.json" or "info.json" be preferred and kept?

Small update regarding SWA:

Notes and discussions from Impresso I were found and contain the following information: (NB: it's not assured this information is still correct).

The IIIF presentation API is at the following address: https://ub-iiifpresentation.ub.unibas.ch/impresso_sb/collection/
From there, one can access the json manifests corresponding to specific titles:
- https://ub-iiifpresentation.ub.unibas.ch/impresso_sb/collection/arbeitgeber/
- https://ub-iiifpresentation.ub.unibas.ch/impresso_sb/collection/handelsztg/
Within these manifests, the list of individual issues is under manifests, where the key id holds the link to the specific issue's JSON manifest:
- https://ub-iiifpresentation.ub.unibas.ch/impresso_sb/handelsztg-0001-a-issue/manifest/
- Each issue's JSON manifest contains metadata about the issue, including iiif links to images or Alto XMLs, but none seem to work (image iiif links yield 404 errors, Alto XML links yield 403 (forbidden) errors, and "service" links are the same as the ones contained in the canonical data).

Unfortunately, the available notes do not mention that the links are not working.

We can reach another JSON manifest by adding ".tif" to the end of the current links. Adding ".jp2" or ".png" yields 403 errors again, so it is probable that only authenticated users can access them. It is still unclear whether the links in the canonical data were left as-is on purpose or were supposed to be modified to map to the JPEG 2000 files.

Thanks a lot for the thorough review of this confusing situation. Here are a few comments and suggestions:

iiif is only needed for the web interface currently: The schema underdefines it completely: https://github.com/impresso/impresso-schemas/blob/master/docs/page.md#iiif
- I asked @eliaskreyenbuehl by e-mail whether he can help (he was our partner at SWA for these things)
- We need to specify which part of IIIF is expected. A simple way would be to provide the base URI of the image on the IIIF server, which typically ends with an identifier that is unique to the image. From this, we can construct a typical JPG URL by appending /full/full/0/default.jpg, or append info.json to get more information on the page. The suggestion would therefore be not to use a page image URL directly as the iiif property. What do the others @danieleguido @e-maud think?
  - The iiif in the issue.json could be different from the iiif link in the page because they serve slightly different contexts (issue vs single page). But the middleware/frontend might have certain expectations. At the level of issues, the IIIF manifest might be the corresponding thing. At the level of pages, the image info.json might be the relevant information.
  - The 'm' metadata property only requires the following keys in the issue schema: "id", "pp", "tp". So there are additional things that were kept by the converter, but they cannot be relied on. The true coordinates probably need to be fetched from the info.json. The coordinates in the issue are not really relevant. The coordinates that we can rely on are in the page jsonl.
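
To illustrate the base-URI suggestion above, here is a minimal sketch of how both URLs could be derived from a stored base. The function names are hypothetical, and the default `size="full"` is an assumption that may not hold for every server:

```python
def iiif_info_url(base_uri: str) -> str:
    """Return the info.json URL for an image base URI."""
    return base_uri.rstrip("/") + "/info.json"


def iiif_image_url(
    base_uri: str,
    region: str = "full",
    size: str = "full",
    rotation: int = 0,
    quality: str = "default",
    fmt: str = "jpg",
) -> str:
    """Return an image URL {base}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{base_uri.rstrip('/')}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Base URI taken from the BNF-EN example earlier in this issue.
base = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4"
print(iiif_info_url(base))
print(iiif_image_url(base))
print(iiif_image_url(base, region="155,1515,1036,182"))
```

The three printed URLs correspond to the info.json, the full-page JPG and the content-item JPG quoted in the BNF-EN section.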

Thank you for the feedback and information:

Thank you, the insights concerning SWA could be very helpful indeed.
I agree that what the IIIF should map to is unclear and should be specified in more detail.
- We could indeed keep only the prefix and add the required suffix in downstream tasks (namely the interface). However, with new incoming data, not all IIIF endpoints are guaranteed to work with the info.json suffix (this is the case, for example, for BCUL, which uses "/manifest"). It is not necessarily more complex to adapt to each endpoint in the downstream tasks than when creating the canonical data, but this is worth taking into consideration.
- Since iiif links are present both in the Issue schema (at the content item level) and in the Page schema (at the page level), I agree that using different iiif URLs for the two also makes sense. For example:
  - Use info.json for the Issue JSON, so that metadata can be fetched for the interface.
    - Note: In the rebuilder script, these links are currently modified and reconstructed to have the suffix {coords}/full/0/default.jpg for each content item in the rebuilt Content Item JSONs.
  - Use the page's full image for the Page JSON, using suffix /full/full/0/default.jpg.
Indeed, more information is currently stored in the 'm' metadata property. I briefly discussed this with @mromanello, and it was kept for verification/debugging purposes. The coordinates should be correct, however, since they are also converted when necessary. I'm not sure whether the coordinates in the info.json also need conversion; I can look more into this.

News from SWA (translated from German): They actually adopted the newest standards... the contact person is martin.reisacher@unibas.ch . I think from our side nothing speaks against ark persistent identifiers for the image pages. As long as we can map them from the old ones, it seems a trivial one-time change.

Dear Simon, dear Elias,

I wanted to earn the badge (at least halfway) :-) the redirection to info.json doesn't work correctly without the image end. I would have to have a look at this, but it could take a while, I'm not sure whether sipi or our resolver is buggy here. With info.json or .jpx it works, and the direct image links also work. This was probably not noticed because it worked / was tested in the universal viewer at the time.
- info.json without image link (refers to the jpx version): https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746/info.json
- Image: https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746/full/max/0/default.jpg
- with jpx extension: https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746.jpx

The references for the XML files were missing in the resolver. These should now work. But I have only made random samples:
- https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000059110_1952_1121.xml
- https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000059110_1908_0033.xml
- https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746.xml

In general, the Impresso project is only on my radar at the moment. We are currently rebuilding our IIIF infrastructure and would also like to introduce persistent identifiers (arks) to ensure better persistence. We are also constantly considering whether there are ways to publish the data differently (as there are always

Thank you Simon for this update and Martin's contact.

I also agree there is no issue in changing to ark identifiers, especially before SWA data is on the interface. I can contact Martin directly to find out when this change to ark identifiers would take place, so that we know what the best approach would be.

Also good to know which suffix is necessary to make the current links work (in particular that SWA needs /full/max/0.. instead of /full/full/0...). This means the rebuilder needs to take into account which title it is rebuilding when constructing the final link.

If the IIIF servers have different URLs, we should maybe add a property to the data that specifies how to access the info or the images. There are probably only a few ways.

Yes there are a limited number of ways, but all the necessary information is already present in the canonical data, so it should not be necessary to add a new property.

I just updated the rebuilder's approach to constructing the image iiif links so that small inconsistencies don't break it, and to change the suffix used if necessary.
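
Roughly, the idea can be sketched as follows. This is a simplified illustration and not the actual rebuilder code: the helper names, the list of suffixes and the title-to-size mapping are assumptions, and the SWA coordinates in the usage example are made up.

```python
from typing import Sequence

# Suffixes seen on stored links in this thread; stripped to recover the base URI.
_KNOWN_SUFFIXES = (
    "/info.json",
    "/full/full/0/default.jpg",
    "/full/max/0/default.jpg",
)

# Hypothetical per-title override: the SWA server expects "max" rather than "full".
_SIZE_BY_TITLE = {"arbeitgeber": "max", "handelsztg": "max"}


def base_uri(link: str) -> str:
    """Strip a trailing slash and any known suffix from a stored iiif link."""
    link = link.rstrip("/")
    for suffix in _KNOWN_SUFFIXES:
        if link.endswith(suffix):
            return link[: -len(suffix)]
    return link


def ci_image_link(link: str, coords: Sequence[int], title: str) -> str:
    """Build the JPG link of a content item from a stored link and its coordinates."""
    region = ",".join(str(int(value)) for value in coords)
    size = _SIZE_BY_TITLE.get(title, "full")
    return f"{base_uri(link)}/{region}/{size}/0/default.jpg"


print(ci_image_link(
    "https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0004/info.json",
    [172, 1618, 201, 226],
    "VHT",
))
print(ci_image_link(
    "https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000059110_1907_0001",
    [0, 0, 100, 100],  # made-up coordinates
    "arbeitgeber",
))
```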

Many thanks for this very detailed and useful investigation :)

It's good to have a fresh set of eyes on it, and a good time to refactor some of the code that has evolved organically; back in the day, no institution had a IIIF endpoint, and then it happened more and more, leading to code changes and a not-so-clean situation that now needs to be updated.

About the `iiif_link` property

My first thought is that we could specify the iiif_link property further (independently of where it is placed). It may have gone a bit under the radar, but manifest and info files are two different things in the IIIF APIs.
Just to be aligned (sorry if already known):

the Image API (v2.1.1 and v3.0) defines, among other things:
- the Image Information URL request, serving the info.json URL, which gives information about an image file:
  {scheme}://{server}{/prefix}/{identifier}/info.json
- the image URL serving the image itself:
  {scheme}://{server}/{prefix}/{identifier}/full/{width},/0/default.jpg
the Presentation API defines, among others, the manifest which describes the structure and properties of a compound digital object. For newspapers, the 'normal' or recommended structure is:
- a newspaper title = a IIIF collection, which contains:
- issues = list of IIIF manifests, which provide metadata info (here on the issue) and contains:
- sequence of canvas, that have annotations, among other images, with their info.json and image files.

The fact that both info.json and manifest information are stored in the same JSON property should be corrected. A suggestion:

at issue level: having a [iiif-manifest] property.
at page level: having a [iiif-img] property, with the base image URL ({scheme}://{server}/{prefix}/{identifier}/), to be suffixed depending on the need.

Institutions may differ w.r.t how they expose and structure their newspaper collections via the Presentation API (i.e. what is a collection for them), but this can be found.

What we use (or not) and where

Presently there are two main usages of IIIF image links:

1. For the interface viewer component.

(a) For images served via the impresso image viewer, the URL construction is always the same, and Daniele knows how to build the {identifier} part of the URL given a set of information retrieved from MySQL and Solr. Depending on what is displayed (an article or a page), the URL is built. I think that the info.json is used by the viewer to get resolution information and adapt. In any case, the base image URL is the same for the info.json or the image itself (one of the nice things about the Image API).

To build the {identifier}, the newspaper structure (title>issues>pages>article) is found in MySQL. To know which part of an image to show, the coordinates of token, lines, regions, and articles are found in SOLR main index.

(b) For images served via other IIIF servers, where the URL is built from (unknown) arks, Daniele cannot guess the base URL, so it is stored alongside each page in MySQL. Concretely speaking, the full URL up to .../info.json is stored (comment on this below).

Three remarks here:

- In MySQL, the page table property where the info.json URL is stored is wrongly called "manifest". This is confusing but not a big drama. There should be a manifest property on the issue table, but we do not really need it.
- After the storage of the info.json URL was introduced in MySQL to cover the BNL case, the page MySQL records whose images are on the impresso image server (a) were also populated with a base URL. I do not know if Daniele kept the two ways of getting the image URL or just one; I think it is better to ask in Slack (TL;DR here).
- In MySQL, in the manifest page property, both types of URL are stored: the full URL ({scheme}://{server}{/prefix}/{identifier}/info.json) for BNL data, and the URL without suffix for CH data ({scheme}://{server}{/prefix}/{identifier}/). Not sure what is best; perhaps it is quicker for the front-end to already have the .json. To be asked again (all this was built from scratch and assembled little by little as needs arose; now we have the occasion to look back and consolidate).

2. For image (=pictures on the scans) processing.

IIIF image links of image objects are stored in SOLR in the image collection. A picture is better:

Recap

The canonical is the source of everything:
- MySQL newspaper structure is fed from canonical.
- SOLR coordinates info of content items is fed from rebuilt, done from the canonical.
- SOLR image collection link is fed from canonical.
What is necessary for what:
- MySQL needs the image base URL ({scheme}://{server}/{prefix}/{identifier}/) at page level.
- SOLR image collection needs the image base URL and the coordinate of the content item of type i.
To answer the questions (finally ;):

What is the Page iiif link expected to link to?

To the image URL of the Image API. I think the base part suffices: {scheme}://{server}/{prefix}/{identifier}/.

Was a patch made for the canonical RERO data to modify it, or was it re-ingested (if so with which version of the code, I have found none that fits the current format in s3://canonical-data)?

Joker for now :)

Where can I find information about the iiif page links for SWA data?

I think Simon answered, but you can also contact Martin.

For BNF data, should the "manifest.json" or "info.json" be preferred and kept?

The info.json. I am not sure why the manifest was stored, I suppose because of terminology confusion.

And from slack:

The issues' iiif manifest urls are used in the Solr, anywhere else? What information is expected to be in the manifest?

Presently we do not use issue manifest information. But I'd say it would be good to store them in canonical. What is expected to be there is information about the digital object (of type metadata).

Where are the pages' iiif links used? Should they map directly to the image of the page or also to a manifest?

Same response as above + no, no link to a manifest.

Thank you for reaching the end of this comment :sweat_smile: - happy to discuss this orally to really be sure.

My last comment would be whether adding an IIIF major version to our data would allow us to predict the relevant URL for retrieving image information. I think the canonical format should be self-contained (meaning not relying on external DBs), given that the interface is only one way we present our data; APIs are another.

Following our meeting discussing this, here are the decisions that were taken regarding iiif URIs:

In the Issue Canonical Schema:

Add the property iiif_manifest_uri at the top level, when the corresponding IIIF server has a presentation API. It contains the IIIF presentation URI for the issue.
The iiif_link property stays defined at the content-item (metadata) level when the given content item is an image. It contains the IIIF image URI: {scheme}://{server}/{prefix}/{identifier}/info.json.

In the Page Canonical Schema:

Add the property iiif_img_base_uri at the top level. It contains the IIIF image base URI: {scheme}://{server}/{prefix}/{identifier} and is then modified downstream for the various tasks.
The old iiif property is kept as deprecated.

Documentation: Create a "IIIF Phonebook" documenting all IIIF endpoints/links etc for the various providers

Side note: Since multiple additions and modifications will be made to almost all existing canonical data, this may be a good opportunity to place the c property of image content items correctly (i.e. inside the metadata, alongside iiif_link). Initial discussions/decisions leaned towards keeping it outside, since this is how downstream code expected it and all the already-ingested data concerned (RERO, Olive, BNL) had it outside of the metadata already. Given the conclusions of the current discussion, I think it makes sense to reconsider this. Pros:

Uniformity of canonical data, with iiif_link and c properties at the same level in the schema (more intuitive)
Taking the opportunity of the other changes to correct the canonical data at large scale.

Cons:

Would represent quite a large patching of data, since a lot of data is concerned.
Already other modifications will be done to the data, might get confusing which data is being changed in what way.
BNL data will be probably re-ingested anyway since it has been re-OCRised.
Downstream code needs to be changed.

Some conclusions regarding IIIF information:

Issue level:
Addition of the manifest URI for each issue, when the institution exposes the IIIF presentation API (Presentation API 3.0 manifest documentation).

Property name: iiif_manifest_uri.
Value: Manifest URI.
Example: from e-newspaperarchives, from ONB, from BnF.

Page level:
As before, storage of the image URI (from the Image API), but in a more consistent way.

Property name: iiif_img_base_uri.
Value: only the base ({scheme}://{server}/{prefix}/{identifier}), no info.json or default.jpg

Content item level

For all content items, coordinates are stored as before.
For content items of type image, the complete image URI with the coordinates of the items is built. This is used by the ingestion code to the image solr collection. If not possible, the link is reconstructed on the fly from the coordinates and the base URI.

Collection level:
No link toward the collection representation, since it can be retrieved from the manifest URI (within).

Version of the IIIF API:
No need to store it, it can be known by the client via the @context key which should be present in the response.
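
As an illustration of reading the version from a response, here is a minimal sketch that assumes the standard IIIF context URIs of the form http://iiif.io/api/{image|presentation}/{major}/context.json; `iiif_api_version` is a hypothetical helper:

```python
import re
from typing import Any, Dict, Optional, Tuple

_CONTEXT_RE = re.compile(r"iiif\.io/api/(image|presentation)/(\d+)/context\.json")


def iiif_api_version(response: Dict[str, Any]) -> Optional[Tuple[str, int]]:
    """Return (api, major_version) read from @context, or None if not found."""
    context = response.get("@context")
    # @context may be a single string or a list of strings.
    candidates = [context] if isinstance(context, str) else list(context or [])
    for value in candidates:
        match = _CONTEXT_RE.search(value) if isinstance(value, str) else None
        if match:
            return match.group(1), int(match.group(2))
    return None


print(iiif_api_version({"@context": "http://iiif.io/api/image/2/context.json"}))
```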

Some information on the frontend

the base image URI is retrieved from MySQL. No need to have the info.json suffix.
the complete image URI for content items of type image is retrieved from solr image collection.

Really sorry @piconti, I did not refresh my page and did not see your summary before commenting, so mine now kind of duplicates yours. But better too much information than not enough (!)

No problem! Having both our summaries highlights that we were not 100% clear on the iiif link for the image content items in the Issue canonical data: do we construct them directly all the way up to default.jpg?

If yes:
- Code in impresso-images would need modifications
- Current canonical data for RERO and BNL would need modification
If no (keeping info.json):
- Ingestion code for image Solr collection would need modification, probably similar to these.
- Slightly less efficient to reconstruct links twice

To add my 5 cents: I would rather go for the image_base_uri consistently.

A google sheet document was created summarizing all problems that are currently in the data, including regarding this issue, as well as the fixing approach for each.

After discussion, it was decided that the mentioned properties iiif_manifest_uri and iiif_img_base_uri would only be added when necessary to the data. As a result, the following modifications will be done to the canonical data (and rebuilt data):

- [x] BNL: Reingestion to correct the various issues.
- [x] BNF: Reingestion to correct the various issues.
- [x] BNF-EN: Reingestion to correct the various issues.
- [x] SWA: Patch at the issue-level to add the iiif_manifest_uri property.
- [x] FedGaz: Patch at the page-level to add the iiif_img_base_uri property (iiif was missing).

Upcoming new data ingestions (BCUL, BL, ONB, KB) will have both properties.

@piconti Just to be sure: Will every JSON of a page have a valid IIIF URI in the end (meaning that, e.g. for re-OCRing, we only need access to the rebuilt pages and do not have to look things up in MySQL)?

Currently, no IIIF links are in the rebuilt data at all (I should have specified that the rebuilt data will be modified when applicable, here in the cases of reingestion).

However, yes, all pages (in canonical format) will have a valid IIIF URI, either in the original iiif property or in the new iiif_img_base_uri (which should be preferred to iiif whenever available).

Is there an existing approach to re-OCRing, or other uses, for which you would need the page iiif URI? If yes, the modifications to your current approach should be minimal if you fetch it from the canonical data (conditionally change the property you fetch the URI from based on the presence of iiif_img_base_uri), and none otherwise. If not, and you would need the page's iiif link to be present in the rebuilt data, we could discuss this specifically. (We could for instance add iiif_img_base_uri to the ppreb property of the rebuilt schema, but it would mean substantial patching as this is not present anywhere at the moment.)
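
The conditional fetch mentioned above could be as small as this sketch, assuming a canonical page loaded as a dict; `page_iiif_uri` is a hypothetical name:

```python
from typing import Any, Dict, Optional


def page_iiif_uri(page: Dict[str, Any]) -> Optional[str]:
    """Prefer iiif_img_base_uri when present, else fall back to the old iiif."""
    return page.get("iiif_img_base_uri") or page.get("iiif")


# Old-style page, as in the RERO example earlier in this issue.
old_page = {"iiif": "https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0001"}
print(page_iiif_uri(old_page))
```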

Oops, I meant canonical pages, sorry for the confusion.

impresso / impresso-text-acquisition