Closed piconti closed 4 months ago
Small update regarding SWA:
Notes and discussions from Impresso I were found and contain the following information (NB: there is no guarantee that this information is still correct):
`manifests`, where the key `id` holds the link to the specific issue's JSON manifest:
Unfortunately, the available information does not mention that the links do not work.
We can reach another JSON manifest by adding ".tif" to the end of the current links. Adding ".jp2" or ".png" yields 403 errors again, so probably only authenticated users can access them. It is still unclear whether the links in the canonical data were left as-is on purpose or were supposed to be modified to map to the JPEG 2000 files.
Thanks a lot for the thorough review of this confusing situation. Here are a few comments and suggestions:
Thank you for the feedback and information:
- `info.json` suffix (this is the case, for example, for BCUL, which uses "/manifest"). It is not necessarily more complex to adapt to each endpoint in the downstream tasks rather than when creating the canonical, but this is worth taking into consideration.
- `info.json` for the Issue JSON, so that metadata can be fetched for the interface.
- `{coords}/full/0/default.jpg` for each content item in the rebuilt Content Item JSONs.
- `/full/full/0/default.jpg`.

News from SWA (translated from German): They actually adopted the newest standards... the contact person is martin.reisacher@unibas.ch. I think from our side nothing speaks against ark persistent identifiers for the image pages. As long as we can map them from the old ones, it seems a trivial one-time change.
Dear Simon, dear Elias,
I wanted to earn the badge (at least halfway) :-) The redirection to info.json doesn't work correctly without the image extension. I would have to look into this, but it could take a while; I'm not sure whether sipi or our resolver is buggy here. With info.json or .jpx it works, and the direct image links also work. This was probably not noticed because it worked / was tested in the Universal Viewer at the time.

- info.json without image link: https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746/info.json (refers to the jpx version)
- Image: https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746/full/max/0/default.jpg
- With jpx extension: https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746.jpx

The references for the XML files were missing in the resolver. These should now work, but I have only checked random samples:

- https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000059110_1952_1121.xml
- https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000059110_1908_0033.xml
- https://ub-resolver.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746.xml
In general, the Impresso project is only on my radar at the moment. We are currently rebuilding our IIIF infrastructure and would also like to introduce persistent identifiers (arks) to ensure better persistence. We are also constantly considering whether there are ways to publish the data differently (as there are always
Thank you Simon for this update and Martin's contact.
I also agree there is no issue in changing to ark identifiers, especially before SWA data is on the interface. I can contact Martin directly to find out when this change to ark identifiers would take place, so that we can decide on the best approach.
It is also good to know which suffix is necessary to make the current links work (in particular that SWA needs `/full/max/0...` instead of `/full/full/0...`).
This means the rebuilder needs to take into consideration which title it's rebuilding when constructing the final link.
If the IIIF servers have different URLs, we should maybe add a property to the data that specifies how to access info or images. There are probably only a few ways.
Yes there are a limited number of ways, but all the necessary information is already present in the canonical data, so it should not be necessary to add a new property.
I just updated the rebuilder's approach to constructing the image iiif links so that small inconsistencies don't break it, and to change the suffix used if necessary.
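As an illustration of the suffix handling discussed here, the following is a minimal sketch, not the rebuilder's actual code. The function name, the `provider` argument and the lookup table are hypothetical; only the fact that SWA needs `max` where other links use `full` comes from this thread.

```python
# Hypothetical lookup: IIIF "size" keyword accepted by each provider's server.
# Only the SWA entry is taken from the discussion; the default is an assumption.
SIZE_KEYWORD_BY_PROVIDER = {"SWA": "max"}
DEFAULT_SIZE_KEYWORD = "full"


def build_image_url(base_uri, provider, coords=None):
    """Build a IIIF image URL for a full page, or for a region if coords is given.

    coords is an (x, y, width, height) sequence in pixels.
    """
    # Tolerate small inconsistencies in the stored link: a trailing slash
    # or an already appended "/info.json".
    base = base_uri.rstrip("/")
    if base.endswith("/info.json"):
        base = base[: -len("/info.json")]

    region = "full" if coords is None else ",".join(str(int(c)) for c in coords)
    size = SIZE_KEYWORD_BY_PROVIDER.get(provider, DEFAULT_SIZE_KEYWORD)
    return f"{base}/{region}/{size}/0/default.jpg"


swa_url = build_image_url(
    "https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000094152_1873_1746", "SWA"
)
bnf_crop_url = build_image_url(
    "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/info.json",
    "BNF",
    coords=(10, 20, 300, 400),
)
```

Keeping the per-provider differences in one table means the rest of the code only ever deals with the base part of the link.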
Many thanks for this very detailed and useful investigation :)
It's good to have a fresh set of eyes on it, and a good time to refactor some of the code that has evolved organically; back in the day, no institution had a IIIF endpoint, and then it happened more and more, leading to code changes and a not-so-clean situation that now needs to be updated.
`iiif-link` property
My first thought is that we could further specify the `iiif-link` property (independently of where it is). It seems a bit under the radar, but manifest and info files are 2 different things in the IIIF APIs.
Just to be aligned (sorry if already known):
- `{scheme}://{server}{/prefix}/{identifier}/info.json`
- `{scheme}://{server}/{prefix}/{identifier}/full/{width},/0/default.jpg`
- `canvas`, that have annotations, among other images, with their `info.json` and image files.

The fact that there are both `info.json` and `manifest` infos in the same JSON property should be corrected.
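To make the distinction concrete, here is a small sketch that sorts a stored link into one of these kinds. It is only a heuristic: the function name and the returned labels are made up for this example, and the manifest check relies on a naming convention (`/manifest` or `/manifest.json`) that the Presentation API does not mandate.

```python
import re

# Last four path segments of an Image API request:
# {region}/{size}/{rotation}/{quality}.{format}
_IMAGE_REQUEST = re.compile(
    r"/(full|square|(pct:)?[\d.]+(,[\d.]+){3})"  # region
    r"/[^/]+"                                     # size
    r"/!?[\d.]+"                                  # rotation
    r"/(default|color|gray|bitonal)\.[a-z0-9]+$"  # quality.format
)


def classify_iiif_url(url):
    """Return 'image-info', 'manifest', 'image' or 'other' for a stored link."""
    path = url.split("?", 1)[0].rstrip("/")
    if path.endswith("/info.json"):
        return "image-info"
    # Convention only: manifests may live at any URL.
    if path.endswith("/manifest.json") or path.endswith("/manifest"):
        return "manifest"
    if _IMAGE_REQUEST.search(path):
        return "image"
    # Anything else, e.g. a bare base URI without suffix.
    return "other"


kinds = [
    classify_iiif_url("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/info.json"),
    classify_iiif_url("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/full/full/0/manifest.json"),
    classify_iiif_url("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/full/full/0/default.jpg"),
    classify_iiif_url("https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000059110_1907_0001"),
]
```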
A suggestion:
- `[iiif-manifest]` property.
- `[iiif-img]` property, with the base image URL (`{scheme}://{server}/{prefix}/{identifier}/`), to be suffixed depending on the need.

Institutions may differ w.r.t. how they expose and structure their newspaper collections via the Presentation API (i.e. what is a collection for them), but this can be found out.
Presently there are two main usages of IIIF image links:
1. For the interface viewer component.
(a) For images served via the impresso image viewer, the URL construction is always the same, and Daniele knows how to build the `{identifier}` part of the URL given a set of information retrieved from MySQL and Solr. The URL is built depending on what is displayed (an article or a page). I think that the info.json is used by the viewer to get resolution information and adapt. In any case, the base image URL is the same for the info.json or the image itself (one of the nice things about the Image API).
To build the `{identifier}`, the newspaper structure (title > issues > pages > article) is found in MySQL.
To know which part of an image to show, the coordinates of tokens, lines, regions, and articles are found in the Solr main index.
(b) For images served via other IIIF servers, where the URL is built from (unknown) arks, Daniele cannot guess the base URL, so it is stored alongside each page in MySQL. Concretely speaking, the full URL up to `.../info.json` is stored (comment on this below).
Three remarks here:
- In MySQL, the `page` table property where the `info.json` URL is stored is wrongly called "manifest". This is confusing but not a big drama. There should be a `manifest` property on the `issue` table, but we do not really need it.
- After having introduced the storage of the `info.json` URL in MySQL to cover the BNL case, the page MySQL records whose images are on the impresso image server (a) were also populated with a base URL. I do not know if Daniele kept the 2 ways of getting the image URL or just one; I think it is better to ask in Slack (TL;DR here).
- In MySQL, in the `manifest` page property, both types of URL are stored: the full URL (`{scheme}://{server}{/prefix}/{identifier}/info.json`) for BNL data, and the URL without suffix for CH data (`{scheme}://{server}{/prefix}/{identifier}/`). Not sure what is best; perhaps it is quicker for the front-end to already have the .json. This needs to be asked again (all this was built from scratch and assembled little by little as needs/things were coming, now we have the occasion to look back and consolidate).
2. For image (=pictures on the scans) processing.
IIIF image links of image objects are stored in SOLR in the image collection. A picture is better:
`canonical` is the source of everything:
- `canonical`.
- `rebuilt`, done from the `canonical`.
- `canonical`.
- `{scheme}://{server}/{prefix}/{identifier}/`) at page level.
- `i`.

What is the Page iiif link expected to link to?
To the image URL of the Image API. I think the base part suffices: `{scheme}://{server}/{prefix}/{identifier}/`.
Was a patch made for the canonical RERO data to modify it, or was it re-ingested (if so with which version of the code, I have found none that fits the current format in s3://canonical-data)?
Joker for now :)
Where can I find information about the iiif page links for SWA data?
I think Simon answered, but you can also contact Martin.
For BNF data, should the "manifest.json" or "info.json" be preferred and kept?
The `info.json`. I am not sure why the manifest was stored, I suppose because of terminology confusion.
And from slack:
The issues' iiif manifest urls are used in the Solr, anywhere else? What information is expected to be in the manifest?
Presently we do not use issue manifest information. But I'd say it would be good to store them in canonical. What is expected to be there is information about the digital object (of type metadata).
Where are the pages' iiif links used? Should they map directly to the image of the page or also to a manifest?
Same response as above + no, no link to a manifest.
Thank you for reaching the end of this comment :sweat_smile: - happy to discuss this orally to really be sure.
My last comment would be to ask whether adding an IIIF major version to our data would allow predicting the relevant URL for retrieving image information. I think the canonical format should be self-contained (meaning not relying on external DBs), given that the interface is only one way of presenting our data; APIs are another.
Following our meeting discussing this, here were the decisions that were taken regarding iiif URIs:
In the Issue Canonical Schema:
- `iiif_manifest_uri` at the top level, when the corresponding IIIF server has a presentation API. It contains the IIIF presentation URI for the issue.
- The `iiif_link` property stays defined at the content-item (metadata) level when the given content item is an image. It contains the IIIF image URI: `{scheme}://{server}/{prefix}/{identifier}/info.json`.

In the Page Canonical Schema:
- `iiif_img_base_uri` at the top level. It contains the IIIF image base URI: `{scheme}://{server}/{prefix}/{identifier}` and is then modified downstream for the various tasks.
- The `iiif` property is kept as deprecated.

Documentation: Create a "IIIF Phonebook" documenting all IIIF endpoints/links etc. for the various providers.
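For the page-level decision, one possible way to derive a base URI from the links that already exist in the data is sketched below. The helper name is hypothetical and this is not the patching code that was actually used; it only strips the suffixes that appear in this thread (`info.json`, an image request ending in `default.<format>`, and the BNF-style `full/full/0/manifest.json`).

```python
import re

# Suffixes seen in existing page links: "/info.json", or three path segments
# followed by "default.<format>" or "manifest.json".
_KNOWN_SUFFIX = re.compile(
    r"/(info\.json|[^/]+/[^/]+/[^/]+/(default\.[a-z0-9]+|manifest\.json))$"
)


def to_img_base_uri(link):
    """Reduce an existing page link to {scheme}://{server}/{prefix}/{identifier}."""
    return _KNOWN_SUFFIX.sub("", link.rstrip("/"))


bases = [
    to_img_base_uri("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/full/full/0/manifest.json"),
    to_img_base_uri("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/full/full/0/default.jpg"),
    to_img_base_uri("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/info.json"),
    to_img_base_uri("https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0001"),
]
```

Links that carry none of these suffixes are returned unchanged, so the helper can be applied to all providers without special cases.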
Side-Note: Since multiple additions and modifications will be made to almost all existing canonical data, this may be a good opportunity to actually place the `c` property correctly for image content items (i.e. inside the metadata, with the `iiif_link`).
Initial discussions/decisions leaned towards keeping it outside, since it was how downstream code expected it and all concerned already-ingested data (RERO, Olive, BNL) had it outside of the metadata already.
Given the conclusions of the current discussion, I think it makes sense to reconsider this.
Pros:
- `iiif_link` and `c` properties at the same level in the schema (more intuitive)

Cons:
Issue level:
- Addition of the manifest URI for each issue, when the institution exposes the IIIF presentation API (Presentation API 3.0 manifest documentation).
- `iiif_manifest_uri`.

Page level:
- As before, storage of the image URI (from the Image API), but in a more consistent way.
- `iiif_img_base_uri`.
- `{scheme}://{server}/{prefix}/{identifier}`), no info.json or default.jpg

Content item level

Collection level:
- No link toward the collection representation, since it can be retrieved from the manifest URI (within ).

Version of the IIIF API:
- No need to store it; it can be known by the client via the @context key, which should be present in the response.
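As a sketch of how a client could read the version from the response, assuming the `info.json` or manifest has already been fetched and parsed into a dict (the function name is made up for this example):

```python
import re

_CONTEXT = re.compile(r"iiif\.io/api/(image|presentation)/(\d+)/context\.json")


def iiif_api_version(document):
    """Return (api, major_version) read from @context, or None if not found."""
    context = document.get("@context")
    # @context may be a single string or a list of context URIs.
    values = [context] if isinstance(context, str) else list(context or [])
    for value in values:
        match = _CONTEXT.search(str(value))
        if match:
            return match.group(1), int(match.group(2))
    return None


image_v2 = iiif_api_version({"@context": "http://iiif.io/api/image/2/context.json"})
presentation_v3 = iiif_api_version(
    {
        "@context": [
            "http://www.w3.org/ns/anno.jsonld",
            "http://iiif.io/api/presentation/3/context.json",
        ]
    }
)
```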
A few info on frontend
Really sorry @piconti, I did not refresh my page and did not see your summary before commenting; now mine kind of duplicates yours. But better too much information than not enough (!)
No problem! Having both our summaries highlights that we were not 100% clear on the iiif link for the image content items in the Issue Canonical data: do we construct them directly, all the way up to the default.jpg?
To add my 5 cents: I would rather go for the image_base_uri consistently.
A google sheet document was created summarizing all problems that are currently in the data, including regarding this issue, as well as the fixing approach for each.
After discussion, it was decided that the mentioned properties `iiif_manifest_uri` and `iiif_img_base_uri` would only be added to the data when necessary.
As a result, the following modifications will be done to the canonical data (and rebuilt data):
- `iiif_manifest_uri` property.
- `iiif_img_base_uri` property (`iiif` was missing).

Upcoming new data ingestions (BCUL, BL, ONB, KB) will have both properties.
@piconti Just to be sure: will every JSON of a page have a valid iiif URI in the end (meaning, e.g. for re-OCRing, we only need access to the rebuilt pages and do not need to look things up in MySQL)?
Currently, no IIIF links are in the rebuilt data at all (I should have specified that the rebuilt data will be modified when applicable, here in the cases of reingestion).
However, yes, all pages (in canonical format) will have a valid IIIF URI, either in the original `iiif` property or in the new `iiif_img_base_uri` (which should be preferred to `iiif` whenever available).
Is there an existing approach to reOCRing, or other uses, for which you would need the page iiif URI?
If yes, the modifications to your current approach should be minimal if you fetch it from the canonical (conditionally modify the property you fetch the URI from based on the presence of `iiif_img_base_uri`), and none otherwise.
If not, and you would need the page's iiif link to be present in the rebuilt data, we could discuss this specifically. (We could for instance add the `iiif_img_base_uri` to the `ppreb` property of the rebuilt schema, but it would mean substantial patching as this is not present anywhere at the moment.)
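The conditional lookup on canonical pages could look like the following minimal sketch, assuming each canonical page has already been loaded as a Python dict. The function name is hypothetical; the two property names are the ones discussed in this thread.

```python
def page_iiif_uri(page):
    """Return the page's IIIF URI, preferring iiif_img_base_uri over iiif."""
    uri = page.get("iiif_img_base_uri") or page.get("iiif")
    if not uri:
        raise KeyError("page has neither 'iiif_img_base_uri' nor 'iiif'")
    return uri


# Made-up example pages; the URLs are placeholders, not real canonical data.
new_style = {"iiif_img_base_uri": "https://example.org/iiif/page-1", "iiif": "https://example.org/old/page-1"}
old_style = {"iiif": "https://example.org/old/page-2"}
uris = [page_iiif_uri(new_style), page_iiif_uri(old_style)]
```

Note that a legacy `iiif` value may still carry a suffix such as `info.json`, so downstream code should not assume it is a bare base URI.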
Oops, I meant canonical pages, sorry for the confusion.
This issue is directly linked to issues #104 and #105, but after a first investigation of the current situation, it seems multiple importers present inconsistencies in where and how they create the iiif links and coordinates. As a result, a general issue covering all importers seemed more appropriate. Sorry in advance, it is very long, but I felt that having the detailed situation written down somewhere would be useful to get a broader view.
Context
The two previous issues #104 and #105 highlighted that some importers generated invalid Issue JSONs for image content items. In particular the iiif-links were not correct, and depending on the importer, iiif-links or coordinates were misplaced in the content-item object in the output JSON.
Current state of the importer code and JSON files
1. BNF
- `iiif_link` contains the link to the JPG image (with suffix=`{coords}/full/0/default.jpg` instead of `info.json`)
- `iiif_link` is placed inside the content item's metadata (`ci['m']['iiif_link']`)
- `c` is not present at all in the content item
- `s3://canonical-data/excelsior/issues/excelsior-1910-issues.jsonl.bz2`:
- `iiif` maps to a "full" manifest, using `IIIF_MANIFEST_SUFFIX = "full/full/0/manifest.json"` (which yields different results compared to using `IIIF_MANIFEST_SUFFIX = "info.json"`).
- `s3://canonical-data/excelsior/pages/excelsior-1910/excelsior-1910-11-16-a-pages.jsonl.bz2`: "iiif": "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/full/full/0/manifest.json" (compared to "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/info.json".)

2. BNF-EN
- `iiif_link` contains the link to the JPG image (with suffix=`{coords}/full/0/default.jpg` instead of `info.json`)
- `iiif_link` is placed outside the content item's metadata (`ci['iiif_link']`)
- `c` is placed inside the content item's metadata (`ci['m']['c']`)
- `s3://canonical-data/jdpl/issues/jdpl-1814-issues.jsonl.bz2`:
- `iiif` maps to the JPG image of the full page, using `IIIF_SUFFIX = "full/full/0/default.jpg"`.
- `s3://canonical-data/jdpl/pages/jdpl-1814/jdpl-1814-05-27-a-pages.jsonl.bz2` for the page corresponding to above example: "iiif": "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/full/full/0/default.jpg" (compared to "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/info.json".)

3. RERO
- `iiif_link` contains the link to the JPG image (with suffix=`{coords}/full/0/default.jpg` instead of `info.json`)
- `iiif_link` is placed outside the content item's metadata (`ci['iiif_link']`)
- `c` is placed inside the content item's metadata (`ci['m']['c']`)
- `s3://playground-pauline/VHT/issues/VHT-1939-issues.jsonl.bz2` (recently generated canonical data using master branch):
- `s3://canonical-data` don't have the same object structure, and no code performing a patch or correcting this issue was found. The differences are the following:
  - `iiif_link` contains the link to the manifest (with suffix=`info.json`), and is placed inside the content item's metadata (`ci['m']['iiif_link']`)
  - `c` is placed outside the content item's metadata (`ci['c']`)
- `s3://canonical-data/VHT/issues/VHT-1939-issues.jsonl.bz2`:
- `iiif` maps to a manifest, using the page's `id` as suffix to the impresso URL endpoint.
- `s3://canonical-data/VHT/pages/VHT-1939/VHT-1939-01-06-a-pages.jsonl.bz2` for the page corresponding to above example: "iiif": "https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0001" (this is unchanged in the recently generated data.)

4. BNL/Lux
- `iiif_link` contains the link to the JSON manifest (with suffix=`info.json`)
- `iiif_link` is placed inside the content item's metadata (`ci['m']['iiif_link']`)
- `c` is placed outside the content item's metadata (`ci['c']`)
- `s3://canonical-data/tageblatt/issues/tageblatt-1913-issues.jsonl.bz2`:
- `iiif` maps to the page's JSON manifest. It should also be corrected (issue #103).
- `s3://canonical-data/VHT/pages/VHT-1939/VHT-1939-01-06-a-pages.jsonl.bz2` for the page corresponding to above example: "iiif": "https://iiif.eluxemburgensia.lu/iiif/2/ark:%2f70795%2ft8mg9c%2fpages%2f3/info.json" (corrected version: "https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2ft8mg9c%2fpages%2f3/info.json")

5. SWA
- `iiif` uses the filename as suffix, using `IIIF_ENDPOINT_URL = "https://ub-sipi.ub.unibas.ch/impresso"`, but it does not seem to work.
- `s3://canonical-data/arbeigeber/pages/arbeitgeber-1907-01-05-a-pages.jsonl.bz2`: "iiif": "https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000059110_1907_0001".

6. Olive
- `c` is placed outside the content item's metadata (`ci['c']`)
- `iiif` maps to a manifest, using the page's `id` as suffix to the impresso URL endpoint. Same as for RERO.

7. TETML (FedGaz)
- `iiif` maps to a manifest, using the page's `id` as suffix to the impresso URL endpoint. Same as for RERO.

Summary
(Table columns: `iiif_link` placement, `c` placement)

What is expected
The expected output according to issues #104 and #105 is:
However, the code of the rebuilder and of impresso-images suggests that the coordinates are expected to be outside of the content item metadata, i.e.:
In addition, this issue suggests the coordinates should be inside the metadata, and a test was created to enforce it for RERO.
As for the page iiif links, I have not found which module uses them; I therefore don't know whether they should map to the manifest or to the image of the full page.
Proposed Approach
(feedback welcome)
- Ideally, no canonical data that is working without issues is re-ingested, unless really necessary.
- The code of all relevant importers is aligned with the BNL situation, as it is the one expected in the downstream tasks. Additionally, RERO's canonical data inside the bucket `s3://canonical-data` already matches it, which prevents substantial re-running.
- `s3://canonical-data` follows the same structure.
- A function can be implemented in the rebuilder to accommodate or check for some of the cases here, so as to prevent errors.
- Only BNF and BNF-EN are re-ingested to correct the various issues (in particular with the page iiif link) and unify the importers. Since they are new to this release and represent much less data, this should be more doable.
- SWA is patched to fix the page links.
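The accommodating function mentioned in the proposal could be as small as the sketch below. It is hypothetical (name and precedence are assumptions): it reads `iiif_link` and `c` from either placement, preferring the BNL layout (`iiif_link` inside `m`, `c` outside), since that is the one described above as expected downstream.

```python
def get_image_ci_fields(ci):
    """Return (iiif_link, coords) for an image content item, whatever the layout.

    Preference follows the BNL layout: ci['m']['iiif_link'] and ci['c'].
    Missing values are returned as None.
    """
    meta = ci.get("m") or {}
    iiif_link = meta.get("iiif_link") or ci.get("iiif_link")
    coords = ci.get("c") or meta.get("c")
    return iiif_link, coords


# Made-up content items illustrating the two layouts found in the data.
bnl_like = {"m": {"iiif_link": "https://example.org/iiif/p1/info.json"}, "c": [1, 2, 3, 4]}
rero_like = {"iiif_link": "https://example.org/iiif/p2/info.json", "m": {"c": [5, 6, 7, 8]}}
fields = [get_image_ci_fields(bnl_like), get_image_ci_fields(rero_like)]
```

A `None` in either position can then be logged or raised on by the caller, which also covers the BNF case where `c` is absent.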
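A few questions remain: @mromanello @e-maud
- What is the Page iiif link expected to link to?
- Was a patch made for the canonical RERO data to modify it, or was it re-ingested (if so with which version of the code, I have found none that fits the current format in `s3://canonical-data`)?
- Where can I find information about the iiif page links for SWA data?
- For BNF data, should the "manifest.json" or "info.json" be preferred and kept?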