impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

Check BCUL IIIF endpoint #119

Closed e-maud closed 6 months ago

e-maud commented 10 months ago

Basic info:

Objectives:

Some tests:

CASE Feuille d'Avis de Lausanne (FAL, 46165)

Open questions

piconti commented 10 months ago

Thank you for this breakdown.

From what I understand, the way to gather the necessary information for the IIIF depends on the type of sample:

Samples where the mit file is in XML, and each page filename contains the titleId, date and page number (e.g. 46165):

Samples where the mit file is in JSON, and the page OCR XML filenames are numeric IDs (e.g. 388793 or 660907):
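As a side note on the first sample type: the titleId, date and page number could be recovered from the filenames with a small parser. The pattern below (`<titleId>_<date>_<page>.xml`) is only a guess for illustration; the real BCUL naming scheme may differ.

```python
import re

# Hypothetical filename pattern "<titleId>_<YYYY-MM-DD>_<page>.xml";
# the actual BCUL convention may differ.
PAGE_RE = re.compile(r"(?P<title>\d+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<page>\d+)\.xml")

def parse_page_filename(name: str) -> dict:
    """Extract titleId, date and page number from an OCR page filename."""
    m = PAGE_RE.match(name)
    if m is None:
        raise ValueError(f"Unexpected filename: {name}")
    return {
        "title_id": m.group("title"),
        "date": m.group("date"),
        "page": int(m.group("page")),
    }
```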

In both cases, coordinates are in the corresponding pages' OCR XML files and exist at the region, line and token level. As mentioned above, they are in the format (left, top, right, bottom) instead of the desired (left, top, width, height).
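The coordinate conversion mentioned above is a one-liner per box; a minimal sketch:

```python
def to_ltwh(box):
    """Convert a (left, top, right, bottom) box to (left, top, width, height).

    The OCR files here give (l, t, r, b); the canonical format
    expects (l, t, w, h).
    """
    left, top, right, bottom = box
    return (left, top, right - left, bottom - top)

# Example: a region spanning x 100..250 and y 40..90
# becomes (100, 40, 150, 50).
```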

Authentication and iiif access – open questions

theophilenaito commented 10 months ago

Thank you for these notes. Regarding the missing credentials for Journal de Pully, it seems to me that you are using a page identifier instead of the issue identifier (660907). We will work on the other open questions at the beginning of December, in order to have answers for the meeting planned on the 11th of December. I hope this suits you.

piconti commented 10 months ago

Small update on the requests that are sometimes authorized and sometimes not, @theophilenaito.

Over the last few days, I have used various Scriptorium IIIF URIs, and a pattern has emerged in when I am authorized or unauthorized to access the Presentation or Image API. The first browser access "of the day" (probably simply a timeout somewhere) is systematically unauthorized, with a link to Scriptorium's homepage proposed on the error page. At this stage, none of the links work, all failing with the same error. However, if I open the homepage link (e.g. in a new tab), wait for it to load, and then reload the original unauthorized page, the content shows normally, along with any other content for which I was previously unauthorized. From then on, in my experience of the last few days, no issues occur for the rest of the day's usage, but the problem comes back the next day, which is why I suspect a timeout.

Note: I have also queried the API programmatically via HTTP requests and encountered no issues so far, but I'm not sure whether this was prior to my "first access of the day".
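For the programmatic access, a simple retry on 401 would work around the "first access of the day" rejection described above. A hedged sketch, with the HTTP call abstracted behind a callable so the retry logic is independent of any particular client library:

```python
import time

def fetch_with_retry(get, url, attempts=3, wait=5.0):
    """Retry a IIIF request while the server answers 401 Unauthorized.

    `get` is any callable returning (status_code, body); in practice it
    would wrap an HTTP client. This models the observed behaviour where
    the first access "of the day" is rejected but later ones succeed.
    """
    for attempt in range(attempts):
        status, body = get(url)
        if status != 401:
            return body
        if attempt < attempts - 1:
            time.sleep(wait)  # give the server a moment before retrying
    raise RuntimeError(f"Still unauthorized after {attempts} attempts: {url}")
```

A fake `get` that rejects the first call and accepts the second reproduces the observed pattern and exercises the retry path.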

I hope this helps identify what could be causing this. Is there perhaps a condition somewhere that verifies the user is accessing the contents via the Scriptorium website?

piconti commented 7 months ago

Update now that the API has been fixed: the Scriptorium IIIF API works, but takes very long to respond. Even with timeouts of 30 seconds, requests still fail with timeout errors. This causes a significant slowdown in the processing of the pages, which means that ingesting BCUL data will take much longer than for other providers. I will continue my small tests to identify the lowest workable timeout.
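The timeout tuning could be approached empirically, by recording observed response latencies and deriving a timeout from the slowest one plus a safety margin. A hypothetical heuristic sketch (not the actual importer code):

```python
def suggest_timeout(latencies_s, safety_factor=1.5):
    """Pick a request timeout from observed IIIF response latencies.

    A simple heuristic: take the slowest observed latency (in seconds)
    and multiply by a safety margin, so occasional slow responses do
    not trip the timeout.
    """
    if not latencies_s:
        raise ValueError("Need at least one observed latency")
    return round(max(latencies_s) * safety_factor, 1)

# e.g. observed latencies of 8s, 21s and 35s suggest a timeout of 52.5s
```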

Ingestion of all titles for the pilot data (4083 issues in the first batch) will follow and will give us an idea of the time necessary for the full ingestion.

theophilenaito commented 7 months ago

Thank you for this information, and I am sorry about that. At the moment, Scriptorium in general takes very long to respond, unfortunately. We are working on it (we will focus on the IIIF API), but I can't give a timeframe for a solution. Thank you for your tests.

piconti commented 7 months ago

Hi @theophilenaito, thank you for your response. I found a way to reduce the number of requests needed, so things now run fast enough and are no longer as subject to issues!

I'm sorry, I should have added another update on here. We have started the ingestion for several of the pilot titles and the others will soon follow!

While a slightly faster API response would be very nice, it is not strictly necessary and won't be a limiting factor for the ingestion of BCUL data.

If I encounter any new problems with the API, I'll let you know.