Closed e-maud closed 6 months ago
Thank you for this breakdown.
From what I understand, the way to gather the necessary information for the IIIF depends on the type of sample:
Samples where the mit file is in XML, and each page filename contains the title ID, date and page number (e.g. 46165):
https://scriptorium.bcu-lausanne.ch/api/iiif/{ISSUE_DIR_NAME}/manifest
manifest['sequences'][0]['canvases'][{PAGE_NUMBER}]['images'][0]['resource']['@id'].
https://scriptorium.bcu-lausanne.ch/api/iiif-img/{PAGE_ID}/full/300,/0/default.jpg (here 2718680 is the first page's ID, yielding https://scriptorium.bcu-lausanne.ch/api/iiif-img/2718680/full/300,/0/default.jpg).
manifest['sequences'][0]['canvases'][{PAGE_NUMBER}]['@id'], only keeping the last part of the URL.
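The lookup described above can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, `page_index` is assumed to be a 0-based position in the canvases list, and the canvas `@id` is assumed to end with the page ID (the layout of that URL is not confirmed in this thread, and the sample manifest in the usage note is made up).

```python
BASE = "https://scriptorium.bcu-lausanne.ch/api"


def manifest_url(issue_dir_name: str) -> str:
    """Presentation API manifest URL for one issue."""
    return f"{BASE}/iiif/{issue_dir_name}/manifest"


def page_id_from_manifest(manifest: dict, page_index: int) -> str:
    """Return the last path segment of the canvas '@id' for one page.

    Assumption: that last segment is the page ID used by the Image API.
    """
    canvas_id = manifest["sequences"][0]["canvases"][page_index]["@id"]
    return canvas_id.rstrip("/").split("/")[-1]


def image_url(page_id: str, width: int = 300) -> str:
    """Image API URL for the full page, scaled to the given width."""
    return f"{BASE}/iiif-img/{page_id}/full/{width},/0/default.jpg"
```

For example, with a hand-written manifest `{"sequences": [{"canvases": [{"@id": "https://example.org/canvas/2718680"}]}]}`, `page_id_from_manifest(manifest, 0)` gives `"2718680"`, and `image_url("2718680")` gives the first-page URL quoted above.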
Samples where the mit file is in JSON, and the page OCR XML filenames are numeric IDs (e.g. 388793 or 660907):
In both cases, the coordinates are in the corresponding page OCR XML files, and exist at the region, line and token level.
As mentioned above, they are in the format (left, top, right, bottom), instead of the desired (left, top, width, height).
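A minimal sketch of the conversion between the two formats, assuming the four values are integer pixel coordinates; the function name `to_xywh` is hypothetical:

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def to_xywh(box: Box) -> Box:
    """Convert (left, top, right, bottom) to (left, top, width, height)."""
    left, top, right, bottom = box
    if right < left or bottom < top:
        raise ValueError(f"not a (left, top, right, bottom) box: {box}")
    return (left, top, right - left, bottom - top)
```

For a box such as `(831, 910, 1277, 949)`, this yields `(831, 910, 446, 39)`.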
Authentication and iiif access – open questions
full/300,/0/default.jpg (or with other values instead of 300) and also require authentication when using full/full/0/default.jpg
Thank you for these notes. With respect to the missing credentials for Journal de Pully, it seems to me that you are using a page identifier instead of the issue identifier (660907). We will work on the other open questions at the beginning of December in order to have answers for the meeting planned on the 11th of December. I hope this suits you.
Small update on the requests that are sometimes authorized and sometimes not, @theophilenaito.
During the last few days, I have used various Scriptorium IIIF URIs, and a pattern has emerged in when I am authorized or unauthorized to access the Presentation or Image API. It seems that the first browser access "of the day" (probably simply a timeout somewhere) is systematically unauthorized, with a link to Scriptorium's homepage proposed on the error page. At this stage, none of the links work, and all fail with the same error. However, if I click on the homepage link (e.g. in a new tab), wait for it to load, and then reload the original unauthorized page, the content shows normally, along with any other content for which I was unauthorized. After that, in my experience of the last few days, no issues occur for the rest of my usage, but the problem comes back the next day, which is why I think there is probably a timeout.
Note that I have also queried the API programmatically via HTTP requests and have encountered no issues so far, but I'm not sure whether that was before my "first access of the day".
I hope this helps identify what could be causing this. Perhaps there is some condition somewhere that verifies that the user is accessing the contents via the Scriptorium website?
Update now that the API has been fixed: the Scriptorium IIIF API works, but takes very long to respond. Even with timeouts set to 30 seconds, requests fail with timeout errors. This causes a significant slow-down in the processing of the pages, which means that ingesting BCUL data will take much longer than for other providers. I will continue my small tests to identify the lowest timeout possible.
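One way to probe for a workable timeout is to retry each request with progressively longer timeouts. The sketch below is an assumption about how such a test could look, not the code used for the ingestion; `fetch_with_retries` and `urllib_fetch` are hypothetical names, and the timeout values are placeholders.

```python
import time
import urllib.request
from typing import Callable, Sequence


def urllib_fetch(url: str, timeout: float) -> bytes:
    """Fetch a URL with the standard library, using the given timeout in seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retries(
    url: str,
    fetch: Callable[[str, float], bytes] = urllib_fetch,
    timeouts: Sequence[float] = (30, 60, 120),
    pause: float = 0.0,
) -> bytes:
    """Try the request once per timeout value, from shortest to longest."""
    last_error = None
    for timeout in timeouts:
        try:
            return fetch(url, timeout)
        except OSError as err:
            # TimeoutError and urllib's URLError are both OSError subclasses,
            # so other network errors are retried as well.
            last_error = err
            time.sleep(pause)
    raise TimeoutError(f"no response from {url} within {list(timeouts)} s") from last_error
```

Passing the fetch function as a parameter keeps the retry logic testable without touching the network.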
Ingestion of all titles for pilot data (4083 issues in the first batch) will follow, and give us an idea of the time necessary for the ingestion.
Thank you for this information, I am sorry for that. At the moment, Scriptorium in general takes very long to respond, unfortunately. We are working on it (we will focus on the IIIF API) but I can't give a timeframe for a solution. Thank you for your tests.
Hi @theophilenaito, thank you for your response. I found a way to reduce the number of requests needed, so now things run fast enough and are less prone to issues!
I'm sorry, I should have added another update on here. We have started the ingestion for several of the pilot titles and the others will soon follow!
While a slightly faster API response would be very nice, it's not strictly necessary and won't be a limit to the ingestion of BCUL data.
If I encounter any new problem with the API I'll let you know.
Basic info:
Objectives:
Some tests:
CASE Feuille d'Avis de Lausanne (FAL, 46165)
Look up the IIIF manifest from the title ID: https://scriptorium.bcu-lausanne.ch/api/iiif/46165/manifest
Page 1, with page ID 2718680 (id obtained via the manifest): https://scriptorium.bcu-lausanne.ch/api/iiif-img/2718680/full/300,/0/default.jpg
Ex 1 - region: 195,904,1796,2330
Ex 2 - line "VENTES": 831,910,1277,949
Ex 3 - char "V": 831,913,867,949
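To check such a box visually, the IIIF Image API accepts a region of the form `x,y,w,h` in place of `full`. The sketch below builds the URL for Ex 1, assuming the OCR coordinates are in the pixel space of the full-size image (not confirmed here) and that region requests are served without credentials; `region_url` is a hypothetical helper.

```python
from typing import Tuple

IMG_BASE = "https://scriptorium.bcu-lausanne.ch/api/iiif-img"


def region_url(page_id: str, ltrb: Tuple[int, int, int, int], size: str = "300,") -> str:
    """Build an Image API URL cropping a (left, top, right, bottom) box."""
    left, top, right, bottom = ltrb
    # IIIF regions are expressed as x,y,width,height.
    region = f"{left},{top},{right - left},{bottom - top}"
    return f"{IMG_BASE}/{page_id}/{region}/{size}/0/default.jpg"


# Ex 1 (region) on page 2718680
print(region_url("2718680", (195, 904, 1796, 2330)))
```

This prints `https://scriptorium.bcu-lausanne.ch/api/iiif-img/2718680/195,904,1601,1426/300,/0/default.jpg`.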
Open questions
Some requests require authentication:
https://scriptorium.bcu-lausanne.ch/api/iiif-img/2718680/full/full/0/default.jpg requires credentials.
How can we get the page identifier part of the IIIF URL? Would it be possible to get it as part of the page .xml file, or as separate information in some other way? Otherwise it has to be reconstructed while parsing the data, so many queries will be sent to the IIIF endpoint, which might break.
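If the identifiers do have to be reconstructed from the manifests, the load on the endpoint can at least be limited to one request per issue by caching the result. This is a sketch under assumptions: `PageIdIndex` and the `fetch_manifest` callable are hypothetical, and the page ID is assumed to be the last segment of each canvas `@id`.

```python
from typing import Callable, Dict, List


class PageIdIndex:
    """Caches the page IDs of each issue so its manifest is fetched only once."""

    def __init__(self, fetch_manifest: Callable[[str], dict]) -> None:
        # fetch_manifest takes an issue identifier and returns the parsed manifest.
        self._fetch_manifest = fetch_manifest
        self._cache: Dict[str, List[str]] = {}

    def page_ids(self, issue_id: str) -> List[str]:
        """Return the page IDs of an issue, in canvas order."""
        if issue_id not in self._cache:
            manifest = self._fetch_manifest(issue_id)
            canvases = manifest["sequences"][0]["canvases"]
            self._cache[issue_id] = [
                canvas["@id"].rstrip("/").split("/")[-1] for canvas in canvases
            ]
        return self._cache[issue_id]
```

With this, parsing every page of an issue triggers a single manifest request instead of one per page.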
CASE Journal de Pully (JPU) https://scriptorium.bcu-lausanne.ch/api/iiif/8228108/manifest => missing credentials