impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

Check BCUL IIIF endpoint #119

Closed e-maud closed 6 months ago

e-maud commented 10 months ago

Basic info:

Objectives:

Some tests:

CASE Feuille d'Avis de Lausanne (FAL, 46165)

Open questions

piconti commented 10 months ago

Thank you for this breakdown.

From what I understand, the way to gather the necessary information for the IIIF depends on the type of sample:

Samples where the mit file is in XML, and each page filename contains the titleId, date and page number (e.g. 46165):

Samples where the mit file is in JSON, and the page OCR XML filenames are numeric IDs (e.g. 388793 or 660907):
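As a side note on the first sample type: the titleId, date and page number could be recovered from the filenames with a small parser. The pattern below (`<titleId>_<date>_<page>.xml`) is only a guess for illustration; the real BCUL naming scheme may differ.

```python
import re

# Hypothetical filename pattern "<titleId>_<YYYY-MM-DD>_<page>.xml";
# the actual BCUL convention may differ.
PAGE_RE = re.compile(r"(?P<title>\d+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<page>\d+)\.xml")

def parse_page_filename(name: str) -> dict:
    """Extract titleId, date and page number from an OCR page filename."""
    m = PAGE_RE.match(name)
    if m is None:
        raise ValueError(f"Unexpected filename: {name}")
    return {
        "title_id": m.group("title"),
        "date": m.group("date"),
        "page": int(m.group("page")),
    }
```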

In both cases, coordinates are in the corresponding pages' OCR XML files and exist at the region, line and token level. As mentioned above, they are in the format (left, top, right, bottom) instead of the desired (left, top, width, height).
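The coordinate conversion mentioned above is a one-liner per box; a minimal sketch:

```python
def to_ltwh(box):
    """Convert a (left, top, right, bottom) box to (left, top, width, height).

    The OCR files here give (l, t, r, b); the canonical format
    expects (l, t, w, h).
    """
    left, top, right, bottom = box
    return (left, top, right - left, bottom - top)

# Example: a region spanning x 100..250 and y 40..90
# becomes (100, 40, 150, 50).
```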

Authentication and iiif access – open questions

theophilenaito commented 10 months ago

Thank you for these notes. Regarding the missing credentials for Journal de Pully, it seems to me that you are using a page identifier instead of the issue identifier (660907). We will work on the other open questions at the beginning of December, in order to have answers for the meeting planned on the 11th of December. I hope this suits you.

piconti commented 10 months ago

Small update on the requests that are sometimes authorized and sometimes not, @theophilenaito.

Over the last few days, I have used various Scriptorium IIIF URIs, and a pattern has emerged in when I am authorized or unauthorized to access the Presentation or Image API. The first browser access "of the day" (probably simply a timeout somewhere) is systematically unauthorized, with a link to Scriptorium's homepage proposed on the error page. At this stage, none of the links work, all failing with the same error. However, if I open the homepage link (e.g. in a new tab), wait for it to load, and then reload the original unauthorized page, the content shows normally, along with any other content for which I was previously unauthorized. From then on, in my experience of the last few days, no issues occur for the rest of the day's usage, but the problem comes back the next day, which is why I suspect a timeout.

Note: I have also queried the API programmatically via HTTP requests and encountered no issues so far, but I'm not sure whether this was prior to my "first access of the day".
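For the programmatic access, a simple retry on 401 would work around the "first access of the day" rejection described above. A hedged sketch, with the HTTP call abstracted behind a callable so the retry logic is independent of any particular client library:

```python
import time

def fetch_with_retry(get, url, attempts=3, wait=5.0):
    """Retry a IIIF request while the server answers 401 Unauthorized.

    `get` is any callable returning (status_code, body); in practice it
    would wrap an HTTP client. This models the observed behaviour where
    the first access "of the day" is rejected but later ones succeed.
    """
    for attempt in range(attempts):
        status, body = get(url)
        if status != 401:
            return body
        if attempt < attempts - 1:
            time.sleep(wait)  # give the server a moment before retrying
    raise RuntimeError(f"Still unauthorized after {attempts} attempts: {url}")
```

A fake `get` that rejects the first call and accepts the second reproduces the observed pattern and exercises the retry path.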

I hope this helps identify what could be causing this. Is there perhaps a condition somewhere that verifies the user is accessing the contents via the Scriptorium website?

piconti commented 7 months ago

Update now that the API has been fixed: the Scriptorium IIIF API works, but takes very long to respond. Even with timeouts of 30 seconds, requests still fail with timeout errors. This causes a significant slowdown in the processing of the pages, which means that ingesting BCUL data will take much longer than for other providers. I will continue my small tests to identify the lowest workable timeout.
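The timeout tuning could be approached empirically, by recording observed response latencies and deriving a timeout from the slowest one plus a safety margin. A hypothetical heuristic sketch (not the actual importer code):

```python
def suggest_timeout(latencies_s, safety_factor=1.5):
    """Pick a request timeout from observed IIIF response latencies.

    A simple heuristic: take the slowest observed latency (in seconds)
    and multiply by a safety margin, so occasional slow responses do
    not trip the timeout.
    """
    if not latencies_s:
        raise ValueError("Need at least one observed latency")
    return round(max(latencies_s) * safety_factor, 1)

# e.g. observed latencies of 8s, 21s and 35s suggest a timeout of 52.5s
```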

Ingestion of all titles for the pilot data (4083 issues in the first batch) will follow and will give us an idea of the time necessary for the full ingestion.

theophilenaito commented 7 months ago

Thank you for this information, and I am sorry about that. At the moment, Scriptorium in general takes very long to respond, unfortunately. We are working on it (we will focus on the IIIF API), but I can't give a timeframe for a solution. Thank you for your tests.

piconti commented 7 months ago

Hi @theophilenaito, thank you for your response. I found a way to reduce the number of requests needed, so things now run fast enough and are no longer as subject to issues!

I'm sorry, I should have added another update on here. We have started the ingestion for several of the pilot titles and the others will soon follow!

While a slightly faster API response would be very nice, it is not strictly necessary and won't be a limiting factor for the ingestion of BCUL data.

If I encounter any new problems with the API, I'll let you know.