impresso / impresso-text-acquisition

🛠️ Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

missing IIIF links for some BNF-EN newspaper issue #102

Open mromanello opened 4 years ago

mromanello commented 4 years ago

(issue reported by @ehoelzl)

Note from @ehoelzl on retrieval of IIIF links:

Basically, there is one URL per journal (e.g. https://gallica.bnf.fr/services/Issues?ark={ark}/date), where the ark differs for each journal. This URL returns all the years available for that journal. We can then query each year (https://gallica.bnf.fr/services/Issues?ark={ark}/date&date={year}) to get the ark link of each issue. Examples: https://gallica.bnf.fr/services/Issues?ark=cb39294634r/date and https://gallica.bnf.fr/services/Issues?ark=cb39294634r/date&date=1932
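
The two-step lookup described above could be sketched roughly as follows. This is only an illustration, not the importer's actual code: the function names (`years_url`, `issues_url`, `fetch`) are hypothetical, and since the response format is not described in this thread, `fetch` simply returns the raw body.

```python
from urllib.request import urlopen

# Endpoint taken from the example URLs quoted in this thread.
GALLICA_ISSUES_ENDPOINT = "https://gallica.bnf.fr/services/Issues"


def years_url(ark: str) -> str:
    """URL listing all the years available for one journal (hypothetical helper)."""
    return f"{GALLICA_ISSUES_ENDPOINT}?ark={ark}/date"


def issues_url(ark: str, year: int) -> str:
    """URL listing the issues of one journal for a given year (hypothetical helper)."""
    return f"{years_url(ark)}&date={year}"


def fetch(url: str, timeout: float = 30.0) -> str:
    """Return the raw response body; parsing is left out on purpose,
    because the response format is not documented in this thread."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    ark = "cb39294634r"  # the journal ark used in the examples above
    print(years_url(ark))
    print(issues_url(ark, 1932))
```

Building the URLs separately from fetching them makes it easy to log exactly which request failed when an issue's ark cannot be found.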


jdpl-1932-12-30-b
jdpl-1932-12-30-c
jdpl-1933-07-01-b
legaulois-1899-08-07-b
legaulois-1899-08-14-b
legaulois-1899-08-16-b
legaulois-1899-08-17-b
legaulois-1899-08-18-b
legaulois-1899-08-19-b
legaulois-1899-08-21-b
legaulois-1899-08-25-b
legaulois-1899-08-26-b
legaulois-1899-08-28-b
legaulois-1899-08-29-b
legaulois-1899-08-30-b
legaulois-1899-08-31-b
legaulois-1899-09-04-b
legaulois-1899-09-05-b
legaulois-1899-09-07-b
legaulois-1899-09-09-b
lepetitparisien-1896-03-12-a
oecaen-1914-08-26-a
oecaen-1914-08-27-a
oecaen-1914-08-28-a
oecaen-1914-08-29-a
oecaen-1914-08-30-a
oecaen-1914-08-31-a
oecaen-1939-08-29-a
oecaen-1940-04-24-a
oecaen-1940-04-25-a
oerennes-1901-12-19-a
oerennes-1901-12-30-a
lepji-1896-05-24-a
lepji-1909-12-05-a

mromanello commented 4 years ago

This may have been caused by a performance problem on BNF's IIIF servers, so it may disappear when the importer is run again.

piconti commented 1 year ago

Update on the current availability of these issues on the Gallica IIIF API.

It seems that although the issues and the corresponding files exist in our data, they are simply missing from the Gallica online collection (they cannot be found through the API or on the website). For an individual issue this may be due to an error internal to Gallica or BNF, but either way it is not possible to access the IIIF link or the ark ID.

This seems to be the case regardless of the edition: different editions are missing, and there are examples of accessible second or third editions (see the webapp, the API output and the corresponding images 1, 2). This is not the case for jdpl-1932-12-30, where editions b and c are not available (see: https://gallica.bnf.fr/ark:/12148/cb39294634r/date19331230).

In conclusion: re-ingesting the data will not solve this issue as long as it is not fixed on Gallica's side. Since the issues are still unavailable after three years, their unavailability is probably not due to a simple technical glitch, unless it has gone unnoticed. It would make sense to communicate this list of issues to them and to keep track of how the list evolves.
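
To keep track of the list, one option would be to group the missing issue IDs by journal and year, so that each group can be re-checked with a single per-year query. The sketch below is only an assumption about how this could be done: it relies on the IDs following the `journal-YYYY-MM-DD-edition` pattern seen in the list above, and the names `IssueId`, `parse_issue_id` and `group_by_journal_year` are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class IssueId(NamedTuple):
    journal: str
    day: date
    edition: str


def parse_issue_id(issue_id: str) -> IssueId:
    """Split an ID such as 'jdpl-1932-12-30-b' into its parts.

    Splitting from the right keeps any hyphen in the journal alias intact.
    """
    journal, year, month, day, edition = issue_id.rsplit("-", 4)
    return IssueId(journal, date(int(year), int(month), int(day)), edition)


def group_by_journal_year(issue_ids):
    """Map (journal, year) to the list of missing issue IDs in that year."""
    groups = defaultdict(list)
    for raw in issue_ids:
        parsed = parse_issue_id(raw)
        groups[(parsed.journal, parsed.day.year)].append(raw)
    return dict(groups)


if __name__ == "__main__":
    # A few IDs copied from the list in this thread.
    missing = [
        "jdpl-1932-12-30-b",
        "jdpl-1932-12-30-c",
        "jdpl-1933-07-01-b",
        "lepetitparisien-1896-03-12-a",
    ]
    for (journal, year), ids in sorted(group_by_journal_year(missing).items()):
        print(journal, year, ids)
```

Each `(journal, year)` key then corresponds to one per-year request, which keeps the number of calls to Gallica small when the list is re-checked periodically.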