lovasoa / dezoomify

Dezoomify is a web application to download zoomable images from museum websites, image galleries, and map viewers. Many different zoomable image technologies are supported.
https://dezoomify.ophir.dev
GNU General Public License v2.0

[new site support] please support IIIF manifests to download the full scanned documents #416

Open jsbien opened 4 years ago

jsbien commented 4 years ago

Site name and description

Polish national digital library: https://polona.pl/

Example URLs

A sample manifest: https://polona.pl/iiif/item/MTI2MzI0NjU/manifest.json

Current error message

There is no problem with downloading a single page, but my goal is to download a multivolume dictionary of about 5 thousand pages. I had a look at https://github.com/intranda/goobi-iiif-downloader but have no idea how to use it.

lovasoa commented 4 years ago

Hello. By design, Dezoomify downloads only a single file at a time. If you want to automate the download of a large number of images, have a look at dezoomify-rs. It is a command-line tool (also developed by me) that you can combine with other tools to build more complex workflows. For instance, you can solve your problem with the following command line:

curl "https://polona.pl/iiif/item/MTI2MzI0NjU/manifest.json" | jq -r ".items[].id" | xargs -n 1 dezoomify-rs -l

It uses curl to download the manifest, jq to extract the list of zoomable image URLs, and xargs to launch multiple instances of dezoomify-rs, each one downloading a single image.

This command line can be run in a bash shell. If you are using Windows or macOS, just run it in a terminal. On Windows, you can use WSL.
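Before starting a download of several thousand pages, it can help to dry-run the last stage of the pipeline. The following is only a minimal sketch: the function name `download_each` and the `DEZOOMIFY` environment variable are made up for this example and are not something dezoomify-rs itself provides. Note that `xargs` without `-P` runs the instances one after another, not in parallel.

```shell
# download_each: read image URLs on stdin and run the downloader once per URL.
# DEZOOMIFY is a hypothetical override introduced for this sketch; set it to
# `echo` to print the commands instead of running them.
download_each() {
  # -n 1 passes exactly one URL to each invocation; without -P, xargs
  # waits for each invocation to finish before starting the next one.
  xargs -n 1 "${DEZOOMIFY:-dezoomify-rs}" -l
}

# Dry run (prints one "-l <url>" line per image):
#   curl "https://polona.pl/iiif/item/MTI2MzI0NjU/manifest.json" | jq -r ".items[].id" | DEZOOMIFY=echo download_each
# Real run:
#   curl "https://polona.pl/iiif/item/MTI2MzI0NjU/manifest.json" | jq -r ".items[].id" | download_each
```

If the dry run prints nothing, the jq path most likely does not match the structure of the manifest.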

Spinozabento commented 4 years ago

Hi, can IIIF manifests from the British Library Endangered Archives Programme, like https://eap.bl.uk/archive-file/EAP790-14-1/manifest, be retrieved with the above-mentioned command line, or do they have to be formatted into a urls.txt file for batch/bash scripts, as you have mentioned here?

lovasoa commented 4 years ago

Hi. Yes, in the same way as above, you can extract the list of URLs and then launch dezoomify-rs on each one. You just have to adapt the path inside the jq command to your case:

curl "https://eap.bl.uk/archive-file/EAP790-14-1/manifest" | jq -r '.sequences[].canvases[].images[].resource.service."@id" + "/info.json"' | xargs -n 1 dezoomify-rs -l 
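The jq path differs between the two commands because the manifests are laid out differently: the path used here (`sequences`, `canvases`, `images`, `resource`, `service`) follows the IIIF Presentation API 2 layout. As a rough sketch, the filter can be wrapped in a function and tried on a small hand-written manifest before pointing it at a real one. The function name `iiif2_info_urls` is made up for this example.

```shell
# iiif2_info_urls: read a manifest in the IIIF Presentation 2 layout on stdin
# and print one info.json URL per image. The function name is hypothetical;
# the jq filter is the one from the command above.
iiif2_info_urls() {
  jq -r '.sequences[].canvases[].images[].resource.service."@id" + "/info.json"'
}

# To work out the path for a different manifest, start from its top-level keys
# and walk down step by step, for example:
#   curl "https://eap.bl.uk/archive-file/EAP790-14-1/manifest" | jq 'keys'
#   curl "https://eap.bl.uk/archive-file/EAP790-14-1/manifest" | jq '.sequences[0].canvases[0]'
```

Usage: `curl "https://eap.bl.uk/archive-file/EAP790-14-1/manifest" | iiif2_info_urls > urls.txt` saves the list to a file so it can be checked before downloading anything.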

And if you want to avoid overwhelming their server with too many requests, you can add the following parameters:

dezoomify-rs -l --parallelism 1 --timeout 60s --retry-delay 10s 

This will make the download slower, but more reliable.
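For a very long list of pages, one possible way to combine these parameters with the URL list is a plain loop that pauses between images and remembers which URLs failed. This is only a sketch under assumptions: `download_politely`, `DEZOOMIFY`, `PAUSE` and `FAILED_LIST` are names invented for this example, and the default pause of 2 seconds is an arbitrary choice.

```shell
# download_politely: read image URLs on stdin and download them one by one.
# DEZOOMIFY, PAUSE and FAILED_LIST are hypothetical variables for this sketch,
# not options understood by dezoomify-rs.
download_politely() {
  local url
  while IFS= read -r url; do
    # Skip empty lines in the URL list.
    [ -n "$url" ] || continue
    # </dev/null keeps the downloader from reading the URL list on stdin.
    if ! "${DEZOOMIFY:-dezoomify-rs}" -l --parallelism 1 --timeout 60s --retry-delay 10s "$url" </dev/null; then
      # Record the failure and carry on with the next image.
      printf '%s\n' "$url" >> "${FAILED_LIST:-failed.txt}"
    fi
    # Wait a little between images to stay gentle on the server.
    sleep "${PAUSE:-2}"
  done
}
```

Usage: `download_politely < urls.txt`, where urls.txt holds one image URL per line. Afterwards, `download_politely < failed.txt` retries only the images that failed (move failed.txt aside first if you want a fresh failure list).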

Spinozabento commented 4 years ago

Thanks a lot for the code.

jsbien commented 3 years ago

I finally tried your suggestions, thank you very much! I will make some comments on the dezoomify-rs site.