internetarchive / iiif

The official Internet Archive IIIF service
GNU General Public License v3.0

Issue with multiple page item from IIIF training #46

Closed glenrobson closed 3 months ago

glenrobson commented 7 months ago

This came up in the IIIF training:

https://archive.org/details/st-anthony-relics-01/

It contains 5 images, but the v3 manifest contains only 1 image:

https://iiif.archive.org/iiif/3/st-anthony-relics-01/manifest.json

and the v2 manifest doesn't work:

https://iiif.archive.org/iiif/2/st-anthony-relics-01/manifest.json

digitaldogsbody commented 7 months ago

This is caused by the same issue we were discussing on Slack with Sara: our code expects an item of type image to be just a single image, and anything with multiple images to use texts and the <identifier>_images.zip file construction.

Should be an easy enough fix: we can just check how many image files with type: original there are in the metadata when processing an image type record.
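A minimal sketch of that check, not the actual iiify code. It assumes the item metadata has the shape shown in the excerpt later in this thread, where the "original" marker is carried in the `source` field. The `original_images` helper, the `IMAGE_FORMATS` set, the `format` values given to the JPEG files, and the derivative entry are all hypothetical and only there for illustration.

```python
# Assumed set of "format" strings that count as still images; adjust as needed.
IMAGE_FORMATS = {"JPEG", "PNG", "TIFF", "GIF", "JPEG 2000"}


def original_images(metadata):
    """Return the uploaded (source == "original") files that look like images."""
    return [
        f
        for f in metadata.get("files", [])
        if f.get("source") == "original" and f.get("format") in IMAGE_FORMATS
    ]


# Illustrative sample data; the "format" values for the JPEGs are assumptions.
metadata = {
    "files": [
        {"name": "ASB-Consorzio.pdf", "source": "original",
         "format": "Image Container PDF"},
        {"name": "StAnthony-Relics_01.jpeg", "source": "original",
         "format": "JPEG"},
        {"name": "Cadore-Becher_1998.jpg", "source": "original",
         "format": "JPEG"},
        # Hypothetical derivative entry, to show that derivatives are skipped.
        {"name": "example_thumb.jpg", "source": "derivative",
         "format": "JPEG Thumb"},
    ]
}

images = original_images(metadata)
# More than one original image means the manifest needs one canvas per image.
is_multi_image = len(images) > 1
```

Filtering on both `source` and `format` also keeps the PDFs in this item out of the image list.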

digitaldogsbody commented 7 months ago

Although, looking at the item in question, I think the mediatype might be wrong. In addition to the static images in the top-level directory, there are also zip files with JP2s that look as if they have been processed by the BookReader code. Because the mediatype is not texts, you can't access them in the IA interface (and we would never expose them via IIIF).

digitaldogsbody commented 7 months ago

The images for these zipped items are available via Cantaloupe: https://iiif.archive.org/image/iiif/3/st-anthony-relics-01%2fASB-Consorzio_jp2.zip%2fASB-Consorzio_jp2%2FASB-Consorzio_0000.jp2/full/max/0/default.jpg
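For reference, a rough sketch of how a URL like this can be put together: the item identifier, the zip name, and the path inside the zip are joined with slashes, and the whole thing is percent-encoded into a single path segment. The `image_url` helper and its parameter names are hypothetical; only the base URL and the path parts come from the link above.

```python
from urllib.parse import quote

# Base of the image endpoint, taken from the URL in this comment.
IMAGE_BASE = "https://iiif.archive.org/image/iiif/3"


def image_url(identifier, *path_parts, region="full", size="max",
              rotation="0", quality="default", fmt="jpg"):
    """Build an image request URL with the slashes in the identifier encoded."""
    # safe="" makes quote() encode "/" as %2F so the identifier stays one segment.
    image_id = quote("/".join((identifier, *path_parts)), safe="")
    return f"{IMAGE_BASE}/{image_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


url = image_url(
    "st-anthony-relics-01",
    "ASB-Consorzio_jp2.zip",
    "ASB-Consorzio_jp2",
    "ASB-Consorzio_0000.jp2",
)
print(url)
```

`quote` emits uppercase `%2F`, so the result differs from the link above only in the case of the hex digits.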

digitaldogsbody commented 7 months ago

The V2 manifest issue may be related. The code downloads the item in order to open it with PIL and get dimension data etc., but it is quite naive: it ends up downloading the first original file from the item, which here is one of the PDFs, so PIL errors out:

[2023-12-14 11:01:28,724] ERROR in app: Exception on /iiif/2/st-anthony-relics-01/manifest.json [GET]
Traceback (most recent call last):
  File "/home/mike/projects/iiif.archive.org/venv/lib/python3.10/site-packages/flask/app.py", line 2528, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/mike/projects/iiif.archive.org/venv/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/mike/projects/iiif.archive.org/venv/lib/python3.10/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/home/mike/projects/iiif.archive.org/venv/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/mike/projects/iiif.archive.org/venv/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/mike/projects/iiif.archive.org/iiify/app.py", line 208, in manifest2
    return ldjsonify(create_manifest(identifier, domain=domain, page=page))
  File "/home/mike/projects/iiif.archive.org/iiify/resolver.py", line 208, in create_manifest
    info = web.info(domain, path)
  File "/home/mike/projects/iiif.archive.org/venv/lib/python3.10/site-packages/iiif2/web.py", line 32, in info
    w, h = Image.open(path).size
  File "/home/mike/projects/iiif.archive.org/venv/lib/python3.10/site-packages/PIL/Image.py", line 3283, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file 'media/st-anthony-relics-01'

This is the file that it downloads:

(venv) mike@revelator:~/projects/iiif.archive.org$ ls -lta media/st-anthony-relics-01 
-rw-rw-r-- 1 mike mike 8653052 Dec 14 11:00 media/st-anthony-relics-01
(venv) mike@revelator:~/projects/iiif.archive.org$ file media/st-anthony-relics-01 
media/st-anthony-relics-01: PDF document, version 1.3

And the matching type and bytesize from the metadata:

{
  "created": 1702519976,
  "d1": "ia601202.us.archive.org",
  "d2": "ia801202.us.archive.org",
  "dir": "/22/items/st-anthony-relics-01",
  "files": [
    {
      "name": "ASB-Consorzio.pdf",
      "source": "original",
      "mtime": "1702313965",
      "size": "8653052",
      "md5": "4f9f26a566c797410ebd05b58596e8de",
      "crc32": "ad52676c",
      "sha1": "6e482b42e9d670088fa4816f235345f0b195c6e8",
      "format": "Image Container PDF",
      "viruscheck": "1702314437"
    },
<snip>

So I think we should deal with multiple images in an image mediatype object (regardless of whether this object should actually be texts), but the V2 issue is a wontfix.
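To illustrate the V2 failure mode above (a PDF being handed to PIL), here is a small stdlib-only sketch that sniffs a downloaded file's leading bytes before trying to open it as an image. This is not what the V2 code does, and `sniff_format` and `is_probably_image` are hypothetical names; the byte signatures are the standard ones for each format.

```python
import os
import tempfile

# Leading "magic" bytes for a few formats relevant to this item.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\x00\x00\x00\x0cjP  \r\n\x87\n": "jp2",
}


def sniff_format(path):
    """Return a short format name based on the file's first bytes, or None."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def is_probably_image(path):
    """True if the file starts with a known still-image signature."""
    return sniff_format(path) in {"jpeg", "png", "jp2"}


# Demo with a stand-in for the downloaded file: a few bytes of a PDF header.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"%PDF-1.3\n")
    demo_path = tmp.name

detected = sniff_format(demo_path)
looks_like_image = is_probably_image(demo_path)
os.unlink(demo_path)
```

A guard like this would let the caller skip the file and try the next original one instead of raising `UnidentifiedImageError`.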

glenrobson commented 6 months ago

Duplicate of #52

glenrobson commented 5 months ago

The images for these zipped items are available via Cantaloupe: https://iiif.archive.org/image/iiif/3/st-anthony-relics-01%2fASB-Consorzio_jp2.zip%2fASB-Consorzio_jp2%2FASB-Consorzio_0000.jp2/full/max/0/default.jpg

Strangely, these are different images from the ones shown on the Internet Archive page: https://archive.org/details/st-anthony-relics-01/

The fix for multiple files seems to have worked for this one, but Cantaloupe doesn't like the image files:

Failed to get https://iiif.archive.org/image/iiif/3/st-anthony-relics-01%2fStAnthony-Relics_01.jpeg
Failed to get https://iiif.archive.org/image/iiif/3/st-anthony-relics-01%2fStAnthony-Relics_02.jpeg
Failed to get https://iiif.archive.org/image/iiif/3/st-anthony-relics-01%2fStAnthony-Relics_03.jpeg

These requests return "Unsupported source format". The following two work OK:

https://iiif.archive.org/image/iiif/3/st-anthony-relics-01%2FAuronzo-ComuneCortina.jpeg
https://iiif.archive.org/image/iiif/3/st-anthony-relics-01%2FCadore-Becher_1998.jpg

glenrobson commented 4 months ago

Re-target as a Cantaloupe issue.

glenrobson commented 3 months ago

This is a very complicated example of:

https://github.com/internetarchive/iiif/issues/12

It contains a number of PDF documents and jpg images. Currently only the PDFs are shown in the Internet Archive viewer.