huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add support for IIIF in datasets #4041

Open davanstrien opened 2 years ago

davanstrien commented 2 years ago

This is a feature request for IIIF support in datasets. Apologies for the long issue. I have also used a different format from the usual feature request template since I think that makes more sense here, but I'm happy to use the standard template if preferred.

What is IIIF?

IIIF (International Image Interoperability Framework)

is a set of open standards for delivering high-quality, attributed digital objects online at scale. It’s also an international community developing and implementing the IIIF APIs. IIIF is backed by a consortium of leading cultural institutions.

The tl;dr is that IIIF provides various specifications for implementing useful functionality around delivering, manipulating, and annotating images over the web.

Some institutions with various levels of IIIF support include the British Library, the Internet Archive, the Library of Congress, and Wikidata. There are also many smaller institutions with IIIF support. An incomplete list can be found here: https://iiif.io/guides/finding_resources/

IIIF APIs

IIIF consists of a number of APIs which could be integrated with datasets. I think the most obvious candidate for inclusion would be the Image API.

IIIF Image API

The Image API https://iiif.io/api/image/3.0/ is likely the most suitable first candidate for integration with datasets. The Image API offers a consistent protocol for requesting images via a URL:

{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

A concrete example of this:

https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/full/0/default.jpg

As you can see, the scheme offers a number of options that can be specified in the URL, for example, size. The example URL above returns the image at its full size.

We can request a size of 250 by 250 by changing the size segment from full to 250,250, i.e. switching the URL to https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/250,250/0/default.jpg

We can also request the image with a max width of 250 and a max height of 250 whilst maintaining the aspect ratio, using !w,h, i.e. changing the URL to https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/!250,250/0/default.jpg

A full overview of the options for size can be found here: https://iiif.io/api/image/3.0/#42-size
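For illustration, here is a minimal Python sketch of rewriting the size segment of a IIIF Image API URL; the helper name set_iiif_size is hypothetical, not part of any library:

# Minimal sketch: rewrite the {size} segment of a IIIF Image API URL.
# Assumes the URL follows {scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

def set_iiif_size(url: str, size: str) -> str:
    # split off the last three path segments: size, rotation, quality.format
    parts = url.rsplit("/", 3)
    if len(parts) != 4:
        raise ValueError(f"Not a IIIF Image API URL: {url}")
    parts[1] = size  # replace the size segment
    return "/".join(parts)

url = "https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/full/0/default.jpg"
print(set_iiif_size(url, "250,250"))   # exact 250x250
print(set_iiif_size(url, "!250,250"))  # best fit within 250x250, aspect ratio preserved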

Why would/could this be useful for datasets?

There are a few reasons why support for the IIIF Image API could be useful. Broadly, the ability to have more control over how an image is returned from a server (its region, size, rotation, quality, and format) is useful for many ML workflows.

This may become particularly useful when pre-training models on large image datasets, where the cost of downloading images at 1600 pixels wide when you actually want 240 has a larger impact.

What could this look like in datasets?

I think there are various ways in which support for IIIF could potentially be included in datasets. These suggestions aren't fully fleshed out but hopefully give a sense of possible approaches that fit the style of existing datasets methods.

Use through datasets scripts

Loading images via URL is already supported. There are a few possible 'extras' that could be included when using IIIF. One option is to leverage the IIIF protocol in dataset scripts, i.e. the dataset script can expose the IIIF options directly:

ds = load_dataset("iiif_dataset", image_size="250,250", fmt="jpg")

This is already possible. The approach to parsing the IIIF URLs would be left to the person creating the dataset script.
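As a rough illustration of what such a script could look like today, here is a minimal sketch of a loading script that exposes IIIF options via a BuilderConfig; the dataset name, the index file, and the URL-rewriting logic are all hypothetical:

# A minimal sketch of a loading script exposing IIIF options.
import datasets

_INDEX_URL = "https://example.org/iiif_urls.txt"  # one IIIF image URL per line (hypothetical)

class IIIFConfig(datasets.BuilderConfig):
    def __init__(self, image_size="full", fmt="jpg", **kwargs):
        super().__init__(**kwargs)
        self.image_size = image_size
        self.fmt = fmt

class IIIFDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = IIIFConfig

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"url": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        path = dl_manager.download(_INDEX_URL)
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"path": path})]

    def _generate_examples(self, path):
        with open(path, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                # rewrite the {size} and {format} segments with the config values
                base, region, _, rotation, tail = line.strip().rsplit("/", 4)
                quality = tail.rsplit(".", 1)[0]
                url = f"{base}/{region}/{self.config.image_size}/{rotation}/{quality}.{self.config.fmt}"
                yield idx, {"url": url}

With a script along these lines, the load_dataset call shown above would work as written, with the size rewriting handled entirely inside the script.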

Support through dataset scripts (with some datasets support)

This is similar to the above, but datasets would offer some way of declaring that a column is a IIIF URL and would then expose the options associated with IIIF images automatically, i.e. if you did something like:

features = {"label": ClassLabel(names=['dog','cat']), 
                    "url": datasets.IIIFURL()}

inside your loading script, you would automatically have size, fmt, etc. options exposed when loading the dataset.

Other possible integrations

Some other possible ways (in pseudocode) that a user could interact with IIIF URLs:

The ability to cast to an IIIFImage feature type:

ds.cast_column('url', IIIFImage, download=False)

The ability to specify some options associated with IIIF URLs:

ds = ds.set_iiif_options(column='url', size="250,250")

I think all of these would rely on having an IIIFImage feature type - this would be a little bit of a Frankenstein between a string and datasets.Image. I think most of the actual image behaviour would be exactly the same as datasets.Image; the difference would be that the underlying URL could be modified in various ways.
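In the absence of a dedicated feature type, something like set_iiif_options could be approximated today with map, reusing the hypothetical set_iiif_size helper sketched in the Image API section above:

# Approximating set_iiif_options with map: rewrite the size segment of each URL.
ds = ds.map(lambda example: {"url": set_iiif_size(example["url"], "250,250")})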

Prerequisite requirements

There are a few pre-requisites that I can anticipate. This doesn't cover a full implementation of IIIF support, which would have different requirements depending on the approach taken to implementing IIIF. Some of these features would be useful independently of adding IIIF support:

Support for handling failed images loaded via a URL (or a specific IIIFImage feature)

Working with images via web requests will inevitably return the odd failed request. If these images are requested and don't return, it would be useful to have None returned instead of an error. For example, when using push_to_hub, datasets will try to include the image but currently fails on bad URLs.

from datasets import Dataset
import datasets

# three valid IIIF URLs plus one bad URL
urls = ['https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/!250,250/0/default.jpg'] * 3
urls.append("badurl.com/image.jpg")
data = {"url": urls}
ds = Dataset.from_dict(data)
ds = ds.cast_column('url', datasets.Image())
ds[3]['url']  # accessing the bad URL raises FileNotFoundError

returns a FileNotFoundError. For streaming large datasets of images using their URLs, it could be useful to have None returned instead. This has implications for the actual training loop, i.e. you now need to somehow skip those examples, so it might not be desirable to support this.

Caching support

Since IIIF requests images via a URL, it would be great to have a way of not requesting the images multiple times. This is tracked in https://github.com/huggingface/datasets/issues/3142, and I think it would also be very desirable to have here, particularly as one of the primary use cases of IIIF may be unsupervised pre-training on large datasets of IIIF URLs.
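A minimal sketch of what caching could look like outside of datasets, keying a local cache directory on a hash of the URL (illustrative only, and separate from the datasets cache):

# Cache downloaded images on disk, keyed on a hash of the URL.
import hashlib
import pathlib
import requests

CACHE_DIR = pathlib.Path("iiif_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url: str) -> bytes:
    key = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if key.exists():
        return key.read_bytes()  # cache hit: skip the network request
    content = requests.get(url, timeout=10).content
    key.write_bytes(content)
    return content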

Support for Parsing IIIF URLs

This gets closer to the actual implementation. Here the requirement would be some way for datasets to parse a URL that the user specifies is a IIIF URL. An example of a Python library that does this is piffle: https://github.com/Princeton-CDH/piffle. I also have a rough version that uses dataclasses which I can share.
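For illustration, a minimal dataclass-based sketch (hypothetical; neither piffle's API nor the rough version mentioned above) could look like:

# Parse and rebuild IIIF Image API URLs with a dataclass.
from dataclasses import dataclass, replace

@dataclass
class IIIFImageURL:
    base: str        # {scheme}://{server}{/prefix}/{identifier}
    region: str = "full"
    size: str = "full"
    rotation: str = "0"
    quality: str = "default"
    fmt: str = "jpg"

    @classmethod
    def parse(cls, url: str) -> "IIIFImageURL":
        base, region, size, rotation, tail = url.rsplit("/", 4)
        quality, fmt = tail.rsplit(".", 1)
        return cls(base, region, size, rotation, quality, fmt)

    def __str__(self) -> str:
        return f"{self.base}/{self.region}/{self.size}/{self.rotation}/{self.quality}.{self.fmt}"

url = IIIFImageURL.parse(
    "https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/full/0/default.jpg"
)
print(replace(url, size="!250,250"))  # same image, resized server-side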

Why it might not be worthwhile/suitable for datasets

There are some reasons why this might not be worth implementing, chiefly around how widely IIIF is adopted outside the cultural heritage sector (see the discussion below).

That said, some of the requirements outlined above would be useful for working with images anyway. These could be implemented prior to a final decision about whether IIIF support could/should be added to datasets.

Suggested next steps:

I realise this is a long and slightly open-ended issue. I am happy to clarify or answer questions on IIIF and possible integrations. If the prerequisite requirements seem worth exploring, or are better explored in their own issues, let me know and I can open new issues for those.

mariosasko commented 2 years ago

Hi! Thanks for the detailed analysis of adding IIIF support. I like the idea of "using IIIF through dataset scripts" due to its ease of use. Another approach that I like is yielding image ids and using the piffle library (which offers a bit more flexibility), plus map to download and cache the images. We can handle bad URLs in map by returning None. Plus, we can add a Dataset Preprocessing section to the cards of such datasets with code explaining this approach. WDYT?
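A minimal sketch of that map-based approach, reusing the urls list from the example above; fetch_image is an illustrative helper, not a datasets API:

import requests
from datasets import Dataset

def fetch_image(example):
    # download the image bytes; return None for bad URLs or failed requests
    try:
        resp = requests.get(example["url"], timeout=10)
        resp.raise_for_status()
        example["image"] = resp.content
    except Exception:
        example["image"] = None
    return example

ds = Dataset.from_dict({"url": urls})
ds = ds.map(fetch_image)  # failed downloads yield None instead of raising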

Currently, IIIF is mainly used by cultural heritage organizations (museums, archives, etc.). The adoption of IIIF in this sector has been growing, but it's possible that adoption won't extend to other industries, which may also be a source of image data for training ML models.

This is why (currently) adding a new feature type would be overkill, IMO.