huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add support for IIIF in datasets #4041

Open davanstrien opened 2 years ago

davanstrien commented 2 years ago

This is a feature request for IIIF support in datasets. Apologies for the long issue. I have also used a different format from the usual feature request template since I think that makes more sense here, but I'm happy to use the standard template if preferred.

What is IIIF?

IIIF (International Image Interoperability Framework)

is a set of open standards for delivering high-quality, attributed digital objects online at scale. It’s also an international community developing and implementing the IIIF APIs. IIIF is backed by a consortium of leading cultural institutions.

The tl;dr is that IIIF provides various specifications for implementing useful functionality around delivering, manipulating, and annotating images over the web.

Some institutions with various levels of IIIF support include the British Library, the Internet Archive, the Library of Congress, and Wikidata. There are also many smaller institutions with IIIF support. An incomplete list can be found here: https://iiif.io/guides/finding_resources/

IIIF APIs

IIIF consists of a number of APIs which could be integrated with datasets. I think the most obvious candidate for inclusion would be the Image API.

IIIF Image API

The Image API https://iiif.io/api/image/3.0/ is likely the most suitable first candidate for integration with datasets. The Image API offers a consistent protocol for requesting images via a URL:

{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

A concrete example of this:

https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/full/0/default.jpg

As you can see, the scheme offers a number of options that can be specified in the URL, for example, size. The example URL above returns the image at its full size.

We can request a size of 250 by 250 by changing the size segment from full to 250,250, i.e. switching the URL to https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/250,250/0/default.jpg

We can also request the image with a max width of 250 and a max height of 250 whilst maintaining the aspect ratio, using !w,h, i.e. changing the URL to https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/!250,250/0/default.jpg

A full overview of the options for size can be found here: https://iiif.io/api/image/3.0/#42-size
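For illustration, here is a minimal Python sketch of rewriting the size segment of a IIIF Image API URL; the helper name set_iiif_size is hypothetical, not part of any library:

# Minimal sketch: rewrite the {size} segment of a IIIF Image API URL.
# Assumes the URL follows {scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

def set_iiif_size(url: str, size: str) -> str:
    # split off the last three path segments: size, rotation, quality.format
    parts = url.rsplit("/", 3)
    if len(parts) != 4:
        raise ValueError(f"Not a IIIF Image API URL: {url}")
    parts[1] = size  # replace the size segment
    return "/".join(parts)

url = "https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/full/0/default.jpg"
print(set_iiif_size(url, "250,250"))   # exact 250x250
print(set_iiif_size(url, "!250,250"))  # best fit within 250x250, aspect ratio preserved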

Why would/could this be useful for datasets?

There are a few reasons why support for the IIIF Image API could be useful. Broadly, the ability to have more control over how an image is returned from a server (its region, size, rotation, quality, and format) is useful for many ML workflows.

This may become particularly useful when pre-training models on large image datasets, where the cost of downloading images at 1600 pixels wide when you actually want 240 has a larger impact.

What could this look like in datasets?

I think there are various ways in which support for IIIF could potentially be included in datasets. These suggestions aren't fully fleshed out but hopefully give a sense of possible approaches that fit the style of existing datasets methods.

Use through datasets scripts

Loading images via URL is already supported. There are a few possible 'extras' that could be included when using IIIF. One option is to leverage the IIIF protocol in dataset scripts, i.e. the dataset script can expose the IIIF options directly:

ds = load_dataset("iiif_dataset", image_size="250,250", fmt="jpg")

This is already possible. The approach to parsing the IIIF URLs would be left to the person creating the dataset script.
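As a rough illustration of what such a script could look like today, here is a minimal sketch of a loading script that exposes IIIF options via a BuilderConfig; the dataset name, the index file, and the URL-rewriting logic are all hypothetical:

# A minimal sketch of a loading script exposing IIIF options.
import datasets

_INDEX_URL = "https://example.org/iiif_urls.txt"  # one IIIF image URL per line (hypothetical)

class IIIFConfig(datasets.BuilderConfig):
    def __init__(self, image_size="full", fmt="jpg", **kwargs):
        super().__init__(**kwargs)
        self.image_size = image_size
        self.fmt = fmt

class IIIFDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = IIIFConfig

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"url": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        path = dl_manager.download(_INDEX_URL)
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"path": path})]

    def _generate_examples(self, path):
        with open(path, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                # rewrite the {size} and {format} segments with the config values
                base, region, _, rotation, tail = line.strip().rsplit("/", 4)
                quality = tail.rsplit(".", 1)[0]
                url = f"{base}/{region}/{self.config.image_size}/{rotation}/{quality}.{self.config.fmt}"
                yield idx, {"url": url}

With a script along these lines, the load_dataset call shown above would work as written, with the size rewriting handled entirely inside the script.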

Support through dataset scripts (with some datasets support)

This is similar to the above, but datasets would offer some way of declaring that a column is a IIIF URL and would then expose the options associated with IIIF images automatically, i.e. if you did something like:

features = {"label": ClassLabel(names=['dog','cat']), 
                    "url": datasets.IIIFURL()}

inside your loading script, you would automatically have size, fmt, etc. options exposed when loading the dataset.

Other possible integrations

Some other possible ways (in pseudocode) that a user could interact with IIIF URLs:

The ability to cast to an IIIFImage feature type:

ds.cast_column('url', IIIFImage, download=False)

The ability to specify some options associated with IIIF URLs:

ds = ds.set_iiif_options(column='url', size="250,250")

I think all of these would rely on having an IIIFImage feature type - this would be a little bit of a Frankenstein between a string and datasets.Image. I think most of the actual image behaviour would be exactly the same as datasets.Image; the difference would be that the underlying URL could be modified in various ways.
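In the absence of a dedicated feature type, something like set_iiif_options could be approximated today with map, reusing the hypothetical set_iiif_size helper sketched in the Image API section above:

# Approximating set_iiif_options with map: rewrite the size segment of each URL.
ds = ds.map(lambda example: {"url": set_iiif_size(example["url"], "250,250")})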

Prerequisite requirements

There are a few pre-requisites that I can anticipate. This doesn't cover a full implementation of IIIF support, which would have different requirements depending on the approach taken to implementing IIIF. Some of these features would be useful independently of adding IIIF support:

Support for handling failed images loaded via a URL (or a specific IIIFImage feature)

Working with images via web requests will inevitably return the odd failed request. If these images are requested and don't return, it would be useful to have None returned instead of an error. For example, when using push_to_hub, datasets will try to include the image but currently fails on bad URLs.

from datasets import Dataset
import datasets

# three valid IIIF URLs plus one bad URL
urls = ['https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/!250,250/0/default.jpg'] * 3
urls.append("badurl.com/image.jpg")
data = {"url": urls}
ds = Dataset.from_dict(data)
ds = ds.cast_column('url', datasets.Image())
ds[3]['url']  # accessing the bad URL raises FileNotFoundError

returns a FileNotFoundError. For streaming large datasets of images using their URLs, it could be useful to have None returned instead. This has implications for the actual training loop, i.e. you now need to somehow skip those examples, so it might not be desirable to support this.

Caching support

Since IIIF requests images via a URL, it would be great to have a way of not requesting the images multiple times. This is tracked in https://github.com/huggingface/datasets/issues/3142, and I think it would also be very desirable to have here, particularly as one of the primary use cases of IIIF may be unsupervised pre-training on large datasets of IIIF URLs.
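A minimal sketch of what caching could look like outside of datasets, keying a local cache directory on a hash of the URL (illustrative only, and separate from the datasets cache):

# Cache downloaded images on disk, keyed on a hash of the URL.
import hashlib
import pathlib
import requests

CACHE_DIR = pathlib.Path("iiif_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url: str) -> bytes:
    key = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if key.exists():
        return key.read_bytes()  # cache hit: skip the network request
    content = requests.get(url, timeout=10).content
    key.write_bytes(content)
    return content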

Support for Parsing IIIF URLs

This gets closer to the actual implementation. Here the requirement would be some way for datasets to parse a URL that the user specifies is a IIIF URL. An example of a Python library that does this is piffle: https://github.com/Princeton-CDH/piffle. I also have a rough version that uses dataclasses which I can share.
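For illustration, a minimal dataclass-based sketch (hypothetical; neither piffle's API nor the rough version mentioned above) could look like:

# Parse and rebuild IIIF Image API URLs with a dataclass.
from dataclasses import dataclass, replace

@dataclass
class IIIFImageURL:
    base: str        # {scheme}://{server}{/prefix}/{identifier}
    region: str = "full"
    size: str = "full"
    rotation: str = "0"
    quality: str = "default"
    fmt: str = "jpg"

    @classmethod
    def parse(cls, url: str) -> "IIIFImageURL":
        base, region, size, rotation, tail = url.rsplit("/", 4)
        quality, fmt = tail.rsplit(".", 1)
        return cls(base, region, size, rotation, quality, fmt)

    def __str__(self) -> str:
        return f"{self.base}/{self.region}/{self.size}/{self.rotation}/{self.quality}.{self.fmt}"

url = IIIFImageURL.parse(
    "https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/full/0/default.jpg"
)
print(replace(url, size="!250,250"))  # same image, resized server-side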

Why it might not be worthwhile/suitable for datasets

There are some reasons why this might not be worth implementing, chiefly around how widely IIIF is adopted outside the cultural heritage sector (see the discussion below).

That said, some of the requirements outlined above would be useful for working with images anyway. These could be implemented prior to a final decision about whether IIIF support could/should be added to datasets.

Suggested next steps:

I realise this is a long and slightly open-ended issue. I am happy to clarify or answer questions on IIIF and possible integrations. If the prerequisite requirements seem worth exploring, or are better explored in their own issues, let me know and I can open new issues for those.

mariosasko commented 2 years ago

Hi! Thanks for the detailed analysis of adding IIIF support. I like the idea of "using IIIF through dataset scripts" due to its ease of use. Another approach that I like is yielding image ids and using the piffle library (which offers a bit more flexibility), plus map to download and cache the images. We can handle bad URLs in map by returning None. Plus, we can add a Dataset Preprocessing section to the cards of such datasets with code explaining this approach. WDYT?
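A minimal sketch of that map-based approach, reusing the urls list from the example above; fetch_image is an illustrative helper, not a datasets API:

import requests
from datasets import Dataset

def fetch_image(example):
    # download the image bytes; return None for bad URLs or failed requests
    try:
        resp = requests.get(example["url"], timeout=10)
        resp.raise_for_status()
        example["image"] = resp.content
    except Exception:
        example["image"] = None
    return example

ds = Dataset.from_dict({"url": urls})
ds = ds.map(fetch_image)  # failed downloads yield None instead of raising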

Currently, IIIF is mainly used by cultural heritage organizations (museums, archives, etc.). The adoption of IIIF in this sector has been growing, but it's possible that adoption won't extend to other industries, which may also be a source of image data for training ML models.

This is why (currently) adding a new feature type would be overkill, IMO.