Open davanstrien opened 2 years ago
Hi! Thanks for the detailed analysis of adding IIIF support. I like the idea of "using IIIF through datasets scripts" due to its ease of use. Another approach that I like is yielding image ids and using the piffle library (which offers a bit more flexibility) + map to download + cache images. We can handle bad URLs in map by returning None. Plus, we can add a Dataset Preprocessing section with the code that explains this approach to the card of such datasets. WDYT?
Currently, IIIF is mainly used by cultural heritage organizations (museums, archives, etc.). The adoption of IIIF in this sector has been growing, but it's possible that adoption won't extend to other industries that may also be a source of image data for training ML models.
This is why (currently) adding a new feature type would be overkill, IMO.
This is a feature request for support for IIIF in datasets. Apologies for the long issue. I have also used a different format to the usual feature request since I think that makes more sense, but I'm happy to use the standard template if preferred.
What is IIIF?
IIIF (International Image Interoperability Framework)
The tl;dr is that IIIF provides various specifications for implementing useful functionality for:
Some institutions that have various levels of support for IIIF include: The British Library, Internet Archive, Library of Congress, Wikidata. There are also many smaller institutions that have IIIF support. An incomplete list can be found here: https://iiif.io/guides/finding_resources/
IIIF APIs
IIIF consists of a number of APIs which could be integrated with datasets. I think the most obvious candidate for inclusion would be the Image API.
IIIF Image API
The Image API https://iiif.io/api/image/3.0/ is likely the most suitable first candidate for integration with datasets. The Image API offers a consistent protocol for requesting images via a URL:
{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
A concrete example of this:
https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/full/0/default.jpg
As you can see the scheme offers a number of options that can be specified in the URL, for example, size. Using the example URL we return:
We can change the size to request a size of 250 by 250. This is done by changing the size from full to 250,250, i.e. switching the URL to https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/250,250/0/default.jpg
We can also request the image with max width 250, max height 250 whilst maintaining the aspect ratio using !w,h, i.e. changing the URL to https://stacks.stanford.edu/image/iiif/hg676jb4964%2F0380_796-44/full/!250,250/0/default.jpg
A full overview of the options for size can be found here: https://iiif.io/api/image/3.0/#42-size
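As a minimal sketch of how the URL template above could be handled in Python, the snippet below splits an Image API URL into its parts and rebuilds it with a different size. The IIIFImageURL class and its field names are hypothetical and not part of datasets or any IIIF library; the sketch assumes the URL has no query string and keeps the identifier percent-encoded as found. The example URLs use the size keyword full, which Image API 2.x servers accept; version 3.0 of the specification uses max instead.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IIIFImageURL:
    """Hypothetical helper for illustration only."""

    base: str         # {scheme}://{server}{/prefix}
    identifier: str   # kept percent-encoded, exactly as it appears in the URL
    region: str = "full"
    size: str = "full"
    rotation: str = "0"
    quality: str = "default"
    fmt: str = "jpg"

    @classmethod
    def from_url(cls, url: str) -> "IIIFImageURL":
        # The last five path segments are
        # identifier/region/size/rotation/quality.format;
        # everything before them is treated as the base.
        try:
            base, identifier, region, size, rotation, last = url.rsplit("/", 5)
            quality, fmt = last.rsplit(".", 1)
        except ValueError:
            raise ValueError(f"not an IIIF Image API URL: {url!r}") from None
        return cls(base, identifier, region, size, rotation, quality, fmt)

    def to_url(self) -> str:
        return (
            f"{self.base}/{self.identifier}/{self.region}/{self.size}"
            f"/{self.rotation}/{self.quality}.{self.fmt}"
        )


original = IIIFImageURL.from_url(
    "https://stacks.stanford.edu/image/iiif/"
    "hg676jb4964%2F0380_796-44/full/full/0/default.jpg"
)
# Request a version that fits within 250 x 250 while keeping the aspect ratio.
thumbnail = replace(original, size="!250,250")
print(thumbnail.to_url())
```

Because the dataclass is frozen, each variation is a new object and the original URL stays untouched, which makes it easy to derive several sizes from one identifier.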
Why would/could this be useful for datasets?
There are a few reasons why support for the IIIF Image API could be useful. Broadly the ability to have more control over how an image is returned from a server is useful for many ML workflows:
These may become particularly useful when pre-training models on large image datasets where the cost of downloading images with 1600 pixel width when you actually want 240 has a larger impact.
What could this look like in datasets?
I think there are various ways in which support for IIIF could potentially be included in datasets. These suggestions aren't fully fleshed out but hopefully give a sense of possible approaches that match existing datasets methods in their approach.
Use through datasets scripts
Loading images via URL is already supported. There are a few possible 'extras' that could be included when using IIIF. One option is to leverage the IIIF protocol in datasets scripts, i.e. the dataset script can expose the IIIF options via the dataset script:
This is already possible. The approach to parsing the IIIF URLs would be left to the person creating the dataset script.
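A rough sketch of what this could look like: the function below mimics the example-generating step of a loading script, with size and fmt as user-facing options. The function name, its parameters and the "image" key are assumptions for illustration, and the sketch deliberately does not import datasets so that it runs standalone. In a real script these options would more likely live on a builder config and be passed in when loading the dataset.

```python
from typing import Dict, Iterable, Iterator, Tuple
from urllib.parse import quote


def generate_examples(
    base: str,
    identifiers: Iterable[str],
    size: str = "full",
    fmt: str = "jpg",
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs whose image URL honours the IIIF options.

    Hypothetical stand-in for the generator method of a loading script.
    """
    for key, identifier in enumerate(identifiers):
        # Identifiers may contain "/" and must be percent-encoded in the URL.
        encoded = quote(identifier, safe="")
        url = f"{base}/{encoded}/full/{size}/0/default.{fmt}"
        yield key, {"image": url}


examples = list(
    generate_examples(
        "https://stacks.stanford.edu/image/iiif",
        ["hg676jb4964/0380_796-44"],
        size="!250,250",
    )
)
```

Here the person writing the script decides which IIIF parameters to expose; region, rotation and quality are fixed in this sketch but could be surfaced in the same way.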
Support through dataset scripts (with some datasets support)
This is similar to the above, but datasets would offer some way of saying that this is an IIIF URL and then expose the options associated with IIIF images automatically, i.e. if you did something like:
inside your loading script, you would automatically have exposed size, fmt etc. options when loading the dataset.
Other possible integrations
Some other possible pseudocode ways that a user could interact with IIIF URLs:
The ability to cast to an IIIFImage feature type:
The ability to specify some options associated with IIIF URLs.
I think all of these would rely on having an IIIFImage feature type. This would be a little bit of a Frankenstein between a string and datasets.Image. I think most of the actual image behaviour would be exactly the same as datasets.Image; the difference would be that the underlying URL could be modified in various ways.
Prerequisite requirements
There are a few pre-requisites that I can anticipate. This doesn't cover a full implementation of IIIF support which would have different requirements depending on the approach taken to implementing IIIF. Some of these features would be useful independently of adding IIIF support:
support for handling failed images loaded via a URL (or a specific IIIFImage feature).
Working with images via web requests will inevitably return the odd failed request. If these images are then requested and don't return, it would be useful to have None returned instead of an error. For example, when using push_to_hub, datasets will try to include the image but currently fails with bad URLs.
returns a FileNotFoundError. For streaming large datasets of images using their URLs, it could be useful to have None returned instead. This has implications for the actual training loop, i.e. you now need to somehow skip those examples. Because of this, it might not be desirable to support this.
Caching support
Since IIIF requests images via a URL it would be great to have a way of not requesting the images multiple times. This is tracked in https://github.com/huggingface/datasets/issues/3142 and I think this would also be very desirable to have here particularly as one of the primary use cases of IIIF may be to do unsupervised pre-training on large datasets of IIIF URLs.
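The two prerequisites above (returning None for failed requests, and not requesting the same image twice) can be sketched together using only the standard library. The function name, the cache layout (one file per URL, named by a SHA-256 hash of the URL) and the timeout are assumptions for illustration, not how datasets handles this.

```python
import hashlib
import http.client
import urllib.request
from pathlib import Path
from typing import Optional


def fetch_image_bytes(
    url: str, cache_dir: Path, timeout: float = 10.0
) -> Optional[bytes]:
    """Return the bytes behind `url`, or None if the request fails.

    Successful responses are stored on disk so that a second call with the
    same URL does not hit the server again. Hypothetical helper.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cache_path.exists():
        return cache_path.read_bytes()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except (OSError, ValueError, http.client.HTTPException):
        # URLError/HTTPError and timeouts are OSError subclasses;
        # ValueError covers strings that are not valid URLs.
        return None
    cache_path.write_bytes(data)
    return data
```

A function like this could be applied per row (for example through map), with rows that come back as None filtered out before training. Failed requests are not cached here, so a later call will retry them.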
Support for Parsing IIIF URLs
This gets closer to the actual implementation. Here the requirement would be some way for datasets to parse a URL that the user specifies is an IIIF URL. An example of a Python library that does this: https://github.com/Princeton-CDH/piffle. I also have a rough version that uses dataclasses which I can share.
Why it might not be worthwhile/suitable for datasets
There are some reasons that this might not be worth implementing:
datasets libraries' role to protect users from.
Some of the requirements outlined above would be useful for images anyway. These could be implemented prior to a final decision about whether IIIF support could/should be added to datasets.
Suggested next steps:
I realise this is a long and slightly open-ended issue. I am happy to clarify/answer questions on IIIF and possible integrations. If the prerequisite requirements seem worth exploring/are better explored in their own issues let me know and I can open new issues for those.