huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Support custom feature types #5766

Open jmontalt opened 1 year ago

jmontalt commented 1 year ago

Feature request

I think it would be nice to allow registering custom feature types with the 🤗 Datasets library. For example, it could allow something along the following lines:

from datasets.features import register_feature_type  # this would be a new function

@register_feature_type
class CustomFeatureType:
    def encode_example(self, value):
        """User-provided logic to encode an example of this feature."""
        pass

    def decode_example(self, value, token_per_repo_id=None):
        """User-provided logic to decode an example of this feature."""
        pass

Motivation

Users of 🤗 Datasets, such as myself, may want to use the library to load datasets with unsupported feature types (i.e., beyond ClassLabel, Image, or Audio). This would be useful for prototyping new feature types and for feature types that aren't used widely enough to warrant inclusion in 🤗 Datasets.

At the moment, this is only possible by monkey-patching 🤗 Datasets, which obfuscates the code and is prone to breaking with library updates. It also requires the user to write some custom code which could be easily avoided.

Your contribution

I would be happy to contribute this feature. My proposed solution would involve changing the following call to globals() to an explicit feature type registry, which a user-facing register_feature_type decorator could update.

https://github.com/huggingface/datasets/blob/fd893098627230cc734f6009ad04cf885c979ac4/src/datasets/features/features.py#L1329

I would also provide an abstract base class for custom feature types which users could inherit. This would have at least an encode_example method and a decode_example method, similar to Image or Audio.

The existing encode_nested_example and decode_nested_example functions would also need to be updated to correctly call the corresponding functions for the new type.
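To make the proposal more concrete, here is a minimal sketch of what the registry and base class could look like. It is only an illustration: `CustomFeature`, `_FEATURE_TYPES`, `get_feature_type`, and `ComplexNumber` are hypothetical names, none of this exists in 🤗 Datasets today, and the real lookup would have to be wired into the library's own feature code.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a feature type's class name to its class.
# The idea is that a name-based lookup would consult this dict
# instead of the module's globals().
_FEATURE_TYPES = {}


class CustomFeature(ABC):
    """Hypothetical base class that custom feature types would inherit from."""

    @abstractmethod
    def encode_example(self, value):
        """Convert a user-facing value into a storable representation."""

    @abstractmethod
    def decode_example(self, value, token_per_repo_id=None):
        """Convert a stored representation back into a user-facing value."""


def register_feature_type(cls):
    """Hypothetical decorator that adds `cls` to the registry and returns it."""
    if not issubclass(cls, CustomFeature):
        raise TypeError(f"{cls.__name__} must inherit from CustomFeature")
    if cls.__name__ in _FEATURE_TYPES:
        raise ValueError(f"{cls.__name__} is already registered")
    _FEATURE_TYPES[cls.__name__] = cls
    return cls


def get_feature_type(name):
    """Look up a registered feature type by its class name."""
    return _FEATURE_TYPES[name]


# Toy feature type, used only to exercise the registry.
@register_feature_type
class ComplexNumber(CustomFeature):
    def encode_example(self, value):
        return {"real": value.real, "imag": value.imag}

    def decode_example(self, value, token_per_repo_id=None):
        return complex(value["real"], value["imag"])


feature = get_feature_type("ComplexNumber")()
encoded = feature.encode_example(1 + 2j)
decoded = feature.decode_example(encoded)
```

Returning the class unchanged from the decorator keeps it usable as a normal class, and rejecting duplicate names avoids one package silently overriding another's feature type.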

lhoestq commented 1 year ago

Hi! Interesting :) What kind of new types would you like to use?

Note that you can already implement your own decoding with set_transform, which decodes data on the fly when rows are accessed.
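As a rough illustration of the set_transform approach, the sketch below decodes a column lazily at access time. The column names (`id`, `point`) and the "x,y" string encoding are made up for the example; a real use case would put its own decoding logic in `decode_batch`. The 🤗 Datasets part is guarded so the transform itself can be tried without the library installed.

```python
def decode_batch(batch):
    """Decode the hypothetical 'point' column from "x,y" strings to float tuples.

    `batch` is a dict mapping column names to lists of values, which is the
    shape a transform function receives.
    """
    decoded_points = [
        tuple(float(part) for part in text.split(",")) for text in batch["point"]
    ]
    # Return a new dict so the input batch is left untouched.
    return {**batch, "point": decoded_points}


try:
    from datasets import Dataset
except ImportError:  # 🤗 Datasets is optional for this sketch
    Dataset = None

if Dataset is not None:
    ds = Dataset.from_dict({"id": [0, 1], "point": ["1.5,2", "3,4"]})
    # The transform runs only when rows are accessed, not when it is set.
    ds.set_transform(decode_batch)
    print(ds[0])
```

Because the transform is applied on access, the stored data stays in its encoded form, which is the main difference from a dedicated feature type that also handles encoding.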

mariosasko commented 1 year ago

An interesting proposal indeed.

Pandas and Polars have the "extension API", so doing something similar on our side could be useful, too. However, this requires defining a common interface for the existing feature types before discussing the API/workflow for defining/sharing custom feature types, and this could take some time.

It would also be nice if the datasets viewer could render these custom types.

jmontalt commented 1 year ago

Thank you for your replies! @lhoestq I have a use case involving whole-slide images in digital pathology. These are very large images (potentially gigapixel scale), so standard image tools are not suitable. Essentially, encoding/decoding can be done from/to OpenSlide objects. Though the digital pathology community may be interested in this use case, it may not be widely useful enough to justify adding a dedicated feature type. There will likely be many other use cases for a generic custom feature type, however.

Thank you for pointing out set_transform! I will make sure to keep this in mind in the future.

@mariosasko An "extension API" sounds like a good idea, though I understand that this needs to be properly defined, and that you will need to discuss it internally. Support from the viewer would be awesome, too, though the generalization to arbitrary types sounds challenging.

For now, happy to know that you're considering the feature. Feel free to let me know if I can do anything to support the process.

zux-hidden commented 6 months ago

Not a beautiful solution, but we use this for now:

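import datasets.features.features

# Keep a reference to the library's original decoder so we can fall back to it.
old_decode_fn = datasets.features.features.decode_nested_example

def decode_ext_fn(schema, obj, token_per_repo_id=None):
    # Decode the new type here (e.g. return early when `schema` is the custom type).

    # Everything else is handled by the original decoder.
    return old_decode_fn(schema, obj, token_per_repo_id)

# Monkey-patch the library so it calls the extended decoder.
datasets.features.features.decode_nested_example = decode_ext_fn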