huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

add CustomFeature base class to support user-defined features with encoding/decoding logic #7221

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Intended as a fix for #7220, if this kind of extensibility is something that datasets is willing to support!

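from datasets import Features
from datasets.features.features import CustomFeature  # base class added by this PR

class ListOfStrs(CustomFeature):
    requires_encoding = True

    def _encode_example(self, value):
        # Wrap a bare string in a list; pass anything else through unchanged.
        if isinstance(value, str):
            return [value]
        else:
            return value

feats = Features(strlist=ListOfStrs())
# Encoding through Features and through the feature itself should agree.
feats.encode_example({"strlist": "a"})["strlist"] == feats["strlist"].encode_example("a")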
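
For readers without the PR branch checked out, the idea can be sketched in plain Python with no dependency on datasets. This is only an illustration of the opt-in encoding hook described above: `CustomFeatureSketch` and `encode_example_sketch` are hypothetical names, and nothing here is claimed to match the PR's actual implementation or the library's internals.

```python
from typing import Any, Dict


class CustomFeatureSketch:
    """Hypothetical stand-in for the proposed CustomFeature base class."""

    # Subclasses set this to True to opt in to custom encoding.
    requires_encoding = False

    def encode_example(self, value: Any) -> Any:
        # Call the user-defined hook only when the subclass opts in.
        if self.requires_encoding:
            return self._encode_example(value)
        return value

    def _encode_example(self, value: Any) -> Any:
        raise NotImplementedError


class ListOfStrs(CustomFeatureSketch):
    requires_encoding = True

    def _encode_example(self, value: Any) -> Any:
        # Wrap a bare string in a list; pass anything else through unchanged.
        if isinstance(value, str):
            return [value]
        return value


def encode_example_sketch(
    features: Dict[str, CustomFeatureSketch], example: Dict[str, Any]
) -> Dict[str, Any]:
    """Hypothetical helper: encode each column with its own feature."""
    return {name: features[name].encode_example(value) for name, value in example.items()}


feats = {"strlist": ListOfStrs()}
encoded = encode_example_sketch(feats, {"strlist": "a"})
```

In this sketch the per-column dispatch is a plain dictionary lookup; how the real `Features.encode_example` would discover and call a user-defined feature is exactly the question this PR and #7220 are about.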

alex-hh commented 2 weeks ago

@lhoestq would you be open to supporting this kind of extensibility?

lhoestq commented 2 weeks ago

I suggested a fix in https://github.com/huggingface/datasets/issues/7220 that would not necessarily require a parent class for custom features, lmk what you think