huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Custom features not compatible with special encoding/decoding logic #7220

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Describe the bug

It is possible to register custom features using datasets.features.features.register_feature (https://github.com/huggingface/datasets/pull/6727)

However, such features are not compatible with Features.encode_example / decode_example if they require special encoding or decoding logic, because encode_nested_example / decode_nested_example check whether the feature is in a fixed list of encodable types:

https://github.com/huggingface/datasets/blob/16a121d7821a7691815a966270f577e2c503473f/src/datasets/features/features.py#L1349

This prevents custom features from being extended to cases that need their own encoding or decoding logic.

Steps to reproduce the bug

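from datasets import Features

class ListOfStrs:
    def encode_example(self, value):
        # Wrap a bare string in a list; pass any other value through unchanged.
        if isinstance(value, str):
            return [value]
        else:
            return value

feats = Features(strlist=ListOfStrs())
# This should hold, but Features.encode_example does not call the custom encode_example.
assert feats.encode_example({"strlist": "a"})["strlist"] == feats["strlist"].encode_example("a")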

Expected behavior

Registered feature types should be encoded based on some property of the feature (e.g. requires_encoding)?

Environment info

3.0.2

lhoestq commented 2 weeks ago

I think you can fix this simply by replacing the line that checks against the hardcoded features with hasattr(schema, "encode_example").
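
For illustration, here is a minimal, library-independent sketch of the difference between the two checks. It does not reproduce the actual code in features.py: Image is a stand-in class, and HARDCODED_ENCODABLE, encode_with_fixed_list and encode_with_hasattr are hypothetical names invented for this example.

```python
class Image:
    """Stand-in for a built-in feature that has its own encoding logic."""

    def encode_example(self, value):
        return {"path": value}


class ListOfStrs:
    """Custom feature from the report: wraps a bare string in a list."""

    def encode_example(self, value):
        return [value] if isinstance(value, str) else value


# Hypothetical stand-in for the fixed list of encodable types.
HARDCODED_ENCODABLE = (Image,)


def encode_with_fixed_list(schema, obj):
    """Encode only features whose type appears in the fixed list."""
    if isinstance(schema, dict):
        return {k: encode_with_fixed_list(schema[k], obj[k]) for k in schema}
    if isinstance(schema, HARDCODED_ENCODABLE):
        return schema.encode_example(obj)
    return obj  # custom features fall through unencoded


def encode_with_hasattr(schema, obj):
    """Encode any feature that defines encode_example (duck typing)."""
    if isinstance(schema, dict):
        return {k: encode_with_hasattr(schema[k], obj[k]) for k in schema}
    if hasattr(schema, "encode_example"):
        return schema.encode_example(obj)
    return obj


schema = {"img": Image(), "strlist": ListOfStrs()}
example = {"img": "cat.png", "strlist": "a"}

fixed = encode_with_fixed_list(schema, example)
duck = encode_with_hasattr(schema, example)
```

With the fixed list, the "strlist" value passes through untouched, while the hasattr check routes it through ListOfStrs.encode_example. The trade-off is that duck typing applies to any object with an encode_example attribute, whether or not it was registered.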

alex-hh commented 2 weeks ago

#7284