huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Composite (multi-column) features #7228

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Feature request

Structured data types (graphs, etc.) can often be stored most efficiently as multiple columns, which then need to be combined during feature decoding.

Although it is currently possible to nest features as structs, my impression is that for a feature composed of multiple NumPy arrays / ArrayXDs in particular, it would be more efficient to store each ArrayXD as a separate column (though I'm not sure by how much).

Perhaps specification / implementation could be supported by something like:

features = Features({("feature0", "feature1"): Features(feature0=Array2D((None, 10), dtype="float32"), feature1=Array2D((None, 10), dtype="float32"))})
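To make the intended semantics a bit more concrete, here is a rough sketch in plain Python of what "combining columns during feature decoding" could look like. It does not use or modify `datasets`; `CompositeFeature`, `decode_row` and `make_graph` are hypothetical names invented for illustration, and the row is an ordinary dict standing in for a decoded example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class CompositeFeature:
    """Hypothetical: one logical feature backed by several physical columns."""

    columns: Tuple[str, ...]      # names of the columns that are stored separately
    combine: Callable[..., Any]   # builds the logical value from the column values

    def decode(self, row: Dict[str, Any]) -> Any:
        # Pass the column values to `combine` in the declared order.
        return self.combine(*(row[name] for name in self.columns))


def decode_row(row: Dict[str, Any], composites: Dict[str, CompositeFeature]) -> Dict[str, Any]:
    """Replace the source columns of each composite feature with its combined value."""
    consumed = {col for feat in composites.values() for col in feat.columns}
    # Columns that are not part of any composite feature pass through unchanged.
    decoded = {key: value for key, value in row.items() if key not in consumed}
    for name, feat in composites.items():
        decoded[name] = feat.decode(row)
    return decoded


def make_graph(node_attrs, edge_attrs):
    # Hypothetical combiner: a graph with per-node and per-edge attribute arrays.
    return {"nodes": node_attrs, "edges": edge_attrs}


composites = {"graph": CompositeFeature(("feature0", "feature1"), make_graph)}

# Two (None, 10)-shaped arrays stored as separate columns, written as nested lists.
row = {
    "id": 7,
    "feature0": [[0.0] * 10, [1.0] * 10],
    "feature1": [[0.5] * 10],
}
decoded = decode_row(row, composites)
```

With the current library, something similar can probably be approximated by declaring each array as its own `Array2D` column and combining them in a function passed to `Dataset.with_transform`. The request here is for the combination step to be part of the feature definition itself.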

Motivation

Defining efficient composite feature types based on numpy arrays for representing data such as graphs with multiple node and edge attributes is currently challenging.

Your contribution

Possibly able to contribute