huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Convert Array features to numpy arrays rather than lists by default #7210

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Feature request

It is currently quite easy to cause massive slowdowns when using datasets without being familiar with the underlying data conversions, e.g. by making a poor choice of formatting.

Would it be more user-friendly to set defaults that avoid this as much as possible, e.g. by formatting Array features as numpy arrays rather than Python lists?
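To illustrate why the choice between lists and arrays matters, here is a minimal NumPy-only sketch. It does not reproduce the code path that datasets uses internally (I have not checked which conversion the default formatter performs); it only shows the general cost of turning one array of the size used below into nested Python lists versus keeping it as an array. The helper name `best_of` is made up for this example.

```python
import time

import numpy as np

# One example shaped like those in the benchmark below: (2000, 10, 10) float32.
arr = np.zeros((2000, 10, 10), dtype=np.float32)


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time of `fn` over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Nested Python lists: each of the 200,000 values becomes its own Python float.
as_list = arr.tolist()
t_list = best_of(arr.tolist)

# A NumPy view shares the existing buffer, so no per-element work is done.
as_view = arr.view()
t_view = best_of(arr.view)

print(f"tolist: {t_list:.6f}s, view: {t_view:.6f}s")
```

The exact numbers depend on the machine, but the list conversion scales with the number of elements while the view does not.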

Motivation

The default array formatting leads to slow iteration, e.g.:

import time

import numpy as np
from datasets import Array3D, Dataset, Features

# Two variable-length 3D array columns, 50 examples each
features = Features({"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)

# Iterate with the default formatting
t0 = time.time()
for ex in dataset:
    pass
t1 = time.time()

~1.4s

ds = dataset.to_iterable_dataset()
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~10s

ds = dataset.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~0.04s

ds = dataset.to_iterable_dataset().with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~0.04s
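The four measurements above repeat the same timing loop. A small helper along these lines would make them easier to reproduce; this is only a sketch, `time_iteration` is a name made up here, and it uses a plain `range` as a stand-in so it runs without datasets installed.

```python
import time
from typing import Iterable, Tuple


def time_iteration(iterable: Iterable) -> Tuple[int, float]:
    """Iterate once over `iterable`; return (number of items, elapsed seconds)."""
    count = 0
    start = time.perf_counter()
    for _ in iterable:
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


# Stand-in iterable; swap in a dataset object to time real examples.
n_items, seconds = time_iteration(range(50))
print(f"{n_items} examples in {seconds:.4f}s")
```

Passing `dataset`, `dataset.to_iterable_dataset()`, `dataset.with_format("numpy")` or `dataset.to_iterable_dataset().with_format("numpy")` in place of the `range` should give timings comparable to the ones reported above.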

Your contribution

May be able to contribute