Feature request
It is currently easy to cause massive slowdowns when using datasets without being familiar with the underlying data conversions, for example by making poor formatting choices.
Would it be more user-friendly to set defaults that avoid this as much as possible, for example by formatting Array features as NumPy arrays rather than Python lists?
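To illustrate why the output format matters, here is a minimal NumPy-only sketch. It does not use any datasets internals, and the helper names (`to_python_lists`, `to_numpy_view`, `time_call`) are hypothetical. It compares materializing an array as nested Python lists against keeping it as a NumPy array, using the shape of the larger examples in the benchmark under Motivation.

```python
import time

import numpy as np


def to_python_lists(arr):
    """Convert to nested Python lists: one Python float object per element."""
    return arr.tolist()


def to_numpy_view(arr):
    """Keep the data as a NumPy array: a view, with no per-element allocation."""
    return arr.view()


def time_call(fn, arr, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arr)
        best = min(best, time.perf_counter() - t0)
    return best


# Same shape and dtype as the larger examples in the benchmark (assumption:
# this isolates only the list conversion, not the full formatting pipeline).
arr = np.zeros((2000, 10, 10), dtype=np.float32)
list_time = time_call(to_python_lists, arr)
view_time = time_call(to_numpy_view, arr)
print(f"nested lists: {list_time:.6f} s, numpy view: {view_time:.6f} s")
```

The absolute numbers depend on the machine, so none are quoted here. The sketch only shows one likely contributor to the slowdown and is not a measurement of datasets itself.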
Motivation
The default array formatting leads to slow iteration. For example:
import time

import numpy as np
from datasets import Array3D, Dataset, Features

features = Features({"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)

# Map-style Dataset, default format
t0 = time.time()
for ex in dataset:
    pass
t1 = time.time()
~1.4 s
# IterableDataset, default format
ds = dataset.to_iterable_dataset()
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
~10 s
# Map-style Dataset, numpy format
ds = dataset.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
~0.04 s
# IterableDataset, numpy format
ds = dataset.to_iterable_dataset().with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
~0.04 s
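The four timing loops above repeat the same pattern, so a small helper could make the comparison easier to rerun. This is only a sketch: `time_iteration` is a hypothetical name, not part of datasets, and a plain `range` stands in for the dataset so the snippet runs on its own.

```python
import time


def time_iteration(iterable):
    """Iterate over `iterable` once and return (elapsed_seconds, n_items)."""
    n_items = 0
    t0 = time.perf_counter()
    for _ in iterable:
        n_items += 1
    return time.perf_counter() - t0, n_items


# Stand-in iterable so the sketch runs without datasets installed.
# In the benchmark you would pass dataset, dataset.to_iterable_dataset(),
# dataset.with_format("numpy"), and so on.
elapsed, n_items = time_iteration(range(50))
print(f"{n_items} items in {elapsed:.6f} s")
```

`time.perf_counter` is used here instead of `time.time` because it is a monotonic, high-resolution clock intended for measuring short durations.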
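Timings for the four cases above, in order: ~1.4 s (Dataset, default format), ~10 s (IterableDataset, default format), ~0.04 s (Dataset, numpy format), ~0.04 s (IterableDataset, numpy format).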
Your contribution
I may be able to contribute.