It is currently quite easy to cause massive slowdowns when using datasets if you are not familiar with the underlying data conversions, e.g. by making a bad choice of formatting.
Would it be more user-friendly to set defaults that avoid this as much as possible? For example, format Array features as NumPy arrays rather than Python lists.
Default array formatting leads to slow performance, e.g.:
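import time

import numpy as np
from datasets import Array3D, Dataset, Features

# Two variable-length float32 array columns, 50 rows each.
features = Features(
    {
        "array0": Array3D((None, 10, 10), dtype="float32"),
        "array1": Array3D((None, 10, 10), dtype="float32"),
    }
)
dataset = Dataset.from_dict(
    {f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)},
    features=features,
)

# Default formatting on the map-style dataset.
t0 = time.time()
for ex in dataset:
    pass
t1 = time.time()
print(f"default: {t1 - t0:.2f} s")
# ~1.4 s

# Default formatting on an iterable dataset.
ds = dataset.to_iterable_dataset()
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(f"iterable: {t1 - t0:.2f} s")

# NumPy formatting on the map-style dataset.
ds = dataset.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(f"numpy: {t1 - t0:.2f} s")

# NumPy formatting on an iterable dataset.
ds = dataset.to_iterable_dataset().with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(f"iterable + numpy: {t1 - t0:.2f} s")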
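To give a rough feel for where the time might go, here is a minimal sketch that uses only NumPy and does not touch `datasets` at all. It assumes (this is my assumption, not something measured inside the library) that the cost of the default format is dominated by materializing each array as nested Python lists. The helper `time_call` is a hypothetical name introduced just for this illustration.

```python
import time

import numpy as np


def time_call(fn, repeats=5):
    """Return the best wall-clock time in seconds of calling `fn` `repeats` times."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best


# Same shape and dtype as the larger examples in the benchmark above.
arr = np.zeros((2000, 10, 10), dtype=np.float32)

# Nested Python lists: one Python float object is created per element.
as_list = arr.tolist()
t_list = time_call(lambda: arr.tolist())

# Staying in NumPy: a view shares the existing buffer and copies nothing.
t_view = time_call(lambda: arr.view())

print(f"tolist: {t_list * 1e3:.3f} ms, view: {t_view * 1e3:.3f} ms")
```

This only illustrates the list-versus-array gap in isolation; the actual conversion path inside `datasets` may add other costs on top.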
Feature request
Your contribution
I may be able to contribute.