Describe the bug
When calling map, it's useful to be able to pass the new feature type, and knowing the features is in fact required by interleave_datasets and concatenate_datasets.
However, passing features can cause a major slowdown for certain types of array features because the features are re-encoded. In the reproduction below, iteration time roughly doubles, from ~1.5 s to ~3 s.
This is separate from the slowdown reported in #7206.
Steps to reproduce the bug
from datasets import Dataset, Features, Sequence, Value
import numpy as np
import time

# Two variable-length float32 sequence columns.
features = Features({
    "array0": Sequence(feature=Value("float32"), length=-1),
    "array1": Sequence(feature=Value("float32"), length=-1),
})
# 50 rows per column, alternating between lengths 5000 and 10000.
dataset = Dataset.from_dict(
    {f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [5000, 10000] * 25] for i in range(2)},
    features=features,
)

# Baseline: map without passing features.
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(f"without features: {t1 - t0:.2f} s")

~1.5 s on main

# Same pipeline, but passing features to map.
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, features=features)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(f"with features: {t1 - t0:.2f} s")

~3 s on main
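The timings above come from single runs. To make the comparison less sensitive to noise, a small helper along these lines could time a full pass several times and keep the best result. This is only a sketch: `time_iteration` is a hypothetical name, not part of datasets, and it assumes the iterable can be rebuilt cheaply from a factory function.

```python
import time
from typing import Callable, Iterable


def time_iteration(make_iterable: Callable[[], Iterable], repeats: int = 3) -> float:
    """Return the best wall-clock time (in seconds) to fully iterate a fresh iterable."""
    best = float("inf")
    for _ in range(repeats):
        iterable = make_iterable()  # rebuild so every run starts from scratch
        t0 = time.perf_counter()
        for _ex in iterable:
            pass
        best = min(best, time.perf_counter() - t0)
    return best


# Stand-in workload; in the reproduction the factory would build the mapped dataset.
print(f"{time_iteration(lambda: range(100_000)):.4f} s")
```

With the reproduction's objects in scope, the two cases could be compared by passing `lambda: dataset.to_iterable_dataset().with_format("numpy").map(lambda x: x)` and the same expression with `features=features` added to `map`.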
Expected behavior
I'm not 100% sure whether passing new feature types to formatted outputs of map should be supported. Assuming it should, there should be a cost-free way to specify the new feature type, since knowing the feature type is required by interleave_datasets and concatenate_datasets, for example.
Environment info
3.0.2