huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Iterable dataset map with explicit features causes slowdown for Sequence features #7215

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Describe the bug

When calling map, it's nice to be able to pass the new feature type, and doing so is in fact required by interleave_datasets and concatenate_datasets.

However, this can cause a major slowdown for certain array feature types, because the features get re-encoded.

This is separate from the slowdown reported in #7206.
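As a rough, pure-NumPy illustration of where this kind of overhead comes from (a hypothetical stand-in for the per-example feature re-encoding, not the library's actual code path): converting each array to a Python list and back is far more expensive than passing it through untouched.

```python
import time
import numpy as np

# A batch of variable-length float32 arrays, similar to the Sequence features
# in the reproduction below.
arrays = [np.zeros(10_000, dtype=np.float32) for _ in range(50)]

# Pass-through: no re-encoding, just referencing the same arrays.
t0 = time.time()
passthrough = [a for a in arrays]
t_pass = time.time() - t0

# Stand-in for re-encoding: round-trip each array through a Python list.
t0 = time.time()
reencoded = [np.asarray(a.tolist(), dtype=np.float32) for a in arrays]
t_reenc = time.time() - t0

print(f"pass-through: {t_pass:.5f}s, re-encode: {t_reenc:.5f}s")
```

The round trip preserves the values exactly but pays a per-element Python-object cost, which is the flavor of slowdown the timings below exhibit.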

Steps to reproduce the bug

```python
from datasets import Dataset, Features, Sequence, Value
import numpy as np
import time

features = Features(
    {"array0": Sequence(feature=Value("float32"), length=-1),
     "array1": Sequence(feature=Value("float32"), length=-1)}
)
dataset = Dataset.from_dict(
    {f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [5000, 10000] * 25]
     for i in range(2)},
    features=features,
)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(t1 - t0)
```

~1.5 s on main

```python
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, features=features)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(t1 - t0)
```

~ 3 s on main

Expected behavior

I'm not 100% sure whether passing new feature types to formatted outputs of map should be supported or not. Assuming it should, there ought to be a cost-free way to specify the new feature type, since knowing the feature type is required by interleave_datasets and concatenate_datasets, for example.

Environment info

datasets version: 3.0.2