huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Slow iteration for iterable dataset with numpy formatting for array data #7206

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Describe the bug

When working with large arrays, calling with_format("numpy") and then applying map causes a significant slowdown when iterating over an iterable dataset.

Steps to reproduce the bug

import numpy as np
import time
from datasets import Dataset, Features, Array3D

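# Two columns of variable-length 3D float32 arrays (the first dimension varies per example).
features = Features({"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
# 50 examples per column, alternating between shapes (2000, 10, 10) and (1000, 10, 10).
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)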

Then

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(t1-t0)

takes 27 s, whereas

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(t1 - t0)

takes ~1 s.
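Both timings above repeat the same loop, so the comparison can be factored into a small helper. This is only a sketch: `time_iteration` is a hypothetical name, not part of `datasets`, and it runs on a stand-in iterable so that it works without the library installed.

```python
import time
from typing import Iterable, Tuple


def time_iteration(examples: Iterable) -> Tuple[int, float]:
    """Consume `examples` once; return (number of examples, elapsed seconds)."""
    # perf_counter is a monotonic, high-resolution clock intended for benchmarks.
    start = time.perf_counter()
    count = 0
    for _ in examples:
        count += 1
    return count, time.perf_counter() - start


# Stand-in iterable (an assumption for illustration, not the dataset above).
count, elapsed = time_iteration(range(1000))
print(f"{count} examples in {elapsed:.4f} s")
```

With the reproduction above, passing each `ds` to `time_iteration(ds)` should give the two numbers being compared. Returning the example count also confirms that both runs iterated over the same number of rows.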

Expected behavior

Map should not introduce a slowdown when formatting is enabled.

Environment info

datasets version: 3.0.2

tux-type commented 1 month ago

The snippet below easily eats up 32 GB of RAM. Leaving it running for a while bricked a laptop with 16 GB.

from datasets import load_dataset

dataset = load_dataset("Voxel51/OxfordFlowers102", data_dir="data").with_format("numpy")
processed_dataset = dataset.map(lambda x: x)


Similar problems occur when a real transform function is passed to .map().
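To put a number on the memory growth, one option is to wrap the call in a peak-memory probe. The sketch below uses the standard-library `tracemalloc` module; `peak_traced_memory` is a hypothetical helper name, and the workload is a stand-in allocation rather than the dataset call. Note that `tracemalloc` only sees allocations made through Python's memory allocators, so memory allocated directly by native libraries may not be counted and the result should be read as a lower bound.

```python
import tracemalloc
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def peak_traced_memory(fn: Callable[[], T]) -> Tuple[T, int]:
    """Run `fn` and return (result, peak bytes traced by tracemalloc)."""
    tracemalloc.start()
    try:
        result = fn()
        # get_traced_memory returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Stand-in workload (an assumption for illustration): allocate roughly 10 MB.
buf, peak = peak_traced_memory(lambda: bytearray(10_000_000))
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

In the scenario above, the callable would be something like `lambda: dataset.map(lambda x: x)`, ideally run on a small slice first given how quickly memory is reported to grow.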