huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Preserve features in iterable dataset.filter #7209

Closed alex-hh closed 1 month ago

alex-hh commented 1 month ago

Fixes the example in #7208. I'm not sure what other checks I should do? @lhoestq

I also haven't thought hard about the concatenate / interleave example iterables, but I think this might work, assuming the features are either all identical or None?
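
To make the assumption above concrete, here is a minimal sketch of how features could be resolved across several source iterables. It is not the code from this PR: `resolve_features` is a hypothetical helper, and a plain dict stands in for `datasets.Features`. It also takes one particular reading of the assumption, namely that each source's features are either None (unknown) or equal to every other known schema.

```python
from typing import Dict, Optional, Sequence

# Hypothetical stand-in for datasets.Features: column name -> dtype string.
Features = Dict[str, str]


def resolve_features(candidates: Sequence[Optional[Features]]) -> Optional[Features]:
    """Return the shared schema of several sources, or None if none is known.

    Assumes every entry is either None or identical to the other known ones.
    """
    known = [features for features in candidates if features is not None]
    if not known:
        # No source declares its features, so the result stays untyped.
        return None
    first = known[0]
    for other in known[1:]:
        if other != first:
            # The assumption is violated; fail loudly rather than guess.
            raise ValueError(f"Mismatched features: {first} vs {other}")
    return first


shared = resolve_features([{"a": "int64"}, None, {"a": "int64"}])
untyped = resolve_features([None, None])
```

Returning the known schema when some sources report None is a design choice of this sketch; a stricter variant could return None whenever any source is untyped.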

lhoestq commented 1 month ago

Yes, your assumption about concatenate/interleave is OK imo.

It seems the TypedExamplesIterable can slow things down. It should take formatting into account so that it doesn't convert numpy arrays to Python lists.

Right now it's slow (unrelated to your PR):

>>> import numpy as np
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"a": np.zeros((1000, 32, 32))}).to_iterable_dataset().with_format("np")
>>> filtered_ds = ds.filter(lambda x: True)
>>> %time sum(1 for _ in ds)
CPU times: user 175 ms, sys: 8.1 ms, total: 183 ms
Wall time: 184 ms
1000
>>> %time sum(1 for _ in filtered_ds)
CPU times: user 4.1 s, sys: 8.41 ms, total: 4.1 s
Wall time: 4.12 s
1000
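
The suggestion above (skip the list conversion when a format is already set) can be sketched roughly as follows. This is only an illustration under stated assumptions: `typed_examples` and `encode_to_python` are hypothetical names, and the code does not reproduce the library's actual TypedExamplesIterable.

```python
import numpy as np


def encode_to_python(example):
    """Convert every numpy array in an example to nested Python lists."""
    return {
        key: value.tolist() if isinstance(value, np.ndarray) else value
        for key, value in example.items()
    }


def typed_examples(examples, formatted):
    """Hypothetical typed iterable that is aware of the output format.

    When `formatted` is True the arrays are yielded untouched, which avoids
    the per-element cost of building Python lists.
    """
    for example in examples:
        yield example if formatted else encode_to_python(example)


rows = [{"a": np.zeros((32, 32))} for _ in range(10)]
passthrough = list(typed_examples(rows, formatted=True))
converted = list(typed_examples(rows, formatted=False))
```

In the pass-through case each yielded value is the same array object as the input, whereas the converted case allocates a nested list of Python floats for every example.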

alex-hh commented 1 month ago

> It seems the TypedExamplesIterable can slow down things, it should take formatting into account to not convert numpy arrays to python lists

This should be fixed by the updated #7207, I hope!