Preserve features in iterable dataset.filter

alex-hh commented 1 month ago

Fixes the example in #7208. I'm not sure what other checks I should do, @lhoestq?

I also haven't thought hard about the concatenate / interleave example iterables, but I think this might work, assuming that features are either all identical or None?
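
For illustration, here is a minimal sketch of that assumption, not the code in this PR. The helper name `resolve_common_features` is hypothetical, and plain dicts stand in for `datasets.Features` objects. It reads "all identical or None" strictly: either every input carries the same features, or every input carries None. Anything else is treated as outside the assumption.

```python
from typing import Optional, Sequence


def resolve_common_features(features_list: Sequence[Optional[dict]]) -> Optional[dict]:
    """Return the features shared by every input (possibly None).

    Hypothetical helper, not part of the `datasets` API. Plain dicts stand in
    for `datasets.Features` objects.
    """
    if not features_list:
        return None
    first = features_list[0]
    # Every entry must equal the first one: all identical, or all None.
    if any(features != first for features in features_list[1:]):
        raise ValueError("Inputs do not share identical features (or all None).")
    return first
```

A mix of known and unknown features raises here on purpose; whether the combined iterable should instead fall back to None in that case is a design choice this sketch does not settle.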

lhoestq commented 1 month ago

Yes your assumption on concatenate/interleave is ok imo.

It seems the TypedExamplesIterable can slow things down; it should take formatting into account so that it does not convert numpy arrays to Python lists.

Right now it's slow (unrelated to your PR):

>>> import numpy as np
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"a": np.zeros((1000, 32, 32))}).to_iterable_dataset().with_format("np")
>>> filtered_ds = ds.filter(lambda x: True)
>>> %time sum(1 for _ in ds)
CPU times: user 175 ms, sys: 8.1 ms, total: 183 ms
Wall time: 184 ms
1000
>>> %time sum(1 for _ in filtered_ds)
CPU times: user 4.1 s, sys: 8.41 ms, total: 4.1 s
Wall time: 4.12 s
1000
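
One possible reading of "take formatting into account" is sketched below in plain Python. It is an assumption, not the actual fix: `iter_typed_examples`, `encode`, and `format_type` are hypothetical names, and `encode` stands in for whatever per-example conversion the typed iterable performs. The idea is simply to skip that conversion when an output format is already set, so formatted values are passed through untouched.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Example = Dict[str, Any]


def iter_typed_examples(
    examples: Iterable[Example],
    encode: Callable[[Example], Example],
    format_type: Optional[str] = None,
) -> Iterator[Example]:
    """Yield examples, applying `encode` only when no output format is set.

    Hypothetical sketch; none of these names come from the `datasets` library.
    """
    for example in examples:
        if format_type is None:
            # No format requested: run the (potentially costly) conversion.
            yield encode(example)
        else:
            # A format is set: pass the example through unchanged.
            yield example


def to_python_lists(example: Example) -> Example:
    # Stand-in for a conversion that turns array-like values into lists.
    return {key: list(value) for key, value in example.items()}


rows = [{"a": (0.0, 0.0)}]
plain = list(iter_typed_examples(rows, to_python_lists))
formatted = list(iter_typed_examples(rows, to_python_lists, format_type="np"))
```

In this sketch `plain` holds converted copies, while `formatted` holds the original objects, which is where the time would be saved.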

alex-hh commented 1 month ago

> It seems the TypedExamplesIterable can slow down things, it should take formatting into account to not convert numpy arrays to python lists

Should be fixed by the updated #7207, I hope!

huggingface / datasets

Preserve features in iterable dataset.filter #7209