Closed alex-hh closed 1 month ago
Yes, your assumption on concatenate/interleave is OK imo.
It seems the TypedExamplesIterable can slow things down: it should take formatting into account so that it doesn't convert numpy arrays to Python lists.
Right now it's slow (unrelated to your PR):
>>> import numpy as np
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"a": np.zeros((1000, 32, 32))}).to_iterable_dataset().with_format("np")
>>> filtered_ds = ds.filter(lambda x: True)
>>> %time sum(1 for _ in ds)
CPU times: user 175 ms, sys: 8.1 ms, total: 183 ms
Wall time: 184 ms
1000
>>> %time sum(1 for _ in filtered_ds)
CPU times: user 4.1 s, sys: 8.41 ms, total: 4.1 s
Wall time: 4.12 s
1000
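For illustration, here is a minimal sketch of the kind of round trip that seems to cost the time above. It uses plain numpy only. Both function names are hypothetical and not part of `datasets`, and the real TypedExamplesIterable encodes examples against the features, which this sketch does not reproduce.

```python
import numpy as np


def roundtrip_through_lists(example):
    """Hypothetical stand-in for a typed step that ignores formatting.

    Each array is turned into nested Python lists and then rebuilt as an
    array, so every element is visited twice at Python speed.
    """
    as_lists = {key: value.tolist() for key, value in example.items()}
    return {key: np.asarray(value) for key, value in as_lists.items()}


def passthrough(example):
    """Hypothetical format-aware path: formatted arrays are returned as-is."""
    return example


example = {"a": np.zeros((32, 32))}
slow = roundtrip_through_lists(example)  # same values, but a fresh copy
fast = passthrough(example)              # same object, no copy
```

Both paths return equal arrays. The only difference is the extra conversion work per example, which is what a format-aware iterable would skip.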
> It seems the TypedExamplesIterable can slow down things, it should take formatting into account to not convert numpy arrays to python lists
This should be fixed by the updated #7207, I hope!
Fixes the example in #7208. I'm not sure what other checks I should do, @lhoestq?
I also haven't thought hard about the concatenate / interleave example iterables, but I think this might work, assuming that the features are either all identical or None?
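To make that assumption concrete, here is a rough sketch of the check it implies. `resolve_shared_features` is a hypothetical helper, not an existing `datasets` function, and plain dicts stand in for `Features` objects. It reads the assumption permissively: `None` entries are skipped and all remaining entries must be equal.

```python
from typing import Optional, Sequence


def resolve_shared_features(
    features_list: Sequence[Optional[dict]],
) -> Optional[dict]:
    """Return the features shared by all sub-iterables, or None if unknown.

    Assumption (hypothetical): each entry is either None or identical to
    every other non-None entry. Anything else is rejected.
    """
    known = [features for features in features_list if features is not None]
    if not known:
        # No sub-iterable declares features, so the combined one has none.
        return None
    first = known[0]
    for other in known[1:]:
        if other != first:
            raise ValueError("Sub-iterables declare different features")
    return first


shared = resolve_shared_features([{"a": "int64"}, None, {"a": "int64"}])
untyped = resolve_shared_features([None, None])
```

If the stricter reading is intended (all identical, or all `None`), the mixed case would need to raise as well.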