Open alex-hh opened 1 month ago
I think the problem is that the underlying ex_iterable will not use iter_arrow unless the formatting type is arrow, which leads to conversion from arrow -> python -> numpy in this case rather than arrow -> numpy.
Idea of updated fix is to use the ex_iterable's iter_arrow in any case where it's available and any formatting is specified. The formatter then works directly on arrow tables; the outputs of the formatter get passed to the function to be mapped.
With updated version:
import numpy as np
import time
from datasets import Dataset, Features, Array3D
features=Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None,10,10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,10,10), dtype=np.float32) for x in [2000,1000]*25] for i in range(2)}, features=features)
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=True, batch_size=10)
t0 = time.time()
for ex in ds:
pass
t1 = time.time()
Total time: < 0.01s (~30s on main)
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=False)
t0 = time.time()
for ex in ds:
pass
t1 = time.time()
Time: ~0.02 s (~30s on main)
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy")
t0 = time.time()
for ex in ds:
pass
t1 = time.time()
Time: ~0.02s
also now working for filter with similar performance improvements:
filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: [arr.shape[0]==2000 for arr in x["array0"]], batch_size=10, batched=True)
t0 = time.time()
for ex in ds:
filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25
0.01s vs 50s on main
filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: x["array0"].shape[0]==2000, batched=False)
t0 = time.time()
for ex in ds:
filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25
0.04s vs 50s on main
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
(the distributed tests failing in the CI are unrelated)
There also appears to be a separate? issue with chaining filter and map bc filter iter_arrow only returns _iter_arrow if arrow formatting is applied (and vv presumably)
I don't have a good minimal example atm
issue with chaining filter and map bc filter iter_arrow only returns _iter_arrow if arrow formatting is applied (and vv presumably)
Maybe related to this issue ?
ds = Dataset.from_dict({"a": range(10)}).to_iterable_dataset()
ds = ds.with_format("arrow").map(lambda x: x, features=Features({"a": Value("string")})).with_format(None)
print(list(ds)) # yields integers instead of strings
I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with formatted_python_examples_iterator
and formatted_arrow_examples_iterator
btw you can pass features=
in get_formatter()
to get a formatter that does the feature conversion at the same time as formatting
(edit:
except maybe the arrow formatter doesn't use features
yet, we can fix it like this if it's really needed
class ArrowFormatter(Formatter[pa.Table, pa.Array, pa.Table]):
def format_row(self, pa_table: pa.Table) -> pa.Table:
- return self.simple_arrow_extractor().extract_row(pa_table)
+ pa_table = self.simple_arrow_extractor().extract_row(pa_table)
+. return cast_table_to_features(pa_table, self.features) if self.features else pa_table
)
I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with formatted_python_examples_iterator and formatted_arrow_examples_iterator
Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better?
Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better?
Yes this is surely the way to go actually !
ok i've fixed the chaining issue with my last two commits.
Will see if I can refactor into a FormattedExampleIterable
The other issue you posted seems to be unrelated (maybe something to do with feature decoding?)
updated with FormattedExamplesIterable.
there might be a few unnecessary format calls once the data is already formatted - doesn't seem like a big performance bottleneck but could maybe be fixed with e.g. an is_formatted property
It also might be possible to do a wider refactor and use FormattedExamplesIterable elsewhere. But I'd personally prefer not to try that rn.
Thinking about this in the context of #7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do)
Thinking about this in the context of https://github.com/huggingface/datasets/issues/7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do)
For ArrayND
they already implement to_pylist
to decode arrow data and it can be updated to return a numpy array (see the ArrayExtensionArray
class for more details)
@lhoestq im no longer sure my specific concern about with_format(None) was well-founded - I didn't appreciate that the python formatter tries to do nothing to python objects including numpy arrays, so the existing with_format(None) should I think do what I want. Do you think with_format(None) is ok as is after all? If so think this is hopefully ready for final review!
@lhoestq I've updated to make compatible with latest changes on main, and think the current with_format None behaviour is probably fine - please let me know if there's anything else I can do!
Hi Alex, I will be less available from today and for a week. I'll review your PR and play with it once I come back if you don't mind !
I got to this by hacking around a bit but it seems to solve #7206
I have no idea if this approach makes sense or would break something else?
Could maybe work on a full pr if this looks reasonable @lhoestq ? I imagine the same issue might affect other iterable dataset methods?