apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets

alex-hh commented 1 month ago

I got to this by hacking around a bit but it seems to solve #7206

I have no idea if this approach makes sense or would break something else?

Could maybe work on a full pr if this looks reasonable @lhoestq ? I imagine the same issue might affect other iterable dataset methods?

alex-hh commented 1 month ago

I think the problem is that the underlying ex_iterable will not use iter_arrow unless the formatting type is arrow, which leads to conversion from arrow -> python -> numpy in this case rather than arrow -> numpy.

The idea of the updated fix is to use the ex_iterable's iter_arrow whenever it is available and any formatting is specified. The formatter then works directly on arrow tables, and its outputs are passed to the function being mapped.
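A minimal sketch of that dispatch, using toy stand-ins rather than the real `datasets` classes. `ToyExIterable`, `toy_formatter` and `map_formatted` are hypothetical names, and a dict of column lists stands in for an arrow table. It only illustrates the ordering (take the arrow batches when they exist, format them once, then call the mapped function) and is not the implementation in this PR.

```python
from typing import Callable, Dict, Iterator, List, Optional

Batch = Dict[str, List[int]]  # toy stand-in for an arrow table (column name -> values)


class ToyExIterable:
    """Hypothetical source that may or may not expose an arrow-style iterator."""

    def __init__(self, batches: List[Batch], has_arrow: bool = True):
        self._batches = batches
        # Mirrors the idea of `iter_arrow` being None when it is unavailable.
        self.iter_arrow: Optional[Callable[[], Iterator[Batch]]] = (
            self._iter_arrow if has_arrow else None
        )

    def _iter_arrow(self) -> Iterator[Batch]:
        yield from self._batches

    def __iter__(self) -> Iterator[Dict[str, int]]:
        # Slow path: one python example (row) at a time.
        for batch in self._batches:
            keys = list(batch)
            for row in zip(*(batch[k] for k in keys)):
                yield dict(zip(keys, row))


def toy_formatter(batch: Batch) -> Dict[str, tuple]:
    """Hypothetical formatter: converts each column in one step (a tuple stands in for an ndarray)."""
    return {name: tuple(values) for name, values in batch.items()}


def map_formatted(ex_iterable, formatter, function):
    """Prefer the arrow path whenever a formatter is set and iter_arrow is available."""
    if formatter is not None and ex_iterable.iter_arrow is not None:
        for batch in ex_iterable.iter_arrow():
            yield function(formatter(batch))  # format the whole batch directly
    else:
        for example in ex_iterable:  # fallback: per-example python objects
            yield function(example)


source = ToyExIterable([{"a": [1, 2]}, {"a": [3]}])
results = list(map_formatted(source, toy_formatter, lambda x: x))
print(results)
```

In the toy version the fallback branch is what happens on main for non-arrow formats: every example is materialised as python objects before the mapped function sees it.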

With updated version:

import numpy as np
import time
from datasets import Dataset, Features, Array3D

features=Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None,10,10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,10,10), dtype=np.float32) for x in [2000,1000]*25] for i in range(2)}, features=features)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=True, batch_size=10)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Time: <0.01s (~30s on main)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=False)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Time: ~0.02s (~30s on main)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Time: ~0.02s
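The snippets above record t0 and t1 but never print the difference. A small helper along these lines could report it. `time_iteration` is a hypothetical name, not part of `datasets`, and it uses `time.perf_counter` instead of `time.time` because the former is intended for measuring short intervals.

```python
import time
from typing import Iterable, Tuple


def time_iteration(iterable: Iterable) -> Tuple[int, float]:
    """Exhaust `iterable` and return (number of items, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    for _ in iterable:
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


# Stand-in iterable; in the benchmarks above this would be `ds`.
n_items, seconds = time_iteration(range(1000))
print(f"iterated {n_items} items in {seconds:.4f}s")
```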

alex-hh commented 1 month ago

This now also works for filter, with similar performance improvements:

filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: [arr.shape[0]==2000 for arr in x["array0"]], batch_size=10, batched=True)
t0 = time.time()
for ex in ds:
    filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25

0.01s vs 50s on main

filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: x["array0"].shape[0]==2000, batched=False)
t0 = time.time()
for ex in ds:
    filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25

0.04s vs 50s on main

lhoestq commented 1 month ago

(the distributed tests failing in the CI are unrelated)

alex-hh commented 1 month ago

There also appears to be a separate(?) issue with chaining filter and map, because filter's iter_arrow only returns _iter_arrow if arrow formatting is applied (and presumably vice versa).

I don't have a good minimal example atm

lhoestq commented 1 month ago

issue with chaining filter and map, because filter's iter_arrow only returns _iter_arrow if arrow formatting is applied (and presumably vice versa)

Maybe related to this issue ?

from datasets import Dataset, Features, Value

ds = Dataset.from_dict({"a": range(10)}).to_iterable_dataset()
ds = ds.with_format("arrow").map(lambda x: x, features=Features({"a": Value("string")})).with_format(None)
print(list(ds))  # yields integers instead of strings

lhoestq commented 1 month ago

I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with formatted_python_examples_iterator and formatted_arrow_examples_iterator

btw you can pass features= in get_formatter() to get a formatter that does the feature conversion at the same time as formatting

(edit:

except maybe the arrow formatter doesn't use features yet, we can fix it like this if it's really needed

class ArrowFormatter(Formatter[pa.Table, pa.Array, pa.Table]):
    def format_row(self, pa_table: pa.Table) -> pa.Table:
-        return self.simple_arrow_extractor().extract_row(pa_table)
+        pa_table = self.simple_arrow_extractor().extract_row(pa_table)
+        return cast_table_to_features(pa_table, self.features) if self.features else pa_table

)

alex-hh commented 1 month ago

I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with formatted_python_examples_iterator and formatted_arrow_examples_iterator

Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better?

lhoestq commented 1 month ago

Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better?

Yes this is surely the way to go actually !

alex-hh commented 1 month ago

OK, I've fixed the chaining issue with my last two commits.

Will see if I can refactor into a FormattedExampleIterable

The other issue you posted seems to be unrelated (maybe something to do with feature decoding?)

alex-hh commented 1 month ago

updated with FormattedExamplesIterable.

there might be a few unnecessary format calls once the data is already formatted - doesn't seem like a big performance bottleneck but could maybe be fixed with e.g. an is_formatted property
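One way the is_formatted idea could look, as a toy sketch only. `ToyFormattedIterable`, its `is_formatted` property and the call counter are all hypothetical and do not mirror the real FormattedExamplesIterable. The wrapper skips the format call when its source already reports formatted output.

```python
from typing import Callable, Iterable, Iterator, List


class ToyFormattedIterable:
    """Hypothetical wrapper that formats examples unless the source already did."""

    def __init__(self, source: Iterable, formatter: Callable):
        self.source = source
        self.formatter = formatter

    @property
    def is_formatted(self) -> bool:
        # Anything yielded by this wrapper has been through the formatter.
        return True

    def __iter__(self) -> Iterator:
        # Plain sources (lists, generators) have no `is_formatted`, so default to False.
        already_formatted = getattr(self.source, "is_formatted", False)
        for example in self.source:
            yield example if already_formatted else self.formatter(example)


calls: List[int] = []


def counting_formatter(example: int) -> int:
    """Toy formatter that records each call so redundant work is visible."""
    calls.append(example)
    return example * 10


inner = ToyFormattedIterable([1, 2, 3], counting_formatter)
outer = ToyFormattedIterable(inner, counting_formatter)  # redundant second wrap
result = list(outer)
print(result, calls)
```

The sketch assumes both wrappers use the same format. A real check would also need to compare the format type and features before skipping the call.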

It also might be possible to do a wider refactor and use FormattedExamplesIterable elsewhere. But I'd personally prefer not to try that rn.

alex-hh commented 1 month ago

Thinking about this in the context of #7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do)

lhoestq commented 1 month ago

Thinking about this in the context of https://github.com/huggingface/datasets/issues/7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do)

For ArrayND they already implement to_pylist to decode arrow data and it can be updated to return a numpy array (see the ArrayExtensionArray class for more details)

alex-hh commented 4 weeks ago

@lhoestq I'm no longer sure my specific concern about with_format(None) was well-founded. I hadn't appreciated that the python formatter tries to leave python objects, including numpy arrays, untouched, so the existing with_format(None) should, I think, do what I want. Do you think with_format(None) is OK as is after all? If so, I think this is ready for final review!

alex-hh commented 3 weeks ago

@lhoestq I've updated the PR to be compatible with the latest changes on main, and I think the current with_format(None) behaviour is probably fine. Please let me know if there's anything else I can do!

lhoestq commented 2 weeks ago

Hi Alex, I will be less available from today and for a week. I'll review your PR and play with it once I come back if you don't mind !

apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets #7207