huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

fix unbatched arrow map for iterable datasets #7204

Closed: alex-hh closed this 1 month ago

alex-hh commented 1 month ago

Fixes the bug that occurs when applying map without batching to an Arrow-formatted iterable dataset, as described here:

https://github.com/huggingface/datasets/issues/6833#issuecomment-2399903885


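from datasets import load_dataset

# Stream the dataset so it is loaded as an iterable dataset
ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
# Apply an unbatched identity map to the Arrow-formatted dataset
ds = ds.with_format("arrow").map(lambda x: x)
# Iterating over the mapped dataset is what triggered the bug
for ex in ds:
    pass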
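
To make "unbatched map over Arrow-formatted data" more concrete, here is a library-free sketch of the semantics I would expect: the stream arrives as column-oriented chunks, and an unbatched map calls the function on one-row slices of them. Everything here is an assumption for illustration: plain dicts of lists stand in for Arrow tables, and `num_rows`, `iter_single_row_tables` and `unbatched_map` are hypothetical names, not part of `datasets` and not how the library implements this.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Column-oriented stand-in for an Arrow table (assumption, not pyarrow).
Table = Dict[str, List]


def num_rows(table: Table) -> int:
    """Number of rows in a column-oriented table (0 if it has no columns)."""
    return len(next(iter(table.values()))) if table else 0


def iter_single_row_tables(tables: Iterable[Table]) -> Iterator[Table]:
    """Split each incoming table into one-row tables, preserving order."""
    for table in tables:
        for i in range(num_rows(table)):
            yield {name: column[i : i + 1] for name, column in table.items()}


def unbatched_map(
    tables: Iterable[Table], fn: Callable[[Table], Table]
) -> Iterator[Table]:
    """Lazily apply fn to one-row tables, one example at a time."""
    for row_table in iter_single_row_tables(tables):
        yield fn(row_table)


# Two chunks of a toy stream; the identity function mirrors `lambda x: x` above.
stream = [
    {"text": ["good", "bad"], "label": [1, 0]},
    {"text": ["fine"], "label": [1]},
]
result = list(unbatched_map(stream, lambda table: table))
```

In this sketch the function always receives a table-shaped object with exactly one row, even though the underlying stream is chunked. That is the behaviour the snippet above relies on when it iterates over the mapped dataset.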

@lhoestq

HuggingFaceDocBuilderDev commented 1 month ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.