ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] `from_huggingface()` does not support `override_num_blocks` for non-streaming HF Datasets #47507

Closed scottjlee closed 1 month ago

scottjlee commented 1 month ago

What happened + What you expected to happen

When override_num_blocks is passed to from_huggingface() with a non-streaming Hugging Face dataset, the parameter is silently ignored. The currently suggested workaround is to add a .repartition(N) operator after from_huggingface().
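
A minimal sketch of that workaround, assuming Ray and the Hugging Face `datasets` package are installed. The helper name `from_huggingface_with_blocks` is hypothetical and is not part of Ray; it only wraps the two calls named above.

```python
def from_huggingface_with_blocks(hf_dataset, num_blocks):
    """Hypothetical helper (not a Ray API).

    Loads a non-streaming Hugging Face dataset and repartitions it
    explicitly, because override_num_blocks is ignored on this path.
    """
    # Imported lazily so this sketch can be loaded without Ray installed.
    import ray

    ds = ray.data.from_huggingface(hf_dataset)
    # repartition() is lazy; the new block count applies once the
    # dataset is executed or materialized.
    return ds.repartition(num_blocks)
```

With the reproduction script's `train_set`, `from_huggingface_with_blocks(train_set, 4).materialize().num_blocks()` should then report 4 rather than 1.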

We should raise an exception when a non-streaming HF dataset is passed and override_num_blocks is provided.
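
The proposed check could be sketched roughly as follows. This is an illustration only, not the code in Ray: the two stand-in classes and the function name `validate_override_num_blocks` are hypothetical, and real code would test against `datasets.Dataset` and `datasets.IterableDataset` instead.

```python
class FakeMapStyleDataset:
    """Hypothetical stand-in for a non-streaming datasets.Dataset."""


class FakeIterableDataset:
    """Hypothetical stand-in for a streaming datasets.IterableDataset."""


def validate_override_num_blocks(
    dataset,
    override_num_blocks=None,
    streaming_types=(FakeIterableDataset,),
):
    """Raise if override_num_blocks is set for a non-streaming dataset.

    Hypothetical sketch: fail loudly instead of silently ignoring the
    parameter, and point the user at the repartition() workaround.
    """
    if override_num_blocks is None:
        return  # nothing was requested, so nothing to validate
    if not isinstance(dataset, streaming_types):
        raise ValueError(
            "override_num_blocks is not supported for non-streaming "
            "Hugging Face datasets; call .repartition(N) on the result "
            "of from_huggingface() instead."
        )
```

Raising `ValueError` here is a design assumption; the point is only that the caller gets an error rather than a dataset with an unexpected block count.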

In addition, we should make sure that the override_num_blocks parameter and the other inputs are passed into the fallback path for iterable Datasets: https://github.com/ray-project/ray/blob/ca8592251741b3419df91a1eec1ccc329bd7b86b/python/ray/data/read_api.py#L2845
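
As a rough illustration of what forwarding the inputs means, the sketch below uses pure-Python stand-ins. `fallback_read` and `from_huggingface_sketch` are hypothetical names, and `**other_inputs` is a placeholder for whatever other arguments the real function accepts; none of this reproduces the linked code.

```python
def fallback_read(dataset, *, override_num_blocks=None, **other_inputs):
    """Hypothetical stand-in for the iterable-dataset fallback reader.

    It only records what it received so the forwarding can be checked.
    """
    return {
        "dataset": dataset,
        "override_num_blocks": override_num_blocks,
        "other_inputs": other_inputs,
    }


def from_huggingface_sketch(dataset, *, override_num_blocks=None, **other_inputs):
    """Hypothetical entry point that forwards every input to the fallback.

    The bug described above corresponds to calling fallback_read(dataset)
    with no keyword arguments, which drops the caller's settings.
    """
    return fallback_read(
        dataset,
        override_num_blocks=override_num_blocks,
        **other_inputs,
    )
```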

Versions / Dependencies

2.35

Reproduction script

import ray
from datasets import load_dataset

dataset = load_dataset("GEM/viggo", trust_remote_code=True)
train_set = dataset["train"]
train_ds = ray.data.from_huggingface(train_set, override_num_blocks=4)
assert train_ds.num_blocks() == 4  # Fails: train_ds.num_blocks() returns 1

Issue Severity

Medium: It is a significant difficulty but I can work around it.

xingyu-long commented 1 month ago

Hi @scottjlee, could I take this one? thanks!

scottjlee commented 1 month ago

yeah feel free to go for it @xingyu-long! let me know if you have any questions and feel free to assign the PR to me once ready.

xingyu-long commented 1 month ago

Just submitted the PR https://github.com/ray-project/ray/pull/47559.

@scottjlee it seems I cannot add you as an assignee. Could you take a look at the link above and review it when you have time? Thanks!