Closed scottjlee closed 1 month ago
Hi @scottjlee, could I take this one? thanks!
yeah feel free to go for it @xingyu-long! let me know if you have any questions and feel free to assign the PR to me once ready.
Just submitted the PR https://github.com/ray-project/ray/pull/47559.
@scottjlee it seems I cannot set you as the assignee. Could you take a look at the link above and review it when you have time? Thanks!
What happened + What you expected to happen
When using override_num_blocks with non-streaming Hugging Face datasets, the parameter is ignored. The current suggested workaround is to add a .repartition(N) operator after from_huggingface().
We should throw an exception when a non-streaming HF dataset is passed and override_num_blocks is provided. In addition, we should make sure to pass the override_num_blocks parameter and other inputs into the fallback path for iterable Datasets: https://github.com/ray-project/ray/blob/ca8592251741b3419df91a1eec1ccc329bd7b86b/python/ray/data/read_api.py#L2845
Versions / Dependencies
2.35
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.
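As a rough illustration of the check requested under "What happened", here is a minimal, Ray-free sketch. It is not Ray's actual implementation: FakeDataset, FakeIterableDataset, and check_override_num_blocks are hypothetical stand-ins invented for this example, and the error message wording is an assumption.

```python
from typing import Optional


class FakeDataset:
    """Hypothetical stand-in for a non-streaming Hugging Face dataset."""


class FakeIterableDataset:
    """Hypothetical stand-in for a streaming (iterable) Hugging Face dataset."""


def check_override_num_blocks(
    dataset: object, override_num_blocks: Optional[int]
) -> None:
    """Raise instead of silently ignoring override_num_blocks.

    Assumption: only the non-streaming case ignores the parameter,
    as described in the report above.
    """
    is_streaming = isinstance(dataset, FakeIterableDataset)
    if override_num_blocks is not None and not is_streaming:
        raise ValueError(
            "override_num_blocks is ignored for non-streaming Hugging Face "
            "datasets; call .repartition(N) on the result instead."
        )
```

With a check like this, passing a non-streaming dataset together with override_num_blocks fails loudly and points the user at the .repartition(N) workaround, while streaming datasets and calls that omit the parameter pass through unchanged.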