huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Bug in HuggingfaceDatasetReader in streaming mode #308

Open habanoz opened 1 day ago

habanoz commented 1 day ago

The bug is in this line of HuggingfaceDatasetReader:

ex_iterable = dst._ex_iterable.shard_data_sources(rank, world_size)

The order of the rank and world_size arguments is incorrect: rank is passed as the num_shards parameter and world_size is passed as the index parameter.

https://github.com/huggingface/datasets/blob/06c3235a640d00bf59223ebabf3cb489a2891767/src/datasets/iterable_dataset.py#L144

This bug ruins sharding in streaming mode.
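
To illustrate the effect, here is a minimal sketch using a toy stand-in, not the real datasets class. `ToyShardable` and its strided selection are hypothetical; only the parameter order (num_shards first, index second) is taken from the description above. Passing the arguments by keyword is one way to guard against this kind of positional mix-up, assuming the installed datasets version uses those parameter names.

```python
from typing import List


class ToyShardable:
    """Hypothetical stand-in for an object with shardable data sources."""

    def __init__(self, sources: List[str]) -> None:
        self.sources = sources

    def shard_data_sources(self, num_shards: int, index: int) -> "ToyShardable":
        # Assumed behaviour for illustration: keep every num_shards-th
        # source, starting at position `index`.
        return ToyShardable(self.sources[index::num_shards])


files = ToyShardable([f"file_{i}" for i in range(8)])
rank, world_size = 1, 4

# Buggy call: rank lands in num_shards and world_size lands in index.
buggy = files.shard_data_sources(rank, world_size)

# Keyword arguments make the intent explicit and survive reordering.
fixed = files.shard_data_sources(num_shards=world_size, index=rank)

print(buggy.sources)  # ['file_4', 'file_5', 'file_6', 'file_7']
print(fixed.sources)  # ['file_1', 'file_5']
```

In this toy, the swapped call silently returns the wrong files for rank 1 instead of raising an error, which is why the mistake is easy to miss.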

hynky1999 commented 1 day ago

Hi, good spot. I remember noticing this too but forgot to create a PR. The issue is that it's a private method, and it was changed a month ago: https://github.com/huggingface/datasets/commit/65f6eb54aa0e8bb44cea35deea28e0e8fecc25b9#diff-edc4da5f2179552e25f4f3dc9d6bf07265b68bbef048a8f712e798520a23d048L103

So now the args are different.

Do you think you could implement the fix? (Fix the line and bump the datasets version so that it doesn't clash.)

habanoz commented 1 day ago

@hynky1999 I have created PR #309.