Open habanoz opened 1 day ago
Hi, good spot. I remember noticing this too, I just forgot to create a PR. The issue is that it's a private method and it was changed a month ago: https://github.com/huggingface/datasets/commit/65f6eb54aa0e8bb44cea35deea28e0e8fecc25b9#diff-edc4da5f2179552e25f4f3dc9d6bf07265b68bbef048a8f712e798520a23d048L103
So now the args are different.
Do you think you could implement the fix? (fix the line and bump the datasets version so that it doesn't clash)
@hynky1999 I have created PR #309.
The bug is self-evident.
ex_iterable = dst._ex_iterable.shard_data_sources(rank, world_size)
The rank and world_size arguments are passed in the wrong order: rank is assigned to the num_shards parameter and world_size is assigned to the index parameter.
https://github.com/huggingface/datasets/blob/06c3235a640d00bf59223ebabf3cb489a2891767/src/datasets/iterable_dataset.py#L144
This bug ruins sharding in streaming mode.
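To illustrate the argument swap described above, here is a minimal sketch. The function below is a hypothetical stand-in, not the actual datasets implementation; it only assumes the parameter order (num_shards, index) stated in this issue, and the 8-shard source list and the range check are made up for the example.

```python
def shard_data_sources(num_shards, index):
    """Hypothetical stand-in: return the source shard ids owned by `index`."""
    if not 0 <= index < num_shards:
        raise ValueError(f"index {index} is out of range for {num_shards} shards")
    all_shards = list(range(8))  # pretend the dataset has 8 source shards
    # Strided assignment: shard i goes to the worker with i % num_shards == index.
    return all_shards[index::num_shards]


rank, world_size = 1, 4

# Call order from the issue: rank lands in num_shards, world_size in index.
try:
    buggy = shard_data_sources(rank, world_size)
except ValueError:
    buggy = None  # this stand-in rejects index=4 with num_shards=1

# Passing by keyword makes the intent explicit and survives signature reordering.
fixed = shard_data_sources(num_shards=world_size, index=rank)
print(fixed)
```

Since this is a private method whose signature already changed once, keyword arguments may be the safer choice for the fix, provided the parameter names match the pinned datasets version.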