I first thought this was a mistake of mine, and I also posted about it on Stack Overflow: https://stackoverflow.com/questions/78913797/iterating-a-huggingface-dataset-from-disk-using-generator-seems-broken-how-to-d
It seems to me the issue is the caching step in the `shuffle()` implementation:
because the shuffle happens after checking the cache, the RNG state won't advance if the cache is used. This is VERY confusing, and it is not documented.
My proposal is that you remove the API for passing a `Generator` and keep only the seed-based API, since that one is functional and cache-compatible.
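For illustration, here is a minimal sketch (not code from this report) of how the seed-based API covers the per-epoch use case described below; the dataset contents and seed values are made up:

```python
import numpy as np
from datasets import Dataset

ds = Dataset.from_dict({"x": list(range(1000))})
rng = np.random.default_rng(seed=0)

for epoch in range(3):
    # Draw an explicit integer seed per epoch and pass it to shuffle().
    # Each distinct seed gives a distinct cache fingerprint, so the cache
    # cannot silently hand back the previous epoch's shuffling.
    epoch_seed = int(rng.integers(0, 2**32))
    shuffled = ds.shuffle(seed=epoch_seed)
    print(epoch, shuffled[0]["x"])
```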
Describe the bug
Create a dataset. Save it to disk. Load it from disk. Shuffle it using a `np.random.Generator`. Iterate. Shuffle again. Iterate. The two iterations differ, since the supplied `np.random.Generator` has progressed between the shuffles. Load the dataset from disk again. Shuffle and iterate: you see the same result as before. Shuffle and iterate again, and this time the shuffling is not the same as on the previous run.
The motivation is that I have a deep learning training loop in which I want a new shuffling at every epoch. Instead, I get the same shuffling.
Steps to reproduce the bug
Run the code below two times.
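The reproduction script itself is not included above; the following is a minimal sketch of the described steps (the path, dataset contents, and loop length are made up, so the exact numbers printed will differ from the "741" quoted below). On the first run all printed values vary; on the second run the cached shuffles are reused, the `Generator` does not advance, and the shuffling is no longer fresh:

```python
import os

import numpy as np
from datasets import Dataset, load_from_disk

rng = np.random.default_rng(seed=0)

# Create a dataset and save it to disk (only on the first run of the script).
if not os.path.exists("tmp_ds"):
    Dataset.from_dict({"x": list(range(1000))}).save_to_disk("tmp_ds")

ds = load_from_disk("tmp_ds")
print("first loop:")
for _ in range(5):
    # Shuffle with the Generator and peek at the first element.  The
    # Generator is expected to advance on every call, giving a new
    # shuffling each time.
    print(ds.shuffle(generator=rng)[0]["x"], end=", ")
print()

# Load the dataset from disk again and repeat.
ds = load_from_disk("tmp_ds")
print("second loop:")
for _ in range(5):
    print(ds.shuffle(generator=rng)[0]["x"], end=", ")
print()
```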
The output of the second loop, on the second run, is only "741, 741, 741, …", which is not the desired output.
Expected behavior
I want the dataset to get a new shuffling at every epoch, since I provide it with a `Generator` for shuffling.
Environment info
`datasets` version 2.21.0, Ubuntu Linux.