mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Different lists of examples when shuffle == False #749

Open experiencor opened 1 month ago

experiencor commented 1 month ago

OS: macOS
mosaicml-streaming==0.7.6

To reproduce

Steps to reproduce the behavior:

from streaming import StreamingDataset, Stream

streams = [
    Stream(
        local="/tmp/dataset",
        remote='s3://path/to/dataset',
        choose=10,
    ),
]

dataset = StreamingDataset(
    streams=streams,
    shuffle=False,
    batch_size=1,
    predownload=8
)

for example in dataset:
    print(example["id"])

print("=" * 10)

for example in dataset:
    print(example["id"])

Expected behavior

Both iterations over the dataset should yield the same list of examples when shuffle=False.

Actual behavior

The two iterations over the dataset yield different lists of examples.
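The mismatch can also be checked programmatically instead of by comparing printed output. This is a minimal sketch: collect_ids and first_mismatch are hypothetical helpers written for this issue, not part of the streaming API, and the "id" key is simply the column used in the repro above.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for passes of unequal length


def collect_ids(dataset, key="id"):
    """Iterate once over `dataset` and return the value of `key` per example."""
    return [example[key] for example in dataset]


def first_mismatch(first, second):
    """Return the index of the first differing position, or None if identical."""
    pairs = zip_longest(first, second, fillvalue=_MISSING)
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return index
    return None


# Stand-in data so the sketch runs without a real dataset.
pass_one = collect_ids([{"id": 3}, {"id": 7}, {"id": 9}])
pass_two = collect_ids([{"id": 3}, {"id": 8}, {"id": 9}])
print(first_mismatch(pass_one, pass_two))
```

With the dataset from the repro, collect_ids(dataset) would be called twice and the two results passed to first_mismatch.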


XiaohanZhangCMU commented 1 month ago

@experiencor The reason is that, with "choose" set on the stream, StreamingDataset upsamples or downsamples each stream. The random seed used for that sampling depends on the epoch, so the examples selected in the 1st epoch differ from those selected in the 2nd epoch. The randomness comes from here

This is admittedly a bit counterintuitive since you have shuffle=False. We'll hopefully put in a fix for it soon.
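To illustrate the explanation above, here is a minimal sketch of how an epoch-dependent seed produces a different subsample on every pass even when the order is never shuffled. It is not the library's code: choose_samples is a hypothetical function, deriving the seed as seed + epoch is an assumption made for illustration, and only the downsampling case (choose smaller than the stream size) is covered.

```python
import random


def choose_samples(num_samples, choose, seed, epoch):
    """Pick `choose` sample IDs out of `num_samples` for one epoch.

    Hypothetical stand-in for per-stream resampling, not the library's code.
    """
    # Assumption: the RNG is seeded from a base seed plus the epoch number.
    rng = random.Random(seed + epoch)
    chosen = rng.sample(range(num_samples), choose)
    # No shuffling: the chosen IDs are emitted in ascending order.
    return sorted(chosen)


epoch0 = choose_samples(1000, 10, seed=9176, epoch=0)
epoch1 = choose_samples(1000, 10, seed=9176, epoch=1)
epoch0_again = choose_samples(1000, 10, seed=9176, epoch=0)

print(epoch0 == epoch0_again)  # same epoch, same seed: identical selection
print(epoch0 == epoch1)        # different epoch: a different selection
```

In this sketch the order within each pass is deterministic, but the set of chosen IDs changes with the epoch, which matches the behavior reported in this issue.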