Optionally make data loading more deterministic

huggingface / open-muse

Open reproduction of MUSE for fast text2image generation.

https://huggingface.co/openMUSE

Apache License 2.0

Optionally make data loading more deterministic #71

Open isamu-isozaki opened 1 year ago

isamu-isozaki commented 1 year ago

Every time we start or resume a training run, we randomly resample the shards (with replacement) and sample examples from a buffer. This means our data loading is not deterministic. We also don't do epoch-based training; we only use epochs for bookkeeping and so that the same training loop can be reused with other datasets/loaders.

We should optionally make this more deterministic for reproducibility.
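
As a rough illustration of what an opt-in deterministic mode could look like, here is a minimal sketch that uses only the Python standard library. It is not the open-muse loader: `make_rng`, `resample_shards`, and `buffered_shuffle` are hypothetical names, and the sketch assumes the run can recover a base seed and an epoch counter (for example from a checkpoint) when it resumes.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def make_rng(seed: int, epoch: int, worker_id: int = 0) -> random.Random:
    # Hypothetical helper: derive one RNG per (seed, epoch, worker).
    # Seeding random.Random with a str is reproducible across processes,
    # unlike the built-in hash(), which is salted per interpreter run.
    return random.Random(f"{seed}-{epoch}-{worker_id}")


def resample_shards(shards: List[str], n: int, rng: random.Random) -> List[str]:
    # Sample n shards with replacement, driven only by the given RNG.
    return [rng.choice(shards) for _ in range(n)]


def buffered_shuffle(
    items: Iterable[T], buffer_size: int, rng: random.Random
) -> Iterator[T]:
    # Shuffle-buffer sampling: fill a buffer, then emit a random element
    # each time a new item arrives. The order depends only on the RNG state.
    buf: List[T] = []
    for item in items:
        buf.append(item)
        if len(buf) >= buffer_size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # Drain whatever is left in a reproducible random order.
    rng.shuffle(buf)
    yield from buf


if __name__ == "__main__":
    # Placeholder shard names, for illustration only.
    shards = [f"shard-{i:05d}.tar" for i in range(8)]
    first = resample_shards(shards, 4, make_rng(seed=0, epoch=3))
    second = resample_shards(shards, 4, make_rng(seed=0, epoch=3))
    print(first == second)  # same seed and epoch give the same shard order
```

With this approach, resuming reproduces the shard order only if the seed and epoch counter are restored. Resuming in the middle of an epoch would also require skipping the samples already consumed, which this sketch does not handle.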

pcuenca commented 1 year ago

I recently came across Mosaic ML's dataset streaming library: https://github.com/mosaicml/streaming. Haven't used it yet, but it looks interesting.

isamu-isozaki commented 1 year ago

@pcuenca Thanks for the link!