Shuffling Multiple Datasets in Webdataset format

mlfoundations / open_clip

An open source implementation of CLIP.

Hi, thank you for the amazing repo!

I'm currently trying to train a CLIP model on multiple datasets in WebDataset format, and I have some questions about shuffling.

When using multiple training datasets, are they shuffled together during training, or are they consumed sequentially (e.g., data1 -> data2 -> data3 ...)?
It seems that the --dataset-resampled parameter samples the shards with replacement. Does that mean some instances will be trained on more than once while others will not be trained on at all? If so, what is the advantage of using this parameter?
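
To make the question about sampling with replacement concrete, here is a minimal sketch in plain Python. It is not open_clip's or webdataset's actual code: the shard names and the `sample_shards` helper are hypothetical, and it only illustrates the general difference between shuffling a shard list and resampling it with replacement.

```python
import random
from collections import Counter


def sample_shards(shards, n_draws, with_replacement, seed=0):
    """Hypothetical helper: pick n_draws shards from a combined shard list."""
    rng = random.Random(seed)
    if with_replacement:
        # Each draw is independent, so a shard can repeat or never appear.
        return [rng.choice(shards) for _ in range(n_draws)]
    # Plain shuffle: every shard appears at most once.
    order = list(shards)
    rng.shuffle(order)
    return order[:n_draws]


# Two made-up datasets, four shards each, merged into one list.
shards = [f"data1-{i:04d}.tar" for i in range(4)]
shards += [f"data2-{i:04d}.tar" for i in range(4)]

shuffled = sample_shards(shards, len(shards), with_replacement=False)
resampled = sample_shards(shards, len(shards), with_replacement=True)

counts = Counter(resampled)
unseen = sorted(set(shards) - set(resampled))

print("shuffled :", shuffled)
print("resampled:", resampled)
print("repeated :", {s: c for s, c in counts.items() if c > 1})
print("unseen   :", unseen)
```

In the shuffled case every shard from both datasets appears exactly once in a mixed order. In the resampled case the same number of draws may repeat some shards and skip others, which is the behavior I am asking about.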

Thank you :)

Shuffling Multiple Datasets in Webdataset format #547