mlfoundations / open_clip

An open source implementation of CLIP.

Shuffling Multiple Datasets in Webdataset format #547

Open JJumSSu opened 1 year ago

JJumSSu commented 1 year ago

Hi, thank you for the amazing repo!

I'm currently trying to train a CLIP model using multiple datasets in a webdataset format. While doing so, I have some questions regarding the shuffling.

  1. When using multiple training datasets, are they shuffled together during training, or are they trained on sequentially (e.g., data1 -> data2 -> data3 ...)?
  2. It seems that the parameter --dataset-resampled samples shards with replacement. Does that mean some instances will be seen more than once and others not at all? If so, what is the advantage of using this parameter?

Thank you :)

gabrielilharco commented 9 months ago

Hi @JJumSSu. Re. 1, all shards are shuffled, so the datasets are not trained on one after another. Re. 2, the advantage is that it allows us to save checkpoints more frequently (at fractions of an epoch) by setting --train-num-samples to a lower value. This is important for larger datasets.
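
To make the distinction concrete, here is a minimal sketch in plain Python of shuffling shards without replacement versus sampling them with replacement. It is not open_clip's or webdataset's actual code: the shard names and the functions `shuffled_pass` and `resampled_stream` are hypothetical and only illustrate the idea.

```python
import random

# Hypothetical shard names for three datasets (not taken from the thread).
shards = (
    [f"data1-{i:03d}.tar" for i in range(4)]
    + [f"data2-{i:03d}.tar" for i in range(4)]
    + [f"data3-{i:03d}.tar" for i in range(4)]
)


def shuffled_pass(shards, seed):
    """One pass without replacement: every shard appears exactly once."""
    rng = random.Random(seed)
    order = list(shards)
    rng.shuffle(order)
    return order


def resampled_stream(shards, num_draws, seed):
    """Draw shards with replacement: repeats and omissions are possible."""
    rng = random.Random(seed)
    return [rng.choice(shards) for _ in range(num_draws)]


# Shards from all three datasets are interleaved, not visited dataset by dataset.
one_pass = shuffled_pass(shards, seed=0)

# A short "epoch" can stop after any number of draws and still be an
# unbiased sample of the whole pool, which is what makes frequent
# checkpointing at fractions of a full pass practical.
short_epoch = resampled_stream(shards, num_draws=5, seed=0)

print(one_pass)
print(short_epoch)
```

In the shuffled pass, stopping early would leave whichever shards happened to come last unseen until the next pass. With resampling, every draw comes from the full pool, so any stopping point is equally valid, at the cost of some shards repeating and others being skipped within a given window.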