mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets
http://datacomp.ai/

Pretraining dataset #73

Open mactavish91 opened 6 months ago

mactavish91 commented 6 months ago

Thank you for your excellent work. I'm currently training my own CLIP model and have a question. If I train on the LAION-2B, COYO-700M, and DataComp datasets together, will it yield better results? Should I deduplicate the data?

gabrielilharco commented 6 months ago

Hi @mactavish91, we don't have those exact experiments, but there are some relevant ones in Table 18 of our paper (https://arxiv.org/pdf/2304.14108.pdf).