nishadsinghi / CleanCLIP

Official PyTorch implementation of "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning" @ ICCV 2023
https://arxiv.org/abs/2303.03323
MIT License

About training CLIP with CC3M data #8

Closed: Seung-B closed this issue 4 months ago

Seung-B commented 4 months ago

Hello,

I would like to train the CLIP model from scratch on CC3M data. However, when I create and use the dataloader, RAM usage grows significantly until the process terminates. Maybe it's because the dataset is large. Have you experienced any similar problems?

nishadsinghi commented 4 months ago

Hey @Seung-B, thanks for your interest in our work! We did not encounter this issue. Have you tried reducing the batch size?
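
For readers hitting the same symptom, here is a minimal sketch of why batch size and lazy loading matter for memory. It is not the CleanCLIP data pipeline, and the cause of the problem in this thread was never confirmed. The class `LazyImageTextDataset`, the helper `iter_batches`, and the CSV column names `image` and `caption` are all hypothetical. The idea is that the dataset object holds only paths and captions, and image bytes are read when a sample is requested, so the data held at any moment scales with the batch size rather than with the dataset size.

```python
import csv
from pathlib import Path
from typing import Iterator, List, Tuple


class LazyImageTextDataset:
    """Keeps only (relative path, caption) pairs in RAM.

    Hypothetical helper: image bytes are read from disk on access,
    never cached on the object.
    """

    def __init__(self, index_csv: str, image_key: str = "image",
                 caption_key: str = "caption"):
        # Image paths in the CSV are assumed to be relative to the CSV's folder.
        self.root = Path(index_csv).parent
        with open(index_csv, newline="", encoding="utf-8") as f:
            self.rows: List[Tuple[str, str]] = [
                (row[image_key], row[caption_key]) for row in csv.DictReader(f)
            ]

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int) -> Tuple[bytes, str]:
        rel_path, caption = self.rows[idx]
        # The file is opened here, not in __init__, so the dataset
        # object itself never holds image data.
        return (self.root / rel_path).read_bytes(), caption


def iter_batches(dataset, batch_size: int) -> Iterator[List[Tuple[bytes, str]]]:
    """Yield lists of at most batch_size samples, one batch at a time."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]
```

Because the class implements `__len__` and `__getitem__`, it follows the map-style dataset protocol. In a real training loop the raw bytes would still need to be decoded and transformed, which is left out here.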

Seung-B commented 4 months ago

Thanks for your response! The full CC3M dataset can no longer be downloaded, so I am using the copy on Hugging Face (2.95M samples). Many of the CC3M links have expired, so I can get only 2.2M samples. Do you know where I can get the full dataset?

Seung-B commented 4 months ago

Thank you for responding to my question! I used the Hugging Face dataset and, although it took a long time, I obtained 2.95M CC3M samples by saving them locally one by one.

https://huggingface.co/datasets/pixparse/cc3m-wds
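
Saving samples locally one by one can be sketched as below, assuming a shard from the dataset has already been downloaded as a tar file. The sketch relies on the usual WebDataset convention that each sample is stored as files sharing a key, such as `000000.jpg` and `000000.txt`. Whether every shard of this particular dataset follows that layout is an assumption, as are the function name `extract_shard` and the output CSV columns `image` and `caption`. Only the Python standard library is used.

```python
import csv
import tarfile
from pathlib import Path

IMAGE_EXTS = {"jpg", "jpeg", "png"}


def extract_shard(shard_path: str, out_dir: str, index_csv: str) -> int:
    """Save each image in a WebDataset-style tar shard to out_dir and append
    (image, caption) rows to index_csv. Returns the number of pairs saved.

    Hypothetical helper; assumes members are named <key>.<ext>.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    write_header = not Path(index_csv).exists()
    saved = 0
    pending = {}  # sample key -> {"image": file name, "caption": text}
    with tarfile.open(shard_path) as tar, \
            open(index_csv, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["image", "caption"])
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.rpartition(".")
            if ext.lower() in IMAGE_EXTS:
                # Write the image to disk immediately so its bytes
                # do not accumulate in memory.
                name = f"{Path(key).name}.{ext}"
                (out / name).write_bytes(tar.extractfile(member).read())
                pending.setdefault(key, {})["image"] = name
            elif ext == "txt":
                text = tar.extractfile(member).read().decode("utf-8").strip()
                pending.setdefault(key, {})["caption"] = text
            else:
                continue  # ignore metadata files such as .json
            sample = pending[key]
            if "image" in sample and "caption" in sample:
                writer.writerow([sample["image"], sample["caption"]])
                del pending[key]
                saved += 1
    return saved
```

Calling `extract_shard` once per shard with the same `index_csv` appends to a single index file. If keys repeat across shards, the saved image names would collide, so a per-shard output folder may be safer.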