nishadsinghi / CleanCLIP

Official PyTorch implementation of "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning" @ ICCV 2023
https://arxiv.org/abs/2303.03323
MIT License

how to use the utils/download.py #4

Closed t1307109256 closed 10 months ago

t1307109256 commented 10 months ago

Hello author, I am a novice and would like to ask how to use the utils/download.py script to download the images from their URLs for CC3M and/or CC12M. Can you give me an example?

nishadsinghi commented 10 months ago

Hi! Thanks for your interest in our work. You can follow these steps to download CC3M:

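  1. wget -O GCC-training.tsv "https://storage.cloud.google.com/gcc-data/Train/GCC-training.tsv?_ga=2.191230122.-1896153081.1529438250" (quote the URL so the shell does not interpret the "?", and pass -O so the file is saved as GCC-training.tsv rather than under a name that includes the query string)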
  2. Run download.py -f GCC-training.tsv -d

It takes a few days to download the entire dataset.
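
For readers who want to see roughly what such a download step involves, here is a minimal Python sketch. It is not the repository's utils/download.py. It assumes the TSV has no header row and holds a caption in the first column and an image URL in the second. The function names, the index.csv file and its "image"/"caption" columns are hypothetical, chosen only for illustration.

```python
import csv
import http.client
import os
import urllib.request


def read_caption_url_rows(tsv_path):
    """Yield (caption, url) pairs from a tab-separated file with no header.

    Assumption: column 0 is the caption and column 1 is the image URL.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                yield row[0], row[1]


def download_images(tsv_path, out_dir, limit=None, timeout=10):
    """Fetch each URL into out_dir and record successes in out_dir/index.csv.

    Returns the number of images saved. Failed downloads are skipped,
    since some links in a web-scraped dataset are usually dead.
    """
    os.makedirs(out_dir, exist_ok=True)
    index_path = os.path.join(out_dir, "index.csv")  # hypothetical file name
    saved = 0
    with open(index_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "caption"])  # hypothetical column names
        for i, (caption, url) in enumerate(read_caption_url_rows(tsv_path)):
            if limit is not None and i >= limit:
                break
            path = os.path.join(out_dir, f"{i:08d}.jpg")
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    data = resp.read()
            except (OSError, ValueError, http.client.HTTPException):
                continue  # unreachable or malformed URL: skip this row
            with open(path, "wb") as img:
                img.write(data)
            writer.writerow([path, caption])
            saved += 1
    return saved
```

Calling download_images("GCC-training.tsv", "cc3m_images", limit=100) first is a cheap way to check the setup before committing to the full multi-day run. The sketch is sequential for clarity; a real downloader would likely fetch in parallel.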

Closing this for now. Feel free to reopen or comment if you have further questions :)

t1307109256 commented 7 months ago

Thanks, but I have another question: how do I download the validation set? The pre-training command requires a validation set:

  python -m src.main --name exp1 --train_data <path to (poisoned) train csv file> --validation_data --image_key <column name of the image paths in the train/validation csv file> --caption_key <column name of the captions in the train/validation csv file> --device_ids 0 1 2 3 --distributed

nishadsinghi commented 7 months ago

This is just the ImageNet validation set; does that answer the question?

t1307109256 commented 6 months ago

However, there is no caption column in the labels.csv file of the ImageNet validation set, and an error will be reported when setting --caption_key to caption.
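
An error of this kind is consistent with the requested key not being present in the CSV header. A quick way to confirm which columns a file actually has, before passing names to --image_key and --caption_key, is a small check like the one below. This is only a sketch: the helper name missing_columns is hypothetical and not part of the CleanCLIP code, and the column names in the usage line are examples.

```python
import csv


def missing_columns(csv_path, required):
    """Return the names in `required` that do not appear in the CSV header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # empty file -> empty header
    return [name for name in required if name not in header]
```

For example, missing_columns("labels.csv", ["image", "caption"]) returns the subset of those two names that the header lacks, so an empty list means both keys are usable.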

nishadsinghi commented 6 months ago

Hey! I checked the code and it looks like you may not need to specify validation data. Can you try removing --validation_data from the command and running it again?