nishadsinghi / CleanCLIP

Official PyTorch implementation of "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning" @ ICCV 2023
https://arxiv.org/abs/2303.03323
MIT License

The generated file is missing some image paths #5

Closed tongzhang111 closed 9 months ago

tongzhang111 commented 9 months ago

I found that some rows in the generated file (train.csv) do not have a corresponding image path.

nishadsinghi commented 9 months ago

Hey, @tongzhang111! Thanks for your interest in our work. The behaviour you described is a bit surprising and I'm not sure why this happened. Could you please count the number of rows in the .csv file where a valid path is present? Also, what is the number of files inside the images folder? If both numbers are in the ballpark of 3M, then you probably don't need to worry. In that case, I would just modify train.csv to remove the rows that don't have a path in the first column. But if a lot of paths are missing, then we might have to investigate further.
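
A minimal sketch of the check described above, in case it helps. It is not part of the repository: the function name `summarize_and_filter`, the output file name `train_clean.csv`, and the `images` folder name are hypothetical, and it assumes train.csv is comma-separated with a header row and the image path in the first column. It counts the rows that have a path, counts the files under the images folder, and writes a copy of the CSV without the rows whose first column is empty.

```python
import csv
import os


def summarize_and_filter(csv_path, images_dir, out_path):
    """Count rows with/without a path and files on disk; write a filtered CSV.

    Assumes (not confirmed by the repository) a comma-separated file with a
    header row and the image path in the first column.
    """
    kept = 0
    dropped = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)

        # Copy the header row through unchanged, if the file has any rows.
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)

        for row in reader:
            # Keep a row only if its first column holds a non-blank path.
            if row and row[0].strip():
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1

    # Count every file under the images folder, including subfolders.
    n_files = sum(len(files) for _, _, files in os.walk(images_dir))
    return kept, dropped, n_files


if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever the download step wrote its output.
    kept, dropped, n_files = summarize_and_filter(
        "train.csv", "images", "train_clean.csv"
    )
    print(f"rows with a path: {kept}")
    print(f"rows without a path: {dropped}")
    print(f"files in images folder: {n_files}")
```

The original train.csv is left untouched, so the counts can be compared first and the filtered file swapped in only if both numbers look reasonable.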

tongzhang111 commented 9 months ago

I am very grateful for your reply. Could you compress the data you obtained and share a link, so that I can get the training data from it?

nishadsinghi commented 9 months ago

The data is really big (it usually takes 2-3 days to download) so I don't think it is practically possible for me to host it somewhere. Were you able to check how many files were downloaded?

tongzhang111 commented 9 months ago

I checked earlier and there were indeed relatively few images. I think it may be due to network issues. I will try downloading the data again today.

nishadsinghi commented 9 months ago

Yes, that is what I would have also suggested. Please feel free to comment again if needed :) Also, in case it wasn't clear, here is how you can download the dataset:

  1. wget -O GCC-training.tsv "https://storage.cloud.google.com/gcc-data/Train/GCC-training.tsv?_ga=2.191230122.-1896153081.1529438250"
  2. Run download.py -f GCC-training.tsv -d