kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

Dealing with Corrupted Images in CC3M #25

Closed ziqipang closed 1 year ago

ziqipang commented 1 year ago

Hi! Thank you for your nice work!

I am new to CC3M and wonder what the standard practice is for handling corrupted images. For example, I ran into a lot of errors like Error reading xxx.png with caption xxx: cannot identify image file xxx.

Perhaps the following questions are a little naive, but I am curious: (1) is this normal when directly running Fromage? (2) should I simply exclude such images from the training and validation sets, or what is the standard practice?

Thank you so much for your time!

kohjingyu commented 1 year ago

(1) is this normal when I am directly running Fromage?

This is most likely not an issue with Fromage itself. The exception being thrown is here, and it most likely comes from the PIL Image.open call. Some things to check:
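One quick way to check is to test whether PIL can actually decode each downloaded file. This is a minimal sketch, not code from the Fromage repo; the `is_valid_image` helper name is illustrative:

```python
# Sketch: check whether a downloaded file is a decodable image.
# The "cannot identify image file" message comes from PIL raising
# UnidentifiedImageError (a subclass of OSError) inside Image.open.
from PIL import Image, UnidentifiedImageError


def is_valid_image(path: str) -> bool:
    """Return True if PIL can open and fully decode the file."""
    try:
        with Image.open(path) as img:
            img.convert("RGB")  # force a full decode, not just a header parse
        return True
    except (UnidentifiedImageError, OSError):
        return False
```

Running this over a sample of the files that error out should confirm whether they are truly corrupted downloads rather than something in the data loader.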

(2) should I directly exclude such images from the training and validation sets, or what is the standard practice?

Typically these images are simply removed, assuming they are not a significant fraction of the data. This is what we do for the missing images in our CC3M training set (around 200K were missing when I downloaded CC3M; see footnote 3 on page 4 of the paper). This is somewhat unavoidable because CC3M image URLs go dead over time, so depending on when you downloaded it, there will be fewer than 3.3M images available. I would also check your download script to make sure that you are not saving invalid files as images.
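Removing the bad rows could look something like the sketch below. The TSV layout (caption, tab, filename) and the `filter_annotations` helper are assumptions for illustration, not Fromage's exact annotation format:

```python
# Sketch: keep only annotation rows whose image opens cleanly.
# Assumed input format: one "caption<TAB>filename" pair per line.
import os
from PIL import Image


def filter_annotations(tsv_in: str, tsv_out: str, image_dir: str) -> int:
    """Write rows whose image decodes successfully; return how many were kept."""
    kept = 0
    with open(tsv_in) as fin, open(tsv_out, "w") as fout:
        for line in fin:
            caption, fname = line.rstrip("\n").split("\t")
            path = os.path.join(image_dir, fname)
            try:
                with Image.open(path) as img:
                    img.convert("RGB")  # full decode catches truncated files too
            except OSError:
                continue  # skip missing or corrupted images
            fout.write(line)
            kept += 1
    return kept
```

Running this once before training means the data loader never touches a file that PIL cannot decode.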

Hope that helps! Please let me know what you find.

ziqipang commented 1 year ago

@kohjingyu So thrilled to receive your prompt reply!

I checked the problems you mentioned, and here are some preliminary results. The numbers below are for the validation set.

In brief, these images are indeed in the wrong format (they cannot be opened). After excluding them, I end up with 12,760 valid images in the validation set. Does this number look normal to you?

Thank you for your time and help! Really appreciate it!

kohjingyu commented 1 year ago

Thanks for confirming!

That number looks reasonable; I got around 13K images for the val set when I downloaded it late last year. Most papers (including ours) also evaluate mostly on datasets built on MS-COCO or similar images, which don't get removed, so evaluation won't really be affected.

ziqipang commented 1 year ago

@kohjingyu Great! Thanks for the help and confirmation!