kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

Dealing with Corrupted Images in CC3M #25

Closed ziqipang closed 1 year ago

ziqipang commented 1 year ago

Hi! Thank you for your nice work!

I am new to CC3M and wonder what the standard practice is for handling corrupted images. For example, I ran into a lot of errors like Error reading xxx.png with caption xxx: cannot identify image file xxx.

Perhaps the following questions are a little naive, but I am curious: (1) is this normal when directly running Fromage? (2) should I simply exclude such images from the training and validation sets, or what is the standard practice?

Thank you so much for your time!

kohjingyu commented 1 year ago

(1) is this normal when I am directly running Fromage?

This is most likely not an issue with Fromage itself. The exception being thrown is here, and it most likely comes from the PIL Image.open call. Some things to check:
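One quick way to check is to test whether PIL can actually decode each downloaded file. This is a minimal sketch, not code from the Fromage repo; the `is_valid_image` helper name is illustrative:

```python
# Sketch: check whether a downloaded file is a decodable image.
# The "cannot identify image file" message comes from PIL raising
# UnidentifiedImageError (a subclass of OSError) inside Image.open.
from PIL import Image, UnidentifiedImageError


def is_valid_image(path: str) -> bool:
    """Return True if PIL can open and fully decode the file."""
    try:
        with Image.open(path) as img:
            img.convert("RGB")  # force a full decode, not just a header parse
        return True
    except (UnidentifiedImageError, OSError):
        return False
```

Running this over a sample of the files that error out should confirm whether they are truly corrupted downloads rather than something in the data loader.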

(2) should I directly exclude such images from the training and validation sets, or what is the standard practice?

Typically these images are simply removed, assuming they are not a significant fraction of the data. This is what we do for the missing images in our CC3M training set (around 200K were missing when I downloaded CC3M; see footnote 3 on page 4 of the paper). This is somewhat unavoidable because CC3M image URLs go dead over time, so depending on when you downloaded it, there will be fewer than 3.3M images available. I would also check your download script to make sure that you are not saving invalid files as images.
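Removing the bad rows could look something like the sketch below. The TSV layout (caption, tab, filename) and the `filter_annotations` helper are assumptions for illustration, not Fromage's exact annotation format:

```python
# Sketch: keep only annotation rows whose image opens cleanly.
# Assumed input format: one "caption<TAB>filename" pair per line.
import os
from PIL import Image


def filter_annotations(tsv_in: str, tsv_out: str, image_dir: str) -> int:
    """Write rows whose image decodes successfully; return how many were kept."""
    kept = 0
    with open(tsv_in) as fin, open(tsv_out, "w") as fout:
        for line in fin:
            caption, fname = line.rstrip("\n").split("\t")
            path = os.path.join(image_dir, fname)
            try:
                with Image.open(path) as img:
                    img.convert("RGB")  # full decode catches truncated files too
            except OSError:
                continue  # skip missing or corrupted images
            fout.write(line)
            kept += 1
    return kept
```

Running this once before training means the data loader never touches a file that PIL cannot decode.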

Hope that helps! Please let me know what you find.

ziqipang commented 1 year ago

@kohjingyu So thrilled to receive your prompt reply!

I checked the problems you mentioned, and here are some preliminary results. The numbers below are for the validation set.

In brief, these images are indeed in the wrong format (they cannot be opened). After excluding them, I end up with 12,760 valid images in the validation set. Does this number look normal to you?

Thank you for your time and help! Really appreciate it!

kohjingyu commented 1 year ago

Thanks for confirming!

That number looks reasonable; I got around 13K images for the val set when I downloaded it late last year. Most papers (including ours) also evaluate mostly on datasets built on MS-COCO or similar images, which don't get removed, so evaluation won't really be affected.

ziqipang commented 1 year ago

@kohjingyu Great! Thanks for the help and confirmation!