salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

pretraining datasets json files #2

Closed jayleicn closed 3 years ago

jayleicn commented 3 years ago

Hi @LiJunnan1992,

Congrats on your great work, and thanks for releasing the code!! To help reproduce the pretraining experiments, could you also release the json files for the pretraining datasets? Thanks!

Best, Jie

LiJunnan1992 commented 3 years ago

Hi,

Here are the dataset json files (I've also updated the readme): https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/json_pretrain.zip. For each json file, you will need to change the image paths to match your own directory.
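To adapt the released json files, something like the following sketch can rewrite the image paths in bulk. This assumes each pretraining json is a list of entries with an `"image"` key holding the path (adjust the key and path prefixes if your files differ); `rewrite_image_paths` is a hypothetical helper name, not part of the ALBEF codebase.

```python
import json

def rewrite_image_paths(json_path, old_root, new_root, out_path):
    """Replace the leading image-directory prefix in every annotation.

    Assumes the json file is a list of dicts with an "image" field;
    adjust the key name if your files are structured differently.
    """
    with open(json_path) as f:
        annotations = json.load(f)
    for ann in annotations:
        # Point each image path at the local image directory.
        ann["image"] = ann["image"].replace(old_root, new_root, 1)
    with open(out_path, "w") as f:
        json.dump(annotations, f)
```

For example, `rewrite_image_paths("coco.json", "/export/share/images", "/my/data/images", "coco_local.json")` would retarget every path in one file.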

Thanks!

jayleicn commented 3 years ago

Thanks! This is very helpful. A related question: I noticed the released VG captions only contain 769K captions (see Table 8 in this work, screenshot 1 below), while UNITER has 5M VG captions (see Table 1 in UNITER, screenshot 2 below). Was any filtering applied to obtain the 769K captions from the 5M captions? Could you elaborate on this process?

(Screenshot 1: Table 8 of this work. Screenshot 2: Table 1 of UNITER.)

LiJunnan1992 commented 3 years ago

Yes, there are 4 filters for VG:

  1. remove samples that occur in the evaluation sets of COCO or RefCOCO+
  2. remove duplicate sentences for each image
  3. remove sentences whose corresponding region covers less than 20% of the image's area
  4. remove sentences with fewer than 4 words
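The four filters above could be sketched roughly as follows. This is an illustrative reconstruction, not the actual ALBEF preprocessing code: the field names (`image_id`, `caption`, `region_w`, etc.), the `keep_caption` helper, and the `seen`/`eval_image_ids` bookkeeping are all assumptions about how the VG region annotations might be represented.

```python
def keep_caption(sample, eval_image_ids, seen,
                 min_area_ratio=0.2, min_words=4):
    """Return True if a VG region caption survives all four filters.

    sample: dict with hypothetical fields image_id, caption,
            region_w/region_h (region box size), image_w/image_h.
    eval_image_ids: image ids appearing in COCO / RefCOCO+ eval sets.
    seen: mutable set used to detect duplicate (image, sentence) pairs.
    """
    # 1. remove samples that occur in the evaluation sets
    if sample["image_id"] in eval_image_ids:
        return False
    # 2. remove duplicate sentences for each image
    key = (sample["image_id"], sample["caption"])
    if key in seen:
        return False
    seen.add(key)
    # 3. remove sentences whose region covers <20% of the image area
    region_area = sample["region_w"] * sample["region_h"]
    image_area = sample["image_w"] * sample["image_h"]
    if region_area < min_area_ratio * image_area:
        return False
    # 4. remove sentences with fewer than 4 words
    if len(sample["caption"].split()) < min_words:
        return False
    return True
```

Applying this predicate over the 5M raw VG region captions would yield a filtered subset in the spirit of the 769K described above.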
jayleicn commented 3 years ago

Thanks for the prompt reply, I am closing this issue.

jayleicn commented 3 years ago

> Yes, there are 4 filters for VG:
>
>   1. remove samples that occur in the evaluation sets of COCO or RefCOCO+
>   2. remove duplicate sentences for each image
>   3. remove sentences whose corresponding region covers less than 20% of the image's area
>   4. remove sentences with fewer than 4 words

Hi @LiJunnan1992, I have two follow-up questions on filters 3 and 4: (1) What is the intuition behind using them? (2) Did you run experiments comparing results with and without them?

Thanks!

LiJunnan1992 commented 3 years ago

Because we perform random image cropping during training, filter (3) reduces the chance that the text describes a region that gets cropped out of the image. Filter (4) aims to remove less informative text.

Our experiments do not show any significant difference, so we keep the filters, since removing those samples reduces training time.

jayleicn commented 3 years ago

This makes sense. Thanks!