salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Reproducing the pretrain results on COCO+VG+CC+SBU #124

Closed: dyashuni closed this issue 1 year ago

dyashuni commented 1 year ago

Hi @LiJunnan1992, thank you for the great work!

I'm trying to reproduce the pretraining on the CC + COCO + SBU + VG dataset, but I get higher losses than the ones you reported in https://github.com/salesforce/BLIP/issues/19#issuecomment-1046398252. I use the following datasets:

  1. COCO (https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json)
  2. Visual Genome (https://storage.googleapis.com/sfr-vision-language-research/datasets/vg_caption.json)
  3. CC3M+CC12M+SBU: Filtered web caption (https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_filtered.json)
  4. CC3M+CC12M+SBU: Filtered synthetic caption by ViT-L (https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json)

I didn't balance these datasets. I took the pretrain YAML config from https://github.com/salesforce/BLIP/blob/main/configs/pretrain.yaml and added the new datasets to the training, roughly as sketched below.
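
For reference, the relevant part of the config looks roughly like this (a sketch of the train_file list only, assuming the default pretrain.yaml layout; the local paths are placeholders):

```yaml
# configs/pretrain.yaml (excerpt) -- train_file extended with the extra annotation files.
# Paths below are placeholders for wherever the downloaded JSON files are stored.
train_file:
  - /path/to/annotations/coco_karpathy_train.json            # COCO
  - /path/to/annotations/vg_caption.json                      # Visual Genome
  - /path/to/annotations/ccs_filtered.json                    # CC3M+CC12M+SBU, filtered web captions
  - /path/to/annotations/ccs_synthetic_filtered_large.json    # CC3M+CC12M+SBU, filtered synthetic captions (ViT-L)
```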

Could you please share your YAML config for pretraining on the CC + COCO + SBU + VG dataset?

LiJunnan1992 commented 1 year ago

Hi @dyashuni, thanks for your interest. Could you take a look at our LAVIS library (https://github.com/salesforce/LAVIS)? It supports BLIP pre-training, among other functions.

dyashuni commented 1 year ago

@LiJunnan1992 thank you, I will take a look at LAVIS

dyashuni commented 1 year ago

Hi @LiJunnan1992! I finetuned 3 pretrained models on the COCO captioning task using train_caption.py. I used 32 GPUs for pretraining.

  1. BLIP w/ ViT-B 14M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_14M.pth
  2. BLIP w/ ViT-B 129M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth
  3. BLIP w/ ViT-B and CapFilt-L 129M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth

I got the following metrics:

  1. BLIP w/ ViT-B 14M images: "val_Bleu_4": 0.403 "val_CIDEr": 1.324
  2. BLIP w/ ViT-B 129M images: "val_Bleu_4": 0.397 "val_CIDEr": 1.318
  3. BLIP w/ ViT-B and CapFilt-L 129M images: "val_Bleu_4": 0.403 "val_CIDEr": 1.338

BLIP w/ ViT-B 14M performs almost the same as BLIP w/ ViT-B and CapFilt-L 129M, which contradicts the published results...

How is this possible?

LiJunnan1992 commented 1 year ago

Could you reproduce BLIP's fine-tuning results if you use the same settings?

"I used 32 GPUs for pretraining." -> I assume that you mean "finetuning"?

dyashuni commented 1 year ago

I used your caption_coco.yaml config for finetuning, so I used your parameters. How many GPUs did you use for finetuning?

" I used 32 GPU for pretraining." -> I assume that you mean "finetuning"? Yes, thank you

LiJunnan1992 commented 1 year ago

I used 8 GPUs. With 32 GPUs, you should set batch_size=8 so that the total batch size remains 256.
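
Concretely, the change would look something like this (a sketch, assuming the per-GPU batch_size field from the default caption_coco.yaml):

```yaml
# caption_coco.yaml (excerpt) -- per-GPU batch size when finetuning on 32 GPUs.
# Effective batch size = num_gpus x batch_size: 32 x 8 = 256, matching 8 GPUs x 32.
batch_size: 8
```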

dyashuni commented 1 year ago

Thank you! I will try it.