salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Effect of Dataset size on Stage-1 and Stage-2 BLIP-2 pre-training #407

Open Swetha5 opened 1 year ago

Swetha5 commented 1 year ago

Hi,

Thank you very much for sharing the source code and model weights for BLIP-2. I have a general question about data scale for stage-1 and stage-2 training; it would be great to get your insights on this, @dxli94 @LiJunnan1992.

I have tried training BLIP-2 from scratch on a smaller dataset of ~5M samples due to resource constraints (with eva_clip_g as the visual encoder); COCO is part of the training dataset.

Why do you think the CIDEr score on COCO drops from 110 to 55 going from stage-1 to stage-2? I am using the same hyper-parameters provided in the pre-training configs.

Also, when stage-2 is trained longer (5 epochs), training completely collapses and the performance drops to 1.1 CIDEr on COCO. Did you also observe a similar effect with longer stage-2 training?

Note that the scores mentioned are obtained directly from the pre-trained models (without caption fine-tuning).
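
For reference, the captions are generated from the pre-trained checkpoint with the standard LAVIS API, roughly as below (a minimal sketch; the image path is a placeholder, and in my runs I load my own stage-2 checkpoint instead of the released `pretrain_flant5xl` weights):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained BLIP-2 FlanT5-XL model and its image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

# "coco_val_example.jpg" is a placeholder for a COCO validation image.
raw_image = Image.open("coco_val_example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a zero-shot caption (no caption fine-tuning).
print(model.generate({"image": image}))
```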

Any insights/feedback on this would be highly appreciated. Thanks!

LiJunnan1992 commented 1 year ago

Thanks for your questions.

It is unusual that the CIDEr performance drops by that much in stage-2. May I know if you have been using the prefix-LM loss for FlanT5-XL? Also, what do the generated captions look like?
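
For context, the prefix-LM objective splits each caption at a random point: the prefix is fed to the T5 encoder together with the Q-Former output, and the suffix is the generation target for the T5 decoder. A minimal sketch of the split (the function name is hypothetical, not the exact LAVIS implementation):

```python
import random

def prefix_lm_split(caption: str, min_prefix: int = 1):
    """Randomly split a caption into a prefix (encoder input, alongside
    the Q-Former query embeddings) and a suffix (decoder target)."""
    words = caption.split()
    cut = random.randint(min_prefix, max(min_prefix, len(words) - 1))
    return " ".join(words[:cut]), " ".join(words[cut:])

prefix, suffix = prefix_lm_split("a man riding a horse on the beach")
print(prefix, "->", suffix)  # e.g. "a man riding" -> "a horse on the beach"
```

If the plain LM loss is used instead for an encoder-decoder model, the training signal differs from what the stage-2 config expects, which could explain degraded captions.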