Thank you very much for sharing the source code and model weights for BLIP-2. I have a general question about data scale for stage-1 and stage-2 training; it would be great to get your insights on this, @dxli94 @LiJunnan1992.
I have tried training BLIP-2 from scratch on a smaller dataset (~5M samples) due to resource constraints, with eva_clip_g as the visual encoder. COCO is part of the training dataset.
The stage-1 performance was >110 CIDEr on COCO captioning and >40 CIDEr on NoCaps.
However, for stage-2 pre-training on the same dataset with the FlanT5-XL model (initializing from the stage-1 checkpoint), the COCO performance drops to ~55 CIDEr.
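Concretely, the stage-2 model is warm-started roughly as below (a minimal sketch assuming the LAVIS loaders; the checkpoint path is a placeholder, and in the actual runs I point the `pretrained` field of the stage-2 pre-training config at this same file):

```python
from lavis.models import load_model

# Build the stage-2 model (FlanT5-XL variant) and warm-start it from
# my stage-1 Q-Former checkpoint. Path is a placeholder.
model = load_model(name="blip2_t5", model_type="pretrain_flant5xl")
model.load_checkpoint("/path/to/stage1_checkpoint.pth")
```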
Why do you think the CIDEr performance drops from 110 to 55 on COCO going from stage-1 to stage-2? I am using the same hyper-parameters provided in the pre-training configs.
Also, when trained longer (5 epochs), it completely collapses and the performance drops to 1.1 CIDEr on COCO. Did you observe a similar effect with longer stage-2 training?
Note that the scores mentioned are directly from the pre-trained models (without caption fine-tuning).
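For reference, this is roughly how I score the raw checkpoints (a minimal sketch using pycocoevalcap; image IDs and captions are placeholders, and the real evaluation runs over the full COCO/NoCaps validation sets):

```python
# gts holds the COCO reference captions per image; res holds one
# generated caption per image from the raw pre-trained model.
from pycocoevalcap.cider.cider import Cider

gts = {"img_1": ["a dog runs on the beach", "a brown dog running by the sea"]}
res = {"img_1": ["a dog running on a beach"]}

score, per_image_scores = Cider().compute_score(gts, res)
print(f"CIDEr: {score * 100:.1f}")  # reported on the usual x100 scale
```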
Any insights/feedback on this would be highly appreciated. Thanks!
It is unusual that the CIDEr performance drops by that much in stage-2. May I know if you have been using the prefix-LM loss for FlanT5-XL? Also, what do the generated captions look like?
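For context, the prefix-LM loss splits each caption so that the first part conditions the T5 encoder (together with the Q-Former query embeddings) and the second part is the generation target for the decoder. A rough illustrative sketch (hypothetical helper, not the exact LAVIS code):

```python
import random

def prefix_lm_split(caption: str):
    """Split a caption for the prefix-LM objective: the prefix is fed to
    the T5 encoder alongside the visual query embeddings, and the suffix
    is the supervised target for the T5 decoder."""
    words = caption.split()
    cut = random.randint(1, max(1, len(words) - 1))  # random split point
    prefix = " ".join(words[:cut])   # conditions the encoder
    suffix = " ".join(words[cut:])   # decoder generation target
    return prefix, suffix

print(prefix_lm_split("a brown dog runs across a grassy field"))
# e.g. ("a brown dog", "runs across a grassy field")
```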