Thank you very much for sharing the source code and model weights for BLIP-2. I have a general question about data scale for stage-1 and stage-2 training; it would be great to get your insights on this, @dxli94 @LiJunnan1992.
I have tried training BLIP-2 from scratch on a smaller dataset (~5M samples) due to resource constraints, with eva_clip_g as the visual encoder. COCO is part of the training dataset.
The stage-1 performance was >110 CIDEr on COCO captioning and >40 CIDEr on NoCaps.
However, for stage-2 pre-training on the same dataset with the FlanT5-XL model (initializing from the stage-1 checkpoint), the COCO performance drops to ~55 CIDEr.
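Concretely, the stage-2 model is warm-started roughly as below (a minimal sketch assuming the LAVIS loaders; the checkpoint path is a placeholder, and in the actual runs I point the `pretrained` field of the stage-2 pre-training config at this same file):

```python
from lavis.models import load_model

# Build the stage-2 model (FlanT5-XL variant) and warm-start it from
# my stage-1 Q-Former checkpoint. Path is a placeholder.
model = load_model(name="blip2_t5", model_type="pretrain_flant5xl")
model.load_checkpoint("/path/to/stage1_checkpoint.pth")
```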
Why do you think the CIDEr performance drops from 110 to 55 on COCO going from stage-1 to stage-2? I am using the same hyper-parameters provided in the pre-training configs.
Also, when trained longer (5 epochs), it completely collapses and the performance drops to 1.1 CIDEr on COCO. Did you observe a similar effect with longer stage-2 training?
Note that the scores mentioned are directly from the pre-trained models (without caption fine-tuning).
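For reference, this is roughly how I score the raw checkpoints (a minimal sketch using pycocoevalcap; image IDs and captions are placeholders, and the real evaluation runs over the full COCO/NoCaps validation sets):

```python
# gts holds the COCO reference captions per image; res holds one
# generated caption per image from the raw pre-trained model.
from pycocoevalcap.cider.cider import Cider

gts = {"img_1": ["a dog runs on the beach", "a brown dog running by the sea"]}
res = {"img_1": ["a dog running on a beach"]}

score, per_image_scores = Cider().compute_score(gts, res)
print(f"CIDEr: {score * 100:.1f}")  # reported on the usual x100 scale
```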
Any insights/feedback on this would be highly appreciated. Thanks!
It is unusual that the CIDEr performance drops by that much in stage-2. May I know if you have been using the prefix-LM loss for FlanT5-XL? Also, what do the generated captions look like?
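For context, the prefix-LM loss splits each caption so that the first part conditions the T5 encoder (together with the Q-Former query embeddings) and the second part is the generation target for the decoder. A rough illustrative sketch (hypothetical helper, not the exact LAVIS code):

```python
import random

def prefix_lm_split(caption: str):
    """Split a caption for the prefix-LM objective: the prefix is fed to
    the T5 encoder alongside the visual query embeddings, and the suffix
    is the supervised target for the T5 decoder."""
    words = caption.split()
    cut = random.randint(1, max(1, len(words) - 1))  # random split point
    prefix = " ".join(words[:cut])   # conditions the encoder
    suffix = " ".join(words[cut:])   # decoder generation target
    return prefix, suffix

print(prefix_lm_split("a brown dog runs across a grassy field"))
# e.g. ("a brown dog", "runs across a grassy field")
```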