salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

BLIP-2 paper finetune replicate low performance: BLEU_4 score is 0.15 for finetuning COCO_caption #707

Open LuoyaoChen opened 1 month ago

LuoyaoChen commented 1 month ago

Hi,

First of all, thanks for the great work!

Issue I encountered:

I am trying to replicate Table 3 of the BLIP-2 paper. [screenshot of Table 3]

I.e., I ran COCO captioning fine-tuning using the script: bash LAVIS/run_scripts/blip2/train/train_caption_coco_from_scratch.sh. I fine-tuned for 5 epochs with batch size 256 and freeze_vit = False. However, the fine-tuning loss plateaus at around 1.76 and I obtained BLEU_4 = 0.158, whereas the paper reports BLEU@4 = 43.5.

How I got this

Before getting this low performance, I encountered and debugged this warning in opt_model.generate():

Input length of input_ids is 0, but `max_length` is set to -6. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.

After inspection, the warning came from this line: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_opt.py#L226 where the input prompt length is 4 and num_query = 32; concatenated, the input is 36 tokens. The yaml, however, sets max_len = 30, so the remaining generation budget is 30 - 36 = -6. I changed max_len to 40 and the warning went away.
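The arithmetic behind the warning can be sketched as follows (variable names are illustrative, mirroring the numbers in this issue rather than the LAVIS source):

```python
# Sketch of the length budget that triggers the generate() warning.
num_query_tokens = 32   # Q-Former query embeddings prepended to the OPT input
prompt_len = 4          # tokenized text prompt length reported above
max_len = 30            # caption max_len from the yaml config

total_input_len = num_query_tokens + prompt_len  # 36 tokens fed to the LM
remaining = max_len - total_input_len            # 30 - 36 = -6 -> the warning
print(remaining)
```

Because `max_length` counts the input tokens as well, the query embeddings silently eat into the caption budget; raising max_len (or switching to `max_new_tokens`, which counts only generated tokens) avoids the negative budget.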

My question is

I suspect this bug was the fundamental issue that led to BLEU_4 = 0.15. Otherwise, could you please point out what I should change in order to replicate Table 3?

Thank you! I appreciate your reply and help.

WeichengDai1 commented 1 month ago

Hello, I faced the same problem. Could the authors take a look at it? Thank you so much!

lxr-1204 commented 1 month ago

Hello, I ran the COCO captioning fine-tuning with the script bash run_scripts/blip2/train/train_caption_coco.sh (not LAVIS/run_scripts/blip2/train/train_caption_coco_from_scratch.sh) and changed the batch size to 32. I got these results:

{"val": {"Bleu_1": 0.8238268241644443, "Bleu_2": 0.6831525165191635, "Bleu_3": 0.5459414870155539, "Bleu_4": 0.42831303869532394, "METEOR": 0.30180281319130425, "ROUGE_L": 0.6091627079501748, "CIDEr": 1.374218736104006, "SPICE": 0.23455270984413482}}
{"val": {"Bleu_1": 0.8298774354937364, "Bleu_2": 0.6908474983471536, "Bleu_3": 0.5532925604069465, "Bleu_4": 0.43515568592502296, "METEOR": 0.30497520102164544, "ROUGE_L": 0.613052542301079, "CIDEr": 1.3963855142315682, "SPICE": 0.23759949992763962}}
{"val": {"Bleu_1": 0.8302620239059283, "Bleu_2": 0.6921764063375653, "Bleu_3": 0.5552159289224236, "Bleu_4": 0.43842650265562216, "METEOR": 0.3060731453972931, "ROUGE_L": 0.6135836343823102, "CIDEr": 1.407976274775359, "SPICE": 0.23905837952464407}}
{"val": {"Bleu_1": 0.8274924247946817, "Bleu_2": 0.6931325074722023, "Bleu_3": 0.5597664645378789, "Bleu_4": 0.4443899752112241, "METEOR": 0.3066574635742406, "ROUGE_L": 0.6172149598445694, "CIDEr": 1.4154786962527268, "SPICE": 0.23882330241316768}}
{"val": {"Bleu_1": 0.8328429952300106, "Bleu_2": 0.6960647900571959, "Bleu_3": 0.5604930980057795, "Bleu_4": 0.4433169242234679, "METEOR": 0.3082065717220316, "ROUGE_L": 0.6171616222938877, "CIDEr": 1.4218121069160456, "SPICE": 0.2402601718946845}}
{"test": {"Bleu_1": 0.8293025112126673, "Bleu_2": 0.6928262979474319, "Bleu_3": 0.5591914780467612, "Bleu_4": 0.4419784673449433, "METEOR": 0.30831059059342697, "ROUGE_L": 0.6176090226726898, "CIDEr": 1.4323682191311553, "SPICE": 0.24242663638610848}}

I don't know whether this counts as success? 🧸 Looking forward to your reply!

WeichengDai1 commented 1 month ago

> (quoting lxr-1204's results above)

Wow, this is a really good result. Did you face the max_length problem? Or could you please show the shape of attention_mask, as posted above? Thank you!

lxr-1204 commented 1 month ago

No, I didn't face the max_length problem. My shapes are atts_opt.shape = [1, 32], opt_tokens.attention_mask.shape = [1, 4], and attention_mask.shape = [1, 36].
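For anyone comparing against their own run, these shapes correspond to the mask concatenation in blip2_opt.py; a minimal sketch (variable names mirror this thread, not necessarily the exact LAVIS code) looks like:

```python
import torch

# Mask for the 32 Q-Former query tokens prepended to the OPT input.
atts_opt = torch.ones(1, 32, dtype=torch.long)
# Mask for the 4 tokenized prompt tokens.
opt_attention_mask = torch.ones(1, 4, dtype=torch.long)

# blip2_opt.py concatenates the two along the sequence dimension.
attention_mask = torch.cat([atts_opt, opt_attention_mask], dim=1)
print(attention_mask.shape)  # torch.Size([1, 36])
```

If your attention_mask's last dimension exceeds the configured max_len, the negative-budget warning from earlier in this thread will appear.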

WeichengDai1 commented 1 month ago

> (quoting lxr-1204's reply above)

Sounds good. Thank you so much for your reply!

LuoyaoChen commented 1 month ago

Hi, @lxr-1204 !

Thank you so much for your reply! It is encouraging to know that your approach works. There are two differences I can imagine that might have caused my low performance:

  1. Could you check what your transformers version is? I am currently using 4.42.0.dev.
  2. Also, could you tell me what type of GPUs, and how many, you fine-tuned on? I fine-tuned on 4 A100 (40 GB each) GPUs, each with batch size 64, to match the paper's fine-tuning batch size of 256.

Thank you!

lxr-1204 commented 1 month ago

Hello, @LuoyaoChen

  1. My transformers version is 4.33.2.
  2. I fine-tuned on 4 L20 (48 GB each) GPUs, each with batch size 32. This may be why I haven't reached the same level as the authors.

By the way, I don't see your LAVIS/run_scripts/blip2/train/train_caption_coco_from_scratch.sh script in the repo.
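Since the version mismatch turned out to matter here, a small runtime guard can catch it before a multi-hour fine-tune. This is a hypothetical helper (the pinned version comes from this thread, not from LAVIS's requirements):

```python
from importlib.metadata import PackageNotFoundError, version

def check_transformers(wanted: str = "4.33.2") -> str:
    """Report whether the installed transformers matches the version
    that reproduced the paper's caption numbers in this thread."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if installed != wanted:
        return f"installed {installed}; thread results used {wanted}"
    return "version matches"

print(check_transformers())
```

Running this (or simply `pip show transformers`) before training would have flagged the 4.42.0.dev install that produced the degraded BLEU scores.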

LuoyaoChen commented 1 month ago

@lxr-1204

  1. Thank you!! Yes, I downgraded to 4.33.2 and BLEU_4 increased. FYI, I used batch size 64 on four A100s, fine-tuned for 5 epochs, and my scores are similar to yours, so batch size might not be the issue. The scores are also close enough to the paper's (test-set BLEU@4 = 43.5), I guess?
    {"val": {"Bleu_1": 0.8211486028870585, "Bleu_2": 0.6786320500852762, "Bleu_3": 0.5412082052650652, "Bleu_4": 0.42445955867868645, "METEOR": 0.3000809459398856, "ROUGE_L": 0.6023497159599724, "CIDEr": 1.3504911582924761, "SPICE": 0.2336149951256639}}
    {"val": {"Bleu_1": 0.8239893849887242, "Bleu_2": 0.6832399775841378, "Bleu_3": 0.5456446406373248, "Bleu_4": 0.428344329322689, "METEOR": 0.30151605730722364, "ROUGE_L": 0.6080119244575778, "CIDEr": 1.3715624253497698, "SPICE": 0.23541176362446695}}
    {"val": {"Bleu_1": 0.8255328595145602, "Bleu_2": 0.6868702611201942, "Bleu_3": 0.5519031726420236, "Bleu_4": 0.4354463971880122, "METEOR": 0.3033446173121947, "ROUGE_L": 0.6107641429094333, "CIDEr": 1.3877831392810822, "SPICE": 0.235643357884401}}
    {"val": {"Bleu_1": 0.8279085623707302, "Bleu_2": 0.689315763607898, "Bleu_3": 0.5542196085565972, "Bleu_4": 0.4393303273526725, "METEOR": 0.30652090677813526, "ROUGE_L": 0.6141870544175783, "CIDEr": 1.4020532681341553, "SPICE": 0.23846326296514572}}
    {"val": {"Bleu_1": 0.8318979270412391, "Bleu_2": 0.6933581693794059, "Bleu_3": 0.5580103441892669, "Bleu_4": 0.442200620617155, "METEOR": 0.3089680987367561, "ROUGE_L": 0.6158630217893815, "CIDEr": 1.418023784164105, "SPICE": 0.24063981240115964}}
    {"test": {"Bleu_1": 0.8305294220120865, "Bleu_2": 0.6912970605644893, "Bleu_3": 0.5565461847006546, "Bleu_4": 0.4406486648331637, "METEOR": 0.3095835115295356, "ROUGE_L": 0.6167463265705526, "CIDEr": 1.4311418829460887, "SPICE": 0.24323871537233485}}

I was replicating the pretraining stages too, so I renamed the .sh files to load my own pre-trained checkpoints; the contents are otherwise the same.

Thank you again for sharing!