salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Some differences between paper and code in BLIP-2 text generation #342

Status: Open · xieck13 opened this issue 1 year ago

xieck13 commented 1 year ago

https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_qformer.py#L242

Hello,

I was going through the code in the LAVIS repository and noticed that in the blip2_qformer.py file, line 242, the text generation task seems to use a "Bi-directional Self-Attention Mask" instead of the "Causal Self-Attention Mask" described in the BLIP-2 paper. Could you please clarify whether this is indeed the case, or whether there is another explanation?

Thank you.
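
For reference, here is a minimal PyTorch sketch (not taken from the LAVIS code) of the two mask types under discussion: a bi-directional mask lets every token attend to every other token, while a causal mask restricts each position to itself and earlier positions.

```python
import torch

seq_len = 4

# Bi-directional self-attention mask: every position can attend to
# every other position (1 = attention allowed).
bidirectional = torch.ones(seq_len, seq_len)

# Causal self-attention mask: position i can only attend to positions <= i,
# which is what autoregressive text generation requires.
causal = torch.tril(torch.ones(seq_len, seq_len))

print(causal)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```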

NielsRogge commented 1 year ago

Hi,

The text generation does use a causal mask, but it's fairly well hidden in the code. The `attention_mask` gets turned into a causal one inside the language model, specifically here in the case where OPT is used as the language model.

The same thing happens for other language models, like T5 (only the decoder of T5 has a causal attention mask).
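
To illustrate the idea (this is a simplified sketch, not the actual transformers implementation): a decoder-only LM such as OPT combines the user-supplied 2-D padding mask with an internally built lower-triangular causal mask before computing attention scores.

```python
import torch

def to_causal_additive_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    Returns a (batch, 1, seq_len, seq_len) additive mask: 0 where attention
    is allowed, a large negative number where it is blocked."""
    bsz, seq_len = attention_mask.shape
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # A key position may be attended to only if it is both non-padding and
    # not in the future relative to the query position.
    allowed = causal[None, None, :, :] & attention_mask[:, None, None, :].bool()
    mask = torch.zeros(bsz, 1, seq_len, seq_len)
    return mask.masked_fill(~allowed, torch.finfo(torch.float32).min)
```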

tgyy1995 commented 1 year ago

I had some trouble reproducing InstructBLIP results on the msvd_qa and msrvtt_qa datasets. Could you please tell me what prompt template and hyperparameters were used for these datasets? I raised a question here (#333), but so far no one has answered. @NielsRogge Thank you

xieck13 commented 1 year ago

@NielsRogge Hi, thanks for your reply~ The T5/OPT language models do indeed implement a causal attention mask in code. However, the BLIP-2 paper says that T5/OPT is only used in stage 2, while in stage 1 only the Q-Former is trained. It appears that a causal attention mask is not used in the stage-1 text generation task of the Q-Former.

Attention mask in the stage-1 text generation task: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_qformer.py#LL242C1-L250C1

YAML config for BLIP-2 stage-1 training: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/projects/blip2/train/pretrain_stage1.yaml#L7

Registry entry for BLIP-2: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_qformer.py#L25

LiJunnan1992 commented 1 year ago

You can find the conversion to a causal mask here: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/Qformer.py#L886
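
For readers following along, the `is_decoder` branch at that location builds the causal mask roughly like this (a simplified paraphrase; the real code also handles past key/values and the query tokens):

```python
import torch

def causal_extended_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Simplified version of the is_decoder=True path.
    attention_mask: (batch, seq_len) with 1 = keep, 0 = pad."""
    batch_size, seq_length = attention_mask.shape
    seq_ids = torch.arange(seq_length, device=attention_mask.device)
    # causal_mask[b, i, j] is True iff key position j <= query position i
    causal_mask = (
        seq_ids[None, None, :].repeat(batch_size, seq_length, 1)
        <= seq_ids[None, :, None]
    )
    extended = causal_mask[:, None, :, :].float() * attention_mask[:, None, None, :].float()
    # Convert to an additive mask applied to the attention scores.
    return (1.0 - extended) * -10000.0
```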

xieck13 commented 1 year ago

@LiJunnan1992 But in the captioning task, `is_decoder` is set to False (its default value) here: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_qformer.py#LL243C26-L243C33
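
If `is_decoder` is indeed left at False on that call path, the mask would go through the non-decoder branch, which only broadcasts the padding mask and adds no causal structure, roughly (again a simplified sketch, not the exact code):

```python
import torch

def bidirectional_extended_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Simplified non-decoder path: the (batch, seq_len) padding mask is only
    broadcast to (batch, 1, 1, seq_len), so every query position can attend
    to every non-padding position, i.e. attention stays bi-directional."""
    extended = attention_mask[:, None, None, :].float()
    return (1.0 - extended) * -10000.0
```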