xieck13 opened this issue 1 year ago
Hi,
The text generation does use a causal mask, but it's pretty hidden in the code. The attention_mask
gets turned into a causal one inside the language model, specifically here in case OPT is used as the language model.
The same thing happens for other language models, like T5 (only the decoder of T5 has a causal attention mask).
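For reference, this is roughly what that conversion amounts to in a decoder-only LM such as OPT: the 2D padding mask the caller passes in is expanded and combined with a lower-triangular causal mask before it reaches the attention layers. The sketch below is a minimal PyTorch illustration of that idea, not the exact HuggingFace implementation; the helper names are made up for this example.

```python
import torch

def make_causal_mask(seq_len, dtype=torch.float32):
    # Lower-triangular constraint: position i may only attend to positions <= i.
    # Disallowed entries get a large negative value that is added to the raw
    # attention scores before the softmax.
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    return torch.triu(mask, diagonal=1)

def expand_padding_mask(attention_mask, dtype=torch.float32):
    # attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding.
    expanded = attention_mask[:, None, None, :].to(dtype)   # (batch, 1, 1, seq_len)
    return (1.0 - expanded) * torch.finfo(dtype).min        # 0 = keep, large negative = mask

# The user-supplied 2D padding mask and the causal mask are combined into the
# 4D additive mask that the decoder layers actually consume.
attention_mask = torch.tensor([[1, 1, 1, 0]])                        # last position is padding
combined = make_causal_mask(4) + expand_padding_mask(attention_mask)  # (1, 1, 4, 4)
print(combined)  # doubly-masked entries may saturate to -inf; real code avoids this with masked_fill
```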
I had some trouble reproducing the InstructBLIP model results on the msvd_qa and msrvtt_qa datasets. Could you please tell me what prompt template and hyperparameters were used for these datasets? I raised a question in #333, but so far no one has answered. @NielsRogge Thank you
@NielsRogge Hi, thanks for your reply~ The T5/OPT language models do indeed implement a causal attention mask in code. However, the BLIP-2 paper says that T5/OPT is only used in Stage 2, while in Stage 1 only the Q-Former is trained. It appears that a causal attention mask is not used in the Q-Former's text generation task (see the toy sketch after the links below for the two mask types).
Attention mask in the Stage 1 text generation task: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_qformer.py#LL242C1-L250C1
YAML config for BLIP-2 Stage 1 training: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/projects/blip2/train/pretrain_stage1.yaml#L7
Model registry entry for BLIP-2: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_qformer.py#L25
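To make the concern concrete, the two mask types differ only in whether future positions are visible. A toy sketch (values of 1 = attend, 0 = blocked, before conversion to the additive form used internally):

```python
import torch

seq_len = 5  # toy number of text tokens in the Q-Former's text branch

# Bi-directional self-attention mask: every token attends to every other token.
# This is what a plain (non-decoder) BERT-style encoder uses.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.long)

# Causal self-attention mask: token i attends only to tokens 0..i, which is
# what an autoregressive (image-grounded) text generation loss requires.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

print(bidirectional_mask)
print(causal_mask)
```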
You can find the conversion to a causal mask here: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/Qformer.py#L886
@LiJunnan1992 But in the captioning task, "is_decoder" is set to False (the default value) here https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_qformer.py#LL243C26-L243C33
https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_qformer.py#L242
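As far as I can tell, the conversion linked above only kicks in when the model is run as a decoder. Below is a condensed paraphrase of that BERT-style get_extended_attention_mask logic (not the verbatim Qformer.py code): when is_decoder is False, the 2D mask is merely broadcast and no causal constraint is added.

```python
import torch

def extend_attention_mask(attention_mask, is_decoder):
    # attention_mask: (batch, seq_len) with 1 = attend, 0 = padding.
    batch_size, seq_length = attention_mask.shape
    if is_decoder:
        # Build a (batch, seq, seq) lower-triangular mask and combine it with the
        # padding mask, so each position only sees earlier, non-padded tokens.
        seq_ids = torch.arange(seq_length, device=attention_mask.device)
        causal = (seq_ids[None, None, :] <= seq_ids[None, :, None]).long()
        causal = causal.expand(batch_size, seq_length, seq_length)
        extended = causal[:, None, :, :] * attention_mask[:, None, None, :]
    else:
        # Bi-directional: only padded positions are masked out.
        extended = attention_mask[:, None, None, :]
    # Convert to an additive mask: 0 where attention is allowed, -10000 where not.
    return (1.0 - extended.float()) * -10000.0
```

So if is_decoder really is False in the caption branch, the else path runs and the text tokens would attend bi-directionally, which is exactly the concern raised in this issue.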
Hello,
I was going through the code in BLIP-2's repository and I noticed that in the blip2_qformer.py file, line 242, the text generation task seems to be using a "Bi-directional Self-Attention Mask" instead of the "Causal Self-Attention Mask" mentioned in the BLIP-2 paper. Can you please clarify if this is indeed the case or if there's any other explanation?
Thank you.