salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License
9.69k stars 950 forks

Can we have the weights for BLIP2 aligned with Vicuna 7B, before instruction tuning? #344

Open gordonhu608 opened 1 year ago

gordonhu608 commented 1 year ago

As titled. Thank you !!!

gordonhu608 commented 1 year ago

Actually I just want to confirm is it here: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained.pth

LiJunnan1992 commented 1 year ago

Here are the BLIP2 vicuna7b pretrained weights (before instruction tuning): https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vicuna7b.pth

youthHan commented 1 year ago

> Here are the BLIP2 vicuna7b pretrained weights (before instruction tuning): https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vicuna7b.pth

@LiJunnan1992 Thanks for releasing the BLIP-Vicuna 7B! Is releasing the 13B pretrained model also planned?

utsavgarg commented 1 year ago

@LiJunnan1992 Thanks for sharing the BLIP2 vicuna7b pre-trained weights! I was trying to load the checkpoint, but the file seems to be missing weights for many Q-Former layers (around 74). Could you please check this?
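For anyone hitting the same issue, a minimal sketch of how to see which Q-Former parameters a checkpoint lacks. The BERT-style key names below are assumptions based on how LAVIS names Q-Former submodules; inspect your own state dict to confirm them:

```python
def missing_qformer_keys(model_keys, ckpt_keys):
    """Return Q-Former parameter names the model expects but the
    checkpoint does not provide."""
    return sorted(k for k in model_keys
                  if k.startswith("Qformer.") and k not in ckpt_keys)

# Illustrative key names (assumed, BERT-style as in LAVIS's Q-Former):
model_keys = {
    "Qformer.bert.encoder.layer.0.intermediate.dense.weight",        # text FFN
    "Qformer.bert.encoder.layer.0.intermediate_query.dense.weight",  # query FFN
}
ckpt_keys = {
    "Qformer.bert.encoder.layer.0.intermediate_query.dense.weight",
}
print(missing_qformer_keys(model_keys, ckpt_keys))
# -> ['Qformer.bert.encoder.layer.0.intermediate.dense.weight']
```

In practice you would build `model_keys` from `model.state_dict().keys()` and `ckpt_keys` from the loaded checkpoint dict.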

LiJunnan1992 commented 1 year ago

It is expected that the Q-former's text FFN parameters are missing from the stage-2 pre-trained checkpoint. If you want to initialize an InstructBLIP model, please load these parameters using the stage-1 pre-trained checkpoint. Uncomment this block of code to load the Q-former's text FFN parameters: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_vicuna_instruct.py#L723
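The idea behind that commented block can be sketched as follows. This is not the repo's exact code; the `.intermediate.` / `.output.` substrings (text FFN, versus `.intermediate_query.` / `.output_query.` for the query tokens) are an assumption about LAVIS's Q-Former key naming, so verify against your own checkpoint:

```python
def select_text_ffn(stage1_state):
    """Keep tensors belonging to the Q-Former's text FFN submodules.
    `.intermediate.` deliberately does not match `intermediate_query`,
    because the substring requires a dot right after 'intermediate'."""
    return {k: v for k, v in stage1_state.items()
            if ".intermediate." in k or ".output." in k}

# Intended use (paths are placeholders):
#   stage1 = torch.load("blip2_pretrained.pth", map_location="cpu")["model"]
#   model.load_state_dict(select_text_ffn(stage1), strict=False)
```

`strict=False` lets the partial state dict load without complaining about all the parameters it does not cover.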

ZiqinZhou66 commented 1 year ago

Hi, thanks for your wonderful work. I also tried to generate results with the original Q-Former aligned with Vicuna-7B, before instruction tuning. I made these two modifications:

  1. Changed the pretrained model to 'blip2_pretrained_vicuna7b.pth' instead of 'instruct_blip_vicuna7b_trimmed.pth' in this line: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml#L11

  2. Uncommented the code you mentioned.

However, the text generation output is empty for different input prompts, and 'outputs' is [1, 2] at this line: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_vicuna_instruct.py#L357 (screenshot attached)

Is there any other processing I need to do?

> It is expected that the Q-former's text FFN parameters are missing from the stage-2 pre-trained checkpoint. If you want to initialize an InstructBLIP model, please load these parameters using the stage-1 pre-trained checkpoint. Uncomment this block of code to load the Q-former's text FFN parameters:
> https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_vicuna_instruct.py#L723
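One debugging hint: with a LLaMA/Vicuna tokenizer the special token ids are typically BOS=1 and EOS=2, so an `outputs` of [1, 2] would mean the model emitted EOS immediately, i.e. an empty generation. The id values are an assumption about the tokenizer; check yours. A minimal sketch:

```python
# Assumed LLaMA-style special token ids; verify against your tokenizer.
BOS_ID, EOS_ID = 1, 2

def content_ids(output_ids):
    """Strip BOS/EOS so only actually generated tokens remain."""
    return [i for i in output_ids if i not in (BOS_ID, EOS_ID)]

print(content_ids([1, 2]))  # empty list: the model generated nothing
```

If `content_ids` is empty, the problem is not in decoding but in the model stopping immediately, which points at a checkpoint/config mismatch rather than a tokenizer issue.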

LiJunnan1992 commented 1 year ago

Hi, if you are not doing instruction tuning, please set qformer_text_input=False, and there is no need to load the stage-1 model.
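For anyone looking for where to set this, a minimal sketch of the model section of the config (assuming `qformer_text_input` is read from `blip2_instruct_vicuna7b.yaml`, as the parameter name suggests; verify against the current repo layout):

```yaml
model:
  arch: blip2_vicuna_instruct
  model_type: vicuna7b
  # Do not feed the instruction text to the Q-Former; then no stage-1
  # text-FFN weights are required.
  qformer_text_input: False
```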

ldfandian commented 1 year ago

> Here are the BLIP2 vicuna7b pretrained weights (before instruction tuning): https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vicuna7b.pth

@LiJunnan1992 can you please help shed some light on the difference between:

  • blip2-vicuna7b and instructblip-vicuna7b?

I actually tried image captioning with the provided blip2_pretrained_vicuna7b.pth model (using a BLIP2 Vicuna model modified based on blip2_instruct_vicuna.py), and found a lot of hallucinated descriptions in the generated captions. ( https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vicuna7b.pth )

In comparison, the InstructBLIP model is much better, although it still produces some hallucinated descriptions (more than the BLIP2 FlanT5-XL model on the image captioning task).

Are these findings expected?

MeinhardMark commented 9 months ago

> Here are the BLIP2 vicuna7b pretrained weights (before instruction tuning): https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vicuna7b.pth

Also, are there BLIP2 Llama2-7B pretrained weights available before instruction tuning?