gordonhu608 opened this issue 1 year ago
Actually, I just want to confirm: is it this one? https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained.pth
Here is the BLIP2 vicuna7b pretrained weights (before instruction tuning): https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vicuna7b.pth
@LiJunnan1992 Thanks for releasing the BLIP-Vicuna 7B! Is releasing the 13B pretrained model also in your plans?
@LiJunnan1992 Thanks for sharing the BLIP2 vicuna7b pre-trained weights! I was trying to load the checkpoint but the file seems to be missing weights for a lot of Qformer layers (around 74). Can you please check this?
It is expected that the Q-former's text FFN parameters are missing from the stage-2 pre-trained checkpoint. If you want to initialize an InstructBLIP model, please load these parameters using the stage-1 pre-trained checkpoint. Uncomment this block of code to load the Q-former's text FFN parameters: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_vicuna_instruct.py#L723
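A minimal sketch of that merge, with illustrative key names (the real checkpoints nest weights under a "model" key and use longer parameter paths): copy the Q-former text-FFN entries, which only the stage-1 checkpoint contains, into the stage-2 state dict before loading.

```python
# Hedged sketch: merge the Q-former text-FFN parameters from the
# stage-1 checkpoint into the stage-2 state dict before loading.
# Key names are illustrative, not the exact names in the checkpoints.

def merge_text_ffn(stage2_state, stage1_state):
    """Copy Q-former text-FFN entries (missing from the stage-2
    checkpoint) over from the stage-1 checkpoint."""
    merged = dict(stage2_state)
    for key, value in stage1_state.items():
        # Text-FFN sublayers are the "intermediate"/"output" modules
        # *without* the "_query" suffix used by the query branch.
        if ("Qformer" in key
                and ("intermediate." in key or "output." in key)
                and "_query" not in key):
            merged.setdefault(key, value)
    return merged

# Toy state dicts standing in for torch.load(...)["model"]:
stage1 = {
    "Qformer.layer.0.intermediate.dense.weight": "text_ffn_w",
    "Qformer.layer.0.intermediate_query.dense.weight": "query_ffn_w",
}
stage2 = {
    "Qformer.layer.0.intermediate_query.dense.weight": "query_ffn_w_tuned",
}
merged = merge_text_ffn(stage2, stage1)
```

`setdefault` keeps the stage-2 (tuned) value whenever a key exists in both checkpoints, so only the genuinely missing text-FFN weights come from stage 1.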
Hi, thanks for your wonderful work. I also tried to generate results with the original Q-former finetuned with Vicuna-7b before instruction tuning. I made these two modifications:
1. Changed the pretrained checkpoint from 'instruct_blip_vicuna7b_trimmed.pth' to 'blip2_pretrained_vicuna7b.pth' on this line: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml#L11
2. Uncommented the code you mentioned.
However, the text generation output is empty for different input prompts, and 'outputs' is [1, 2] at this line: https://github.com/salesforce/LAVIS/blob/59273f651b9bffb193d1b12a235e909e9f826dda/lavis/models/blip2_models/blip2_vicuna_instruct.py#L357
Is there any other processing I need to do?
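For context, here is a minimal sketch of why [1, 2] decodes to an empty string, assuming the LLaMA tokenizer convention where id 1 is the BOS token and id 2 is the EOS token (the vocabulary below is a toy stand-in):

```python
# Hedged sketch: why generated ids [1, 2] decode to an empty caption.
# Assumes LLaMA's convention: id 1 = <s> (BOS), id 2 = </s> (EOS);
# both are special tokens skipped during decoding.

SPECIAL_TOKENS = {1: "<s>", 2: "</s>"}
VOCAB = {0: "<unk>", 1: "<s>", 2: "</s>", 3: "a", 4: "photo"}

def decode(ids, skip_special_tokens=True):
    """Mimic tokenizer decoding for a single sequence."""
    pieces = []
    for i in ids:
        if skip_special_tokens and i in SPECIAL_TOKENS:
            continue
        pieces.append(VOCAB[i])
    return " ".join(pieces)

# The model emits EOS immediately after BOS, so nothing is left:
print(decode([1, 2]))  # -> ""
```

In other words, the model is generating EOS as its very first token, which is why every prompt produces an empty caption.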
Hi, if you are not doing instruction tuning, please set qformer_text_input=False; there is then no need to load the stage-1 model.
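A minimal sketch of what that flag controls, using hypothetical names (not the real model code): with the flag off, only the learned query tokens go through the Q-former, so the stage-1 text-FFN weights are never exercised and need not be loaded.

```python
# Hedged sketch (hypothetical names): what qformer_text_input toggles.
# With the flag off, only the learned query tokens attend to the image,
# so the Q-former's text branch (and its text-FFN weights) is unused.

def qformer_inputs(query_tokens, instruction_tokens, qformer_text_input):
    """Return the token streams the Q-former would consume."""
    if qformer_text_input:
        # InstructBLIP-style: queries plus the tokenized instruction.
        return query_tokens + instruction_tokens
    # Plain BLIP-2 stage 2: queries only; no text branch, no text FFN.
    return query_tokens

queries = ["q0", "q1"]
instr = ["Describe", "the", "image"]
```

For example, `qformer_inputs(queries, instr, False)` returns just the query tokens, matching the plain BLIP-2 setup.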
@LiJunnan1992, can you please help shed some light on the difference between blip2-vicuna7b and instructblip-vicuna7b?
I actually tried image captioning using the provided blip2_pretrained_vicuna7b.pth model (with a blip2 vicuna model modified based on blip2_instruct_vicuna.py), and found a lot of hallucinated descriptions in the generated captions.
In comparison, the InstructBLIP model is much better, although it still produces some hallucinated descriptions (more than the blip2 flan-t5-xl model on the image captioning task).
Are these findings expected?
Also, are there BLIP2 Llama2-7b pretrained weights from before instruction tuning?
As titled. Thank you!!!