salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Some doubts about weights #98

Status: Open · SKBL5694 opened this issue 2 years ago

SKBL5694 commented 2 years ago

I used train_vqa.py to fine-tune the original weights "model_base_vqa_capfilt_large.pth", which are about 1.34 GB on disk. But after fine-tuning finishes, the new checkpoint is about 4.04 GB. Both sets of weights can be loaded by the model. What causes the size difference between the two?
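For anyone debugging the same thing, here is a minimal sketch (not part of the repo) for comparing the two files. It assumes both are ordinary `torch.save` checkpoints and that the fine-tuned file may wrap its weights under a `'model'` key, as the training scripts typically do; the fine-tuned file path below is hypothetical.

```python
import torch

def summarize(path):
    ckpt = torch.load(path, map_location="cpu")
    # Unwrap a {'model': state_dict, ...} checkpoint if needed.
    state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
    tensors = {k: v for k, v in state.items() if torch.is_tensor(v)}
    total = sum(v.numel() for v in tensors.values())
    print(f"{path}: {len(tensors)} tensors, {total / 1e6:.1f}M parameters")
    return tensors

pretrained = summarize("model_base_vqa_capfilt_large.pth")
finetuned = summarize("output/vqa/checkpoint_latest.pth")  # hypothetical path

# Keys that exist only in the fine-tuned checkpoint point to where the extra size lives.
print(sorted(set(finetuned) - set(pretrained))[:20])
```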

LiJunnan1992 commented 2 years ago

Hi, for the VQA task, the text encoder and decoder do not share parameters (their parameters are shared during pre-training).
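To see concretely why un-sharing the text encoder and decoder grows the checkpoint, here is a toy sketch (not BLIP code): when two modules literally hold the same parameter object, `torch.save` serializes the underlying storage only once, but separate copies are each written in full.

```python
import os
import torch
import torch.nn as nn

class TextStack(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(1024, 1024)

encoder, decoder = TextStack(), TextStack()

# Tied: the decoder reuses the encoder's parameters (as during pre-training).
decoder.layer.weight = encoder.layer.weight
decoder.layer.bias = encoder.layer.bias
torch.save({"enc": encoder.state_dict(), "dec": decoder.state_dict()}, "tied.pth")

# Untied: the decoder gets its own copy (as in the VQA fine-tuned model).
decoder2 = TextStack()
decoder2.load_state_dict(encoder.state_dict())
torch.save({"enc": encoder.state_dict(), "dec": decoder2.state_dict()}, "untied.pth")

# The untied file is roughly twice as large, even though both models
# produce identical outputs right after saving.
print(os.path.getsize("tied.pth"), os.path.getsize("untied.pth"))
```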

SKBL5694 commented 2 years ago

> Hi, for the VQA task, the text encoder and decoder do not share parameters (their parameters are shared during pre-training).

Thanks for your reply. I believe you are referring to Section 4.4 and Fig. 5 of your paper, but I am still confused; I don't think I fully understand it yet.

First, is there a way to fine-tune the model on the VQA task while keeping the parameters shared between the encoder and decoder?

Second, weren't the original weights obtained by fine-tuning on VQA? If they were fine-tuned on VQA, why do the encoder and decoder parameters stay the same? And if they were not fine-tuned, why do the original weights perform well in my VQA tests?

If the answer is in the paper, please point me to the relevant part and I will read it carefully again. Sorry for asking here before fully understanding the paper; I look forward to any further hints.
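One way to probe the second question yourself is to check whether corresponding text-encoder and text-decoder tensors in the released VQA checkpoint are numerically identical. The sketch below is an assumption-laden example, not repo code: it assumes the state dict may sit under a `'model'` key and uses the usual `text_encoder.` / `text_decoder.` prefixes (with the decoder's BERT weights nested under an inner `bert.`).

```python
import torch

ckpt = torch.load("model_base_vqa_capfilt_large.pth", map_location="cpu")
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

def tail(key, prefix):
    # Strip the module prefix; the decoder nests its BERT weights under 'bert.'.
    s = key[len(prefix):]
    return s[len("bert."):] if s.startswith("bert.") else s

enc = {tail(k, "text_encoder."): k for k in state if k.startswith("text_encoder.")}
dec = {tail(k, "text_decoder."): k for k in state if k.startswith("text_decoder.")}

identical = sum(
    1
    for suffix, ek in enc.items()
    if suffix in dec
    and state[ek].shape == state[dec[suffix]].shape
    and torch.equal(state[ek], state[dec[suffix]])
)
print(f"{identical} / {len(enc)} encoder tensors have an identical decoder counterpart")
```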