salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Is the LM better than MLM? #100

Open SKBL5694 opened 1 year ago

SKBL5694 commented 1 year ago

I found that in the BLIP paper you define the loss as ITC + ITM + LM, whereas in ALBEF the loss is defined as ITC + ITM + MLM. Is LM better than MLM, or are there other reasons you used LM instead of MLM?

LiJunnan1992 commented 1 year ago

Hi, the primary reason for using LM is because we want to enable image-to-text generation capability. Both losses perform similarly in terms of VL representation learning (MLM can be slightly better sometimes).
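For readers comparing the two objectives, here is a minimal sketch (not taken from the BLIP or ALBEF code) of how a causal LM loss and a masked LM loss are typically computed from text logits. The tensor names, the `-100` ignore index, and the padding handling are assumptions following common PyTorch practice, not the repo's actual implementation.

```python
# Minimal sketch contrasting the two text losses (illustrative only).
import torch.nn.functional as F

def lm_loss(logits, input_ids, pad_token_id=0):
    """Causal LM loss: position t predicts token t+1 (decoder-style, as in BLIP)."""
    shift_logits = logits[:, :-1, :]            # (B, T-1, V)
    shift_labels = input_ids[:, 1:].clone()     # (B, T-1)
    shift_labels[shift_labels == pad_token_id] = -100  # ignore padding positions
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

def mlm_loss(logits, labels):
    """Masked LM loss: predict only the masked positions (encoder-style, as in ALBEF).
    `labels` holds the original token ids at masked positions and -100 elsewhere."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```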

SKBL5694 commented 1 year ago

> Hi, the primary reason for using LM is because we want to enable image-to-text generation capability. Both losses perform similarly in terms of VL representation learning (MLM can be slightly better sometimes).

Thanks for the reply. In Chapters 5 and 6 of the ALBEF paper, I see that the model can also do the VQA task, and you say you treat VQA as an answer generation problem. Does that mean you add a decoder for the VQA task (a downstream task) and train a task-specific decoder that is not included in the pre-trained model? In BLIP, however, that decoder is included in the pre-trained model. Am I right?

LiJunnan1992 commented 1 year ago

For BLIP, the decoder is included in the pre-trained model. For ALBEF, we use the pre-trained encoder to initialize the decoder.
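As a rough illustration of initializing a decoder from a pre-trained encoder, one could copy every parameter whose name and shape match and leave decoder-only modules (e.g. newly added cross-attention) at their fresh initialization. The function name and the match-by-parameter-name strategy below are assumptions for illustration, not the actual ALBEF code.

```python
# Rough sketch: warm-start a decoder from a pre-trained encoder checkpoint.
import torch.nn as nn

def init_decoder_from_encoder(decoder: nn.Module, encoder: nn.Module):
    enc_state = encoder.state_dict()
    dec_state = decoder.state_dict()
    copied = []
    for name, param in enc_state.items():
        # Copy layers that exist in both models with identical shapes;
        # anything decoder-specific keeps its random initialization.
        if name in dec_state and dec_state[name].shape == param.shape:
            dec_state[name] = param.clone()
            copied.append(name)
    decoder.load_state_dict(dec_state)
    return copied  # names of the parameters that were transferred
```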