Open SKBL5694 opened 2 years ago
Hi, the primary reason for using LM is because we want to enable image-to-text generation capability. Both losses perform similarly in terms of VL representation learning (MLM can be slightly better sometimes).
Thanks for the reply. In Sections 5 and 6 of the ALBEF paper, I see that the model can also do the VQA task, and you say you treat VQA as an answer-generation problem. Does that mean you add a decoder for the VQA task (a downstream task) and train a task-specific decoder that is not part of the pre-trained model? In BLIP, by contrast, that decoder is included in the pre-trained model. Am I right?
For BLIP, the decoder is part of the pre-trained model. For ALBEF, we use the pre-trained encoder to initialize the decoder.
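The ALBEF-style initialization described above can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not ALBEF's actual code: the modules and names below are stand-ins, and the real implementation initializes a text decoder from the pre-trained text encoder's weights.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a "decoder" with the same architecture as the
# pre-trained "encoder" is initialized by copying the encoder's weights.
encoder = nn.TransformerEncoderLayer(d_model=64, nhead=4)  # stands in for the pre-trained encoder
decoder = nn.TransformerEncoderLayer(d_model=64, nhead=4)  # same architecture, fresh weights

# Initialize the decoder from the pre-trained encoder's parameters.
decoder.load_state_dict(encoder.state_dict())
```

After this copy, the decoder starts from the encoder's representations and is then fine-tuned with a generation objective for the downstream VQA task.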
I found that in the BLIP paper, you define the loss as ITC + ITM + LM. However, in ALBEF, the loss is defined as ITC + ITM + MLM. Is LM better than MLM, or are there other reasons you used LM instead of MLM?
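The difference between the two text losses asked about here can be sketched with dummy logits. This is an illustrative toy example, not BLIP's or ALBEF's code: LM (causal language modeling) predicts the next token at every position, while MLM scores only the masked positions. The tensors below are random stand-ins for real model outputs.

```python
import torch
import torch.nn.functional as F

vocab, seq = 10, 5
tokens = torch.randint(0, vocab, (seq,))   # toy token ids
logits = torch.randn(seq, vocab)           # stand-in for model outputs

# LM (used in BLIP): causal next-token prediction, so targets are
# the input shifted by one position.
lm_loss = F.cross_entropy(logits[:-1], tokens[1:])

# MLM (used in ALBEF): predict only the masked positions
# (here position 2 is "masked" for illustration).
masked = torch.tensor([2])
mlm_loss = F.cross_entropy(logits[masked], tokens[masked])
```

The LM objective is what gives BLIP its image-to-text generation capability, which is the reason stated in the reply above; as noted there, both perform similarly for VL representation learning.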