salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

How does VQA work on BLIP2 without LLM? #659

Open jihwanp opened 7 months ago

jihwanp commented 7 months ago

Hi, I noticed that the BLIP2 model without an LLM (1st-stage pretrained) can perform the zero-shot VQA task. I'm curious which mechanism generates the answer to the question: ITG or ITM? Thanks

tanukon commented 7 months ago

I think Image-Grounded Text Generation (ITG) does. Based on the paper, ITM applies binary classification to learn a more precise alignment between the image and text representations, so it can only score an image-text pair. ITG, on the other hand, conditions on the visual information (the query embeddings Z in the paper) to generate text and compares it against the input text in Figure 2, like a normal sentence-generation task, which is what allows it to produce an answer.
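
If it helps, here is a minimal sketch of how the two heads can be exercised through the LAVIS loading utilities, just to illustrate the difference the comment above describes (ITM scores a given text against the image, ITG decodes text grounded on the image). The image path and the candidate text are placeholders, and the `generate` call on the stage-1 model is an assumption that may need adjusting to the exact API in your LAVIS version; for actual VQA you would additionally condition the ITG decoding on the question as a text prefix.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ITM head: binary classification over the fused image-text representation.
# This follows the BLIP-2 image-text-matching example shipped with LAVIS.
itm_model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_image_text_matching", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")   # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("the man is riding a horse")  # placeholder candidate text

# ITM can only *score* the pair; it cannot produce free-form text.
itm_logits = itm_model({"image": image, "text_input": text}, match_head="itm")
print("match probability:", torch.softmax(itm_logits, dim=1)[:, 1].item())

# ITG head: the Q-Former's text decoder attends to the learned query embeddings
# (Z in the paper) and decodes tokens autoregressively, so it can *produce* text.
# The stage-1 pretrained Q-Former (no LLM) is registered as name="blip2".
itg_model, _, _ = load_model_and_preprocess(
    name="blip2", model_type="pretrain", is_eval=True, device=device
)
generated = itg_model.generate({"image": image})  # signature may vary across LAVIS versions
print("generated text:", generated)
```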