Open jihwanp opened 7 months ago
Hi, I noticed that BLIP-2 without the LLM (i.e., the stage-1 pretrained model) can perform zero-shot VQA. I'm curious which mechanism generates the answer to the question: ITG or ITM? Thanks.

I think Image-Grounded Text Generation (ITG) does. Based on the paper, ITM applies binary classification to learn a more precise alignment between the image and text representations. ITG uses only visual information (the Z embeddings in the paper) to generate text and compares it against the input text, as shown in Figure 2, like a standard sentence-generation task.
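For intuition, here is a minimal toy sketch of the ITG-vs-ITM distinction (random tensors and a one-layer decoder, not the actual BLIP-2/LAVIS code): ITM only produces a match/no-match score for an image-text pair, while ITG cross-attends to the Q-Former query embeddings Z and predicts the next token autoregressively, which is what makes answer generation possible at stage 1.

```python
import torch
import torch.nn as nn

# Toy dimensions (hypothetical, for illustration only)
B, NUM_QUERY, DIM, VOCAB = 2, 32, 64, 100

# Stand-in for the Q-Former's visual query embeddings Z
Z = torch.randn(B, NUM_QUERY, DIM)

# ITM: pool the queries and classify match vs. no-match -> a score, not text
itm_head = nn.Linear(DIM, 2)
itm_logits = itm_head(Z.mean(dim=1))  # shape (B, 2)

# ITG: a causal decoder cross-attends to Z and predicts the next token,
# so an answer can be generated token by token
decoder_layer = nn.TransformerDecoderLayer(d_model=DIM, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)
lm_head = nn.Linear(DIM, VOCAB)

tokens = torch.randn(B, 5, DIM)  # stand-in embeddings of the tokens so far
causal_mask = nn.Transformer.generate_square_subsequent_mask(5)
hidden = decoder(tokens, memory=Z, tgt_mask=causal_mask)
next_token_logits = lm_head(hidden[:, -1])  # shape (B, VOCAB)
```

Since only the ITG path outputs a distribution over vocabulary tokens, it is the mechanism that can produce a free-form answer; ITM can only rank or verify image-text pairs.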