In my understanding, VQA is similar to the zero-shot image-to-text generation ability described in the BLIP-2 paper. Both produce an answer to a prompt (a question or a natural-language instruction) conditioned on an image. So I'm curious: what is the difference between Instructed Zero-shot Image-to-Text Generation and Visual Question Answering in BLIP-2?
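To make my confusion concrete, here is a minimal sketch using the Hugging Face `transformers` port of BLIP-2 (the `Salesforce/blip2-opt-2.7b` checkpoint; the image URL and prompts are just illustrative). Mechanically, both use cases appear to be the same `generate()` call, with only the prompt text changing:

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and model from the Hugging Face hub.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Example image (URL is hypothetical, just for illustration).
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# VQA-style prompt, following the "Question: ... Answer:" template from the paper.
vqa_prompt = "Question: what animal is in the picture? Answer:"

# Instruction-style prompt for zero-shot image-to-text generation.
instruction_prompt = "Describe the animal in the picture in detail."

for prompt in (vqa_prompt, instruction_prompt):
    # Both prompts are encoded together with the same image...
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    # ...and both go through the identical conditional-generation path.
    output_ids = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(output_ids[0], skip_special_tokens=True))
```

If both capabilities reduce to this same image-conditioned generation path, is the distinction in the paper purely about the prompt format and the evaluation protocol, or is there something different happening in the model?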