In my understanding, VQA is similar to the zero-shot image-to-text generation ability described in the BLIP-2 paper. Both produce an answer to a prompt (a question or a natural-language instruction) conditioned on an image. So I'm curious: what is the difference between Instructed Zero-shot Image-to-Text Generation and Visual Question Answering in BLIP-2?
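To make my confusion concrete, here is a minimal sketch using the Hugging Face `transformers` port of BLIP-2 (the `Salesforce/blip2-opt-2.7b` checkpoint; the image URL and prompts are just illustrative). Mechanically, both use cases appear to be the same `generate()` call, with only the prompt text changing:

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and model from the Hugging Face hub.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Example image (URL is hypothetical, just for illustration).
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# VQA-style prompt, following the "Question: ... Answer:" template from the paper.
vqa_prompt = "Question: what animal is in the picture? Answer:"

# Instruction-style prompt for zero-shot image-to-text generation.
instruction_prompt = "Describe the animal in the picture in detail."

for prompt in (vqa_prompt, instruction_prompt):
    # Both prompts are encoded together with the same image...
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    # ...and both go through the identical conditional-generation path.
    output_ids = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(output_ids[0], skip_special_tokens=True))
```

If both capabilities reduce to this same image-conditioned generation path, is the distinction in the paper purely about the prompt format and the evaluation protocol, or is there something different happening in the model?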