
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation #48


nagataka commented 1 year ago

Summary

Link

https://arxiv.org/abs/2201.12086

Author/Institution

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi (Salesforce Research)

What is this

A vision-language pre-training framework that unifies understanding and generation: a Multimodal mixture of Encoder-Decoder (MED) is jointly trained with image-text contrastive, image-text matching, and language-modeling objectives, and the training corpus is bootstrapped with CapFilt (a captioner that synthesizes captions for web images and a filter that removes noisy ones).

Comparison with previous research. What are the novelties/strong points?

Most existing pre-trained models "only excel in either understanding-based tasks or generation-based tasks."

BLIP can be transferred "flexibly to both vision-language understanding and generation tasks" (see the captioning sketch below).
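
As a quick look at the generation side, the sketch below runs image captioning with a publicly released BLIP checkpoint. It is a minimal example, not the paper's training code; the model id (`Salesforce/blip-image-captioning-base`), the example image URL, and the Hugging Face `transformers` API usage are assumptions based on the public release.

```python
# Minimal sketch: caption an image with a released BLIP checkpoint
# (assumes the `transformers`, `Pillow`, and `requests` packages are installed).
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any RGB image works; a COCO validation image is used here purely as an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Unconditional captioning: encode the image and decode a caption with the LM head.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```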

Key points

How did the authors prove the effectiveness of the proposal?

"We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score)." (from the abstract)

Any discussions?

What should I read next?