Proposes BLIP: Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation
Two contributions:
Model perspective: Multimodal mixture of Encoder-Decoder (MED)
The model is jointly pre-trained with three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modeling.
Data perspective: Captioning and Filtering (CapFilt) - see Figure 3; Table 1 shows its effectiveness
A captioner produces synthetic captions for web images, and a filter removes noisy captions from both the original web texts and the synthetic texts (see the sketch below)
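A rough Python sketch of how the CapFilt bootstrapping step could look. The names (`capfilt`, `captioner.generate`, `filter_model.matches`) and the pair formats are hypothetical stand-ins for the fine-tuned MED modules described in the paper, not BLIP's actual code.

```python
# Hypothetical sketch of the CapFilt dataset-bootstrapping step (not the paper's code).
# `captioner` stands in for the MED fine-tuned as an image-grounded text decoder,
# `filter_model` for the MED fine-tuned as an image-grounded text encoder (ITM head).

def capfilt(web_pairs, human_pairs, captioner, filter_model):
    """Bootstrap a cleaner pre-training corpus from noisy web image-text pairs.

    web_pairs:    iterable of (image, web_text) scraped pairs (noisy)
    human_pairs:  list of (image, text) human-annotated pairs (e.g. COCO)
    captioner:    generates a synthetic caption for an image
    filter_model: predicts whether an image-text pair matches
    """
    bootstrapped = []
    for image, web_text in web_pairs:
        # Captioner: produce a synthetic caption for the web image.
        synthetic_text = captioner.generate(image)

        # Filter: keep only texts judged to match the image, applied to BOTH
        # the original web text and the synthetic caption.
        for text in (web_text, synthetic_text):
            if filter_model.matches(image, text):
                bootstrapped.append((image, text))

    # Human-annotated pairs are kept as-is; the combined set is used to
    # pre-train a new model.
    return bootstrapped + list(human_pairs)
```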
Comparison with previous research: what are the novelties/good points?
Most existing pre-trained models excel only at either understanding-based tasks or generation-based tasks
BLIP can be transferred "flexibly to both vision-language understanding and generation tasks"
Key points
Architecture: See Figure 2
Unimodal encoder
Image-grounded text encoder
Image-grounded text decoder
Three losses: ITC (image-text contrastive), ITM (image-text matching), and LM (language modeling)
The three objectives are jointly optimized during pre-training: two understanding-based objectives (ITC, ITM) and one generation-based objective (LM) - see the sketch below
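A toy PyTorch sketch of how the three objectives could be computed from the three MED modes. The module (`ToyMED`), the GRU/linear stand-ins, the additive feature fusion, and all dimensions are invented for illustration; the real MED uses a ViT image encoder and BERT-style text transformers with cross-attention, and mines hard negatives for ITM.

```python
# Toy sketch of the three MED pre-training objectives (ITC, ITM, LM).
# Modules and shapes are simplified stand-ins, not BLIP's ViT/BERT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMED(nn.Module):
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.image_encoder = nn.Linear(512, dim)                 # stand-in for the ViT image encoder
        self.text_embed = nn.Embedding(vocab_size, dim)          # shared text embedding
        self.text_encoder = nn.GRU(dim, dim, batch_first=True)   # stand-in unimodal / image-grounded text encoder
        self.text_decoder = nn.GRU(dim, dim, batch_first=True)   # stand-in image-grounded text decoder
        self.itm_head = nn.Linear(dim, 2)                        # matched / not matched
        self.lm_head = nn.Linear(dim, vocab_size)
        self.temp = 0.07

    def forward(self, image_feats, text_ids):
        B = image_feats.size(0)
        img = F.normalize(self.image_encoder(image_feats), dim=-1)          # (B, dim)

        # 1) Unimodal text encoder -> ITC: align image and text embeddings contrastively.
        txt_emb = self.text_embed(text_ids)                                  # (B, T, dim)
        _, h = self.text_encoder(txt_emb)
        txt = F.normalize(h[-1], dim=-1)                                     # (B, dim)
        sim = img @ txt.t() / self.temp
        targets = torch.arange(B)
        itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

        # 2) Image-grounded text encoder -> ITM: binary matched/unmatched classification.
        #    (Real BLIP injects image features via cross-attention and samples hard
        #     negatives; this toy just adds the features and uses positives only.)
        fused, _ = self.text_encoder(txt_emb + img.unsqueeze(1))
        itm_logits = self.itm_head(fused[:, -1])
        itm = F.cross_entropy(itm_logits, torch.ones(B, dtype=torch.long))

        # 3) Image-grounded text decoder -> LM: autoregressive caption generation.
        dec_in = txt_emb[:, :-1] + img.unsqueeze(1)
        dec_out, _ = self.text_decoder(dec_in)
        lm = F.cross_entropy(
            self.lm_head(dec_out).reshape(-1, self.lm_head.out_features),
            text_ids[:, 1:].reshape(-1),
        )

        # Jointly optimize the two understanding objectives (ITC, ITM) and the
        # generation objective (LM).
        return itc + itm + lm

# Example usage with random stand-in data:
# model = ToyMED()
# loss = model(torch.randn(4, 512), torch.randint(0, 1000, (4, 12)))
```

In the paper, the image-grounded text encoder and decoder share their parameters except the self-attention layers (bi-directional for the encoder, causal for the decoder); the toy above keeps them separate only for readability.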
How did the authors prove the effectiveness of the proposal?
"We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score)."