
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation #48


nagataka commented 1 year ago

Summary

Link

https://arxiv.org/abs/2201.12086

Author/Institution

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi (Salesforce Research)

What is this

A vision-language pre-training framework that unifies understanding and generation: a Multimodal mixture of Encoder-Decoder (MED) is jointly trained with image-text contrastive, image-text matching, and language-modeling objectives, and the training corpus is bootstrapped with CapFilt (a captioner that synthesizes captions for web images and a filter that removes noisy ones).

Comparison with previous research. What are the novelties/strong points?

Most existing pre-trained models "only excel in either understanding-based tasks or generation-based tasks."

BLIP can be transferred "flexibly to both vision-language understanding and generation tasks" (see the captioning sketch below).
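
As a quick look at the generation side, the sketch below runs image captioning with a publicly released BLIP checkpoint. It is a minimal example, not the paper's training code; the model id (`Salesforce/blip-image-captioning-base`), the example image URL, and the Hugging Face `transformers` API usage are assumptions based on the public release.

```python
# Minimal sketch: caption an image with a released BLIP checkpoint
# (assumes the `transformers`, `Pillow`, and `requests` packages are installed).
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any RGB image works; a COCO validation image is used here purely as an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Unconditional captioning: encode the image and decode a caption with the LM head.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```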

Key points

How did the authors prove the effectiveness of the proposal?

"We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score)." (from the abstract)

Any discussions?

What should I read next?