BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Li et al., arXiv 2023
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
🔑 Key idea:
By exploiting a frozen LLM and a frozen image encoder, BLIP-2 unlocks zero-shot vision-language capabilities, most notably image-grounded conversation driven by conversational prompts.
"Since LLMs have not seen images during their unimodal pre-training, freezing them makes vision-language alignment in particular challenging."
💪 Strength:
In contrast to other adapter-layer works, they use a set of learned queries (32 × 768 dims) as input to the image transformer (the left pillar of the Q-Former, as opposed to the right pillar, the text transformer).
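A minimal PyTorch sketch of the learned-query idea (the 32 × 768 shape follows the paper; the module and variable names here are my own, not the authors'):

```python
import torch
import torch.nn as nn

class QueryBank(nn.Module):
    """Learnable query embeddings that the Q-Former's image transformer
    uses to cross-attend into the frozen image encoder's features."""
    def __init__(self, num_queries: int = 32, dim: int = 768):
        super().__init__()
        # Trainable parameters; the image encoder and LLM stay frozen.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        # Every image starts from the same queries, expanded to
        # (batch, num_queries, dim) before cross-attention.
        return self.queries.unsqueeze(0).expand(batch_size, -1, -1)

# QueryBank()(batch_size=4).shape -> torch.Size([4, 32, 768])
```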
It incorporates well-known, proven losses: Image-Text Matching (ITM), Image-Text Contrastive (ITC), and Image-grounded Text Generation (ITG). Three different attention-masking strategies control the query-text interaction so that the queries extract maximally informative image representations in the first stage of pre-training. The authors emphasize this with "This bottleneck architecture works together with our pre-training objectives into forcing the queries to extract visual information that is most relevant to the text" in Sec 3.1, and "the queries are forced to extract visual features that capture all the information about the text" in Sec 3.2.
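A rough sketch of the three masking strategies as I read Sec 3.1 (boolean-mask convention assumed, True = attention allowed; this is not the authors' code):

```python
import torch

def blip2_attention_masks(num_queries: int, num_text: int):
    """Build the three self-attention masks over a concatenated
    [queries; text] sequence, one per pre-training objective."""
    n = num_queries + num_text
    q, t = slice(0, num_queries), slice(num_queries, n)

    # ITC (uni-modal mask): queries and text cannot see each other.
    itc = torch.zeros(n, n, dtype=torch.bool)
    itc[q, q] = True
    itc[t, t] = True

    # ITM (bi-directional mask): all positions attend to all positions.
    itm = torch.ones(n, n, dtype=torch.bool)

    # ITG (multi-modal causal mask): queries see only queries; each text
    # token sees all queries plus the text tokens before it.
    itg = torch.zeros(n, n, dtype=torch.bool)
    itg[q, q] = True
    itg[t, q] = True
    itg[t, t] = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))

    return itc, itm, itg
```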
The second stage aligns the Q-Former with any LLM, whether decoder-based (e.g., OPT) or encoder-decoder-based (e.g., FlanT5): "The projected query embeddings are then prepended to the input text embeddings," acting as "soft visual prompts" for the LLM.
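A minimal sketch of that prepending step (the shapes, the helper name, and the OPT-2.7B hidden size of 2560 are my own assumptions, not taken from the paper's code):

```python
import torch
import torch.nn as nn

def build_llm_inputs(query_output: torch.Tensor,  # (B, 32, 768) from the Q-Former
                     text_embeds: torch.Tensor,   # (B, T, d_llm) from the LLM's embedding layer
                     proj: nn.Linear) -> torch.Tensor:
    # Project the query embeddings to the LLM's hidden size, then prepend
    # them to the text embeddings as "soft visual prompts".
    visual_prompts = proj(query_output)                      # (B, 32, d_llm)
    return torch.cat([visual_prompts, text_embeds], dim=1)   # (B, 32 + T, d_llm)

proj = nn.Linear(768, 2560)  # 2560 = OPT-2.7B hidden size (assumed here)
inputs_embeds = build_llm_inputs(torch.randn(2, 32, 768), torch.randn(2, 8, 2560), proj)
# `inputs_embeds` would then be fed to the frozen LLM, with only the Q-Former
# and `proj` receiving gradients.
```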
😵 Weakness:
Why not one-stage training, and what would be its drawbacks? There is no ablation study and no explanation.
I love this paper, but what would you say is the most novel idea? Using the lightweight Querying Transformer?
🤔 Confidence:
High
✏️ Memo:
The batch sizes are 2320/1680 for ViT-L/ViT-g in the first stage and 1920/1520 for OPT/FlanT5 in the second stage.
Frozen models are kept in FP16 (or BFloat16 for FlanT5), with no performance degradation compared with the 32-bit format (sketched below).
A single 16×A100 (40G) machine: < 6 days (1st stage) + < 3 days (2nd stage).
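A minimal sketch of the precision setup noted above (generic PyTorch; function and variable names are mine): the frozen models are cast to half precision while the Q-Former stays trainable.

```python
import torch
import torch.nn as nn

def freeze_in_half(module: nn.Module, dtype: torch.dtype = torch.float16) -> nn.Module:
    """Freeze a pre-trained module and cast it to half precision."""
    for p in module.parameters():
        p.requires_grad = False
    return module.to(dtype).eval()

# e.g. (hypothetical variables):
# image_encoder = freeze_in_half(image_encoder)         # ViT in FP16
# llm = freeze_in_half(llm, dtype=torch.bfloat16)       # FlanT5 in BF16
# The Q-Former and the projection layer remain trainable in full precision.
```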