Three stages for synthesizing high-quality instruction tuning data for text-centric VQA
Data Collection: gathering large-scale images containing textual elements with diverse properties.
Data Generation: self-questioning, answering, and reasoning over the collected images.
Data Filtering: self-evaluation of the generated content, discarding meaningless questions and erroneous answers by leveraging the evaluation capabilities of MLLMs (the overall pipeline is sketched below).
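As a rough illustration, the three stages can be chained as below; this is a minimal sketch, not the paper's implementation, and the `generate` and `keep` parameters are hypothetical stand-ins for the generation and filtering steps described in the following subsections.

```python
from typing import Callable, Dict, Iterable, List

VQAPair = Dict[str, str]  # e.g. {"question": ..., "answer": ..., "reasoning": ...}

def build_square_dataset(
    images: Iterable[str],
    generate: Callable[[str], List[VQAPair]],  # Data Generation: self-questioning, answering, reasoning
    keep: Callable[[str, VQAPair], bool],      # Data Filtering: self-evaluation + consistency checks
) -> List[VQAPair]:
    """Chain the three stages: collection -> generation -> filtering."""
    dataset: List[VQAPair] = []
    for image_path in images:                  # Data Collection: text-rich images
        for pair in generate(image_path):      # VQA pairs produced by the MLLM
            if keep(image_path, pair):         # discard low-quality pairs
                dataset.append(pair)
    return dataset
```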
Data Collection
3.8 million unlabeled text-rich images are collected, grouped by the nature of their textual content (captured as a small taxonomy after the list):
Chart and Table focus on textual elements with intensive statistical information.
Slide, Screenshot, and WebImage target the interaction between text and prominent visual messages.
Document/PDF, Receipt, and E-commerce contain images with fine-grained and dense text.
Street-View is derived from natural scenes.
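The groupings above can be captured as a small lookup table, e.g. for stratified collection; the group labels here are descriptive names invented for this sketch, not terminology from the paper.

```python
# Image categories grouped by the property of their textual content
# (group names are descriptive labels for this sketch only).
IMAGE_CATEGORIES = {
    "statistical_text": ["Chart", "Table"],
    "text_visual_interaction": ["Slide", "Screenshot", "WebImage"],
    "fine_dense_text": ["Document/PDF", "Receipt", "E-commerce"],
    "natural_scene_text": ["Street-View"],
}

def category_group(category: str) -> str:
    """Return the property group a collected image category belongs to."""
    for group, categories in IMAGE_CATEGORIES.items():
        if category in categories:
            return group
    raise KeyError(f"Unknown category: {category}")
```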
Data Generation: Self-Questioning, Answering, and Reasoning
Stage 1: Self-Questioning. We ask Gemini Pro to first comprehensively analyze the image and then raise questions based on its understanding.
Since MLLMs typically have weaker understanding capabilities of the textual elements within images, we also add the text extracted by expert OCR models to the prompt as supplementary information.
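A minimal sketch of how the Stage 1 prompt might be assembled, assuming hypothetical callables `run_ocr` (the expert OCR model) and `query_mllm` (Gemini Pro taking a prompt and an image); the prompt wording and the question count are illustrative, not the paper's.

```python
from typing import Callable, List

def self_questioning(
    image_path: str,
    run_ocr: Callable[[str], str],          # hypothetical expert OCR model
    query_mllm: Callable[[str, str], str],  # hypothetical (prompt, image) -> text, e.g. Gemini Pro
    num_questions: int = 5,                 # illustrative value, not from the paper
) -> List[str]:
    """Stage 1: ask the MLLM to analyze the image and raise questions."""
    ocr_text = run_ocr(image_path)  # compensate for weaker understanding of in-image text
    prompt = (
        f"Extracted text from the image:\n{ocr_text}\n\n"
        "First analyze the image comprehensively, then raise "
        f"{num_questions} meaningful questions about its textual content."
    )
    response = query_mllm(prompt, image_path)
    return [line.strip() for line in response.splitlines() if line.strip()]
```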
Stage 2: Answering. Gemini Pro is then instructed to give appropriate answers to the generated questions.
Chain-of-Thought (CoT) and few-shot prompting are used to enrich the contextual information and improve the reliability of the answers (sketched together with Stage 3 below).
Stage 3: Reasoning. We require Gemini Pro to elaborate on the detailed reasoning behind its answers, forcing it to attend to the connections between the questions and the visual elements, thus reducing hallucinations and producing more accurate answers.
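Stages 2 and 3 can be sketched together as below, again with the hypothetical `query_mllm` callable; the CoT/few-shot prompt text is illustrative only.

```python
from typing import Callable, List, Tuple

def answer_and_reason(
    image_path: str,
    questions: List[str],
    query_mllm: Callable[[str, str], str],  # hypothetical (prompt, image) -> text
    few_shot_examples: str = "",            # optional exemplars for few-shot prompting
) -> List[Tuple[str, str, str]]:
    """Stages 2 and 3: answer each question, then elaborate the reasoning."""
    results = []
    for question in questions:
        # Stage 2: CoT-style answering, optionally preceded by few-shot exemplars.
        answer_prompt = (
            f"{few_shot_examples}"
            f"Question: {question}\n"
            "Think step by step about the relevant visual and textual elements, "
            "then give the final answer."
        )
        answer = query_mllm(answer_prompt, image_path)

        # Stage 3: ask for the detailed reasoning behind the answer.
        reason_prompt = (
            f"Question: {question}\nAnswer: {answer}\n"
            "Explain in detail which visual and textual elements support this answer."
        )
        reasoning = query_mllm(reason_prompt, image_path)
        results.append((question, answer, reasoning))
    return results
```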
Data Filtering: Self-Evaluation and Answering Consistency
Despite the effectiveness of the Square strategy, the generated image-text pairs may still contain hallucinated content, meaningless questions, and erroneous answers.
We thus devise filtering rules based on the evaluation capabilities of LLMs to select high-quality VQA pairs.
Self-Evaluation of MLLMs. We prompt Gemini Pro, as well as other advanced MLLMs, to judge the correctness of the generated question-answer pairs.
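A minimal sketch of the self-evaluation filter, assuming the same hypothetical `query_mllm` callable as a judge; the yes/no protocol is an assumption for illustration.

```python
from typing import Callable

def self_evaluate(
    image_path: str,
    question: str,
    answer: str,
    query_mllm: Callable[[str, str], str],  # hypothetical judge (Gemini Pro or another MLLM)
) -> bool:
    """Keep a VQA pair only if the judging MLLM deems it meaningful and correct."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this question meaningful for the image, and is the answer correct? "
        "Reply with 'yes' or 'no'."
    )
    verdict = query_mllm(prompt, image_path).strip().lower()
    return verdict.startswith("yes")
```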
Multi-Prompt Consistency. We provide Gemini Pro with different but semantically similar prompts to answer the given question, and discard the VQA pairs whose generated answers are not semantically stable (sketched together with the multi-context check below).
Multi-Context Consistency. We further validate the VQA pairs by prepending varied context information to the question: (1) answering with reasoning, (2) in-context answering, and (3) naive answering without extra context; pairs whose answers are inconsistent across contexts are discarded.
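The two consistency checks can be sketched as follows; the paraphrased prompts, the context strings, and the exact-match consistency test (standing in for a real semantic-similarity measure) are all assumptions of this sketch.

```python
from typing import Callable, List

def _consistent(answers: List[str]) -> bool:
    """Trivial stand-in for a semantic-consistency check (e.g. embedding similarity)."""
    normalized = {a.strip().lower() for a in answers}
    return len(normalized) == 1

def passes_consistency_checks(
    image_path: str,
    question: str,
    query_mllm: Callable[[str, str], str],  # hypothetical (prompt, image) -> text
    paraphrased_prompts: List[str],         # different but semantically similar phrasings
) -> bool:
    """Multi-prompt and multi-context consistency filtering."""
    # Multi-prompt: the same question under paraphrased instructions.
    multi_prompt_answers = [
        query_mllm(f"{p}\nQuestion: {question}", image_path) for p in paraphrased_prompts
    ]

    # Multi-context: (1) with reasoning, (2) in-context, (3) naive answering.
    contexts = [
        "Reason step by step before answering.",            # answering with reasoning
        "Here is a related example Q/A pair as context.",   # in-context answering (exemplars omitted)
        "",                                                  # naive answering
    ]
    multi_context_answers = [
        query_mllm(f"{c}\nQuestion: {question}".strip(), image_path) for c in contexts
    ]

    return _consistent(multi_prompt_answers) and _consistent(multi_context_answers)
```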
TextSquare
Model Architecture: TextSquare follows the paradigm of InternLM-XComposer2 and consists of three components (composed as in the sketch after this list).
Vision Encoder: modified from OpenAI CLIP ViT-L/14-336, with the input resolution increased to 700 for improved performance.
LLM: based on InternLM-2, using InternLM2-7B-ChatSFT as the practical variant.
Projector: semantically aligns the vision tokens with the text tokens.
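A high-level sketch of how the three components compose at the forward pass; the linear projector and the token concatenation are simplifications for illustration, not the exact InternLM-XComposer2 design.

```python
import torch
import torch.nn as nn

class TextSquareSketch(nn.Module):
    """Illustrative composition: vision encoder -> projector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. CLIP ViT-L/14 adapted to 700px input
        self.projector = nn.Linear(vision_dim, llm_dim)   # aligns vision tokens to the text space
        self.llm = llm                                    # e.g. an InternLM-2 based decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.vision_encoder(pixel_values)         # (B, N_img, vision_dim)
        vision_tokens = self.projector(vision_tokens)             # (B, N_img, llm_dim)
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)   # prepend image tokens to text tokens
        return self.llm(inputs)
```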
Supervised Fine-Tuning (SFT) with Square-10M
In the first stage, we unfreeze all three components (i.e., the Vision Encoder, the LLM, and the Projector) and train the model at a resolution of 490.
In the second stage, the input resolution is increased to 700 and only the Vision Encoder is trained to adapt to the resolution change.
In the third stage, we further perform full-parameter fine-tuning at the resolution of 700 (the schedule is sketched below). TextSquare demonstrates that with our Square-10M dataset, a model with 8B parameters and normal-size image resolution can achieve extraordinary performance on text-centric VQA, surpassing most available MLLMs and even closed-source SOTA models.
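The three-stage schedule can be summarized as a small config plus a freeze/unfreeze helper; the component attribute names follow the architecture sketch above, and the logic is a simplified illustration rather than the actual training code.

```python
# Three-stage SFT schedule (resolutions from the text; freeze/unfreeze logic simplified).
SFT_STAGES = [
    {"resolution": 490, "trainable": ["vision_encoder", "projector", "llm"]},  # stage 1: full model
    {"resolution": 700, "trainable": ["vision_encoder"]},                      # stage 2: adapt ViT to 700px
    {"resolution": 700, "trainable": ["vision_encoder", "projector", "llm"]},  # stage 3: full fine-tuning
]

def configure_stage(model, stage: dict) -> None:
    """Freeze everything, then unfreeze only the components trained in this stage."""
    for param in model.parameters():
        param.requires_grad = False
    for name in stage["trainable"]:
        for param in getattr(model, name).parameters():
            param.requires_grad = True
```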