Three stages for synthesizing high-quality instruction tuning data for text-centric VQA
Data Collection: gathering large-scale images containing textual elements with diverse properties.
Data Generation: self-questioning, answering, and reasoning over the collected images.
Data Filtering: self-evaluation of the generated content, discarding meaningless questions and erroneous answers by leveraging the evaluation capabilities of MLLMs (the overall pipeline is sketched below).
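As a rough illustration, the three stages can be chained as below; this is a minimal sketch, not the paper's implementation, and the `generate` and `keep` parameters are hypothetical stand-ins for the generation and filtering steps described in the following subsections.

```python
from typing import Callable, Dict, Iterable, List

VQAPair = Dict[str, str]  # e.g. {"question": ..., "answer": ..., "reasoning": ...}

def build_square_dataset(
    images: Iterable[str],
    generate: Callable[[str], List[VQAPair]],  # Data Generation: self-questioning, answering, reasoning
    keep: Callable[[str, VQAPair], bool],      # Data Filtering: self-evaluation + consistency checks
) -> List[VQAPair]:
    """Chain the three stages: collection -> generation -> filtering."""
    dataset: List[VQAPair] = []
    for image_path in images:                  # Data Collection: text-rich images
        for pair in generate(image_path):      # VQA pairs produced by the MLLM
            if keep(image_path, pair):         # discard low-quality pairs
                dataset.append(pair)
    return dataset
```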
Data Collection
3.8 million unlabeled text-rich images are collected, grouped by the nature of their textual content (captured as a small taxonomy after the list):
Chart and Table focus on textual elements with intensive statistical information.
Slide, Screenshot, and WebImage target the interaction between text and prominent visual messages.
Document/PDF, Receipt, and E-commerce contain images with fine-grained and dense text.
Street-View is derived from natural scenes.
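The groupings above can be captured as a small lookup table, e.g. for stratified collection; the group labels here are descriptive names invented for this sketch, not terminology from the paper.

```python
# Image categories grouped by the property of their textual content
# (group names are descriptive labels for this sketch only).
IMAGE_CATEGORIES = {
    "statistical_text": ["Chart", "Table"],
    "text_visual_interaction": ["Slide", "Screenshot", "WebImage"],
    "fine_dense_text": ["Document/PDF", "Receipt", "E-commerce"],
    "natural_scene_text": ["Street-View"],
}

def category_group(category: str) -> str:
    """Return the property group a collected image category belongs to."""
    for group, categories in IMAGE_CATEGORIES.items():
        if category in categories:
            return group
    raise KeyError(f"Unknown category: {category}")
```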
Data Generation: Self-Questioning, Answering, and Reasoning
Stage 1: Self-Questioning. We ask Gemini Pro to first comprehensively analyze the image and then raise questions based on its understanding.
Since MLLMs typically have weaker understanding capabilities of the textual elements within images, we also add the text extracted by expert OCR models to the prompt as supplementary information.
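A minimal sketch of how the Stage 1 prompt might be assembled, assuming hypothetical callables `run_ocr` (the expert OCR model) and `query_mllm` (Gemini Pro taking a prompt and an image); the prompt wording and the question count are illustrative, not the paper's.

```python
from typing import Callable, List

def self_questioning(
    image_path: str,
    run_ocr: Callable[[str], str],          # hypothetical expert OCR model
    query_mllm: Callable[[str, str], str],  # hypothetical (prompt, image) -> text, e.g. Gemini Pro
    num_questions: int = 5,                 # illustrative value, not from the paper
) -> List[str]:
    """Stage 1: ask the MLLM to analyze the image and raise questions."""
    ocr_text = run_ocr(image_path)  # compensate for weaker understanding of in-image text
    prompt = (
        f"Extracted text from the image:\n{ocr_text}\n\n"
        "First analyze the image comprehensively, then raise "
        f"{num_questions} meaningful questions about its textual content."
    )
    response = query_mllm(prompt, image_path)
    return [line.strip() for line in response.splitlines() if line.strip()]
```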
Stage 2: Answering. Gemini Pro is then instructed to give appropriate answers to the generated questions.
Chain-of-Thought (CoT) and few-shot prompting are used to enrich the contextual information and improve the reliability of the answers (sketched together with Stage 3 below).
Stage 3: Reasoning. We require Gemini Pro to elaborate on the detailed reasoning behind its answers, forcing it to attend to the connections between the questions and the visual elements, thus reducing hallucinations and producing more accurate answers.
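Stages 2 and 3 can be sketched together as below, again with the hypothetical `query_mllm` callable; the CoT/few-shot prompt text is illustrative only.

```python
from typing import Callable, List, Tuple

def answer_and_reason(
    image_path: str,
    questions: List[str],
    query_mllm: Callable[[str, str], str],  # hypothetical (prompt, image) -> text
    few_shot_examples: str = "",            # optional exemplars for few-shot prompting
) -> List[Tuple[str, str, str]]:
    """Stages 2 and 3: answer each question, then elaborate the reasoning."""
    results = []
    for question in questions:
        # Stage 2: CoT-style answering, optionally preceded by few-shot exemplars.
        answer_prompt = (
            f"{few_shot_examples}"
            f"Question: {question}\n"
            "Think step by step about the relevant visual and textual elements, "
            "then give the final answer."
        )
        answer = query_mllm(answer_prompt, image_path)

        # Stage 3: ask for the detailed reasoning behind the answer.
        reason_prompt = (
            f"Question: {question}\nAnswer: {answer}\n"
            "Explain in detail which visual and textual elements support this answer."
        )
        reasoning = query_mllm(reason_prompt, image_path)
        results.append((question, answer, reasoning))
    return results
```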
Data Filtering: Self-Evaluation and Answering Consistency
Despite the effectiveness of the Square strategy, the generated image-text pairs may still contain hallucinated content, meaningless questions, and erroneous answers.
We thus devise filtering rules based on the evaluation capabilities of LLMs to select high-quality VQA pairs.
Self-Evaluation of MLLMs. We prompt Gemini Pro, as well as other advanced MLLMs, to judge the correctness of the generated question-answer pairs.
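A minimal sketch of the self-evaluation filter, assuming the same hypothetical `query_mllm` callable as a judge; the yes/no protocol is an assumption for illustration.

```python
from typing import Callable

def self_evaluate(
    image_path: str,
    question: str,
    answer: str,
    query_mllm: Callable[[str, str], str],  # hypothetical judge (Gemini Pro or another MLLM)
) -> bool:
    """Keep a VQA pair only if the judging MLLM deems it meaningful and correct."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this question meaningful for the image, and is the answer correct? "
        "Reply with 'yes' or 'no'."
    )
    verdict = query_mllm(prompt, image_path).strip().lower()
    return verdict.startswith("yes")
```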
Multi-Prompt Consistency. We provide Gemini Pro with different but semantically similar prompts to answer the given question, and discard the VQA pairs whose generated answers are not semantically stable (sketched together with the multi-context check below).
Multi-Context Consistency. We further validate the VQA pairs by prepending varied context information to the question: (1) answering with reasoning, (2) in-context answering, and (3) naive answering without extra context; pairs whose answers are inconsistent across contexts are discarded.
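The two consistency checks can be sketched as follows; the paraphrased prompts, the context strings, and the exact-match consistency test (standing in for a real semantic-similarity measure) are all assumptions of this sketch.

```python
from typing import Callable, List

def _consistent(answers: List[str]) -> bool:
    """Trivial stand-in for a semantic-consistency check (e.g. embedding similarity)."""
    normalized = {a.strip().lower() for a in answers}
    return len(normalized) == 1

def passes_consistency_checks(
    image_path: str,
    question: str,
    query_mllm: Callable[[str, str], str],  # hypothetical (prompt, image) -> text
    paraphrased_prompts: List[str],         # different but semantically similar phrasings
) -> bool:
    """Multi-prompt and multi-context consistency filtering."""
    # Multi-prompt: the same question under paraphrased instructions.
    multi_prompt_answers = [
        query_mllm(f"{p}\nQuestion: {question}", image_path) for p in paraphrased_prompts
    ]

    # Multi-context: (1) with reasoning, (2) in-context, (3) naive answering.
    contexts = [
        "Reason step by step before answering.",            # answering with reasoning
        "Here is a related example Q/A pair as context.",   # in-context answering (exemplars omitted)
        "",                                                  # naive answering
    ]
    multi_context_answers = [
        query_mllm(f"{c}\nQuestion: {question}".strip(), image_path) for c in contexts
    ]

    return _consistent(multi_prompt_answers) and _consistent(multi_context_answers)
```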
TextSquare
Model Architecture: TextSquare follows the paradigm of InternLM-XComposer2 and consists of three components (composed as in the sketch after this list).
Vision Encoder: modified from OpenAI CLIP ViT-L/14-336, with the input resolution increased to 700 for improved performance.
LLM: based on InternLM-2, using InternLM2-7B-ChatSFT as the practical variant.
Projector: semantically aligns the vision tokens with the text tokens.
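A high-level sketch of how the three components compose at the forward pass; the linear projector and the token concatenation are simplifications for illustration, not the exact InternLM-XComposer2 design.

```python
import torch
import torch.nn as nn

class TextSquareSketch(nn.Module):
    """Illustrative composition: vision encoder -> projector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. CLIP ViT-L/14 adapted to 700px input
        self.projector = nn.Linear(vision_dim, llm_dim)   # aligns vision tokens to the text space
        self.llm = llm                                    # e.g. an InternLM-2 based decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.vision_encoder(pixel_values)         # (B, N_img, vision_dim)
        vision_tokens = self.projector(vision_tokens)             # (B, N_img, llm_dim)
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)   # prepend image tokens to text tokens
        return self.llm(inputs)
```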
Supervised Fine-Tuning (SFT) with Square-10M
In the first stage, we unfreeze all three components (i.e., the Vision Encoder, the LLM, and the Projector) and train the model at a resolution of 490.
In the second stage, the input resolution is increased to 700 and only the Vision Encoder is trained to adapt to the resolution change.
In the third stage, we further perform full-parameter fine-tuning at the resolution of 700 (the schedule is sketched below). TextSquare demonstrates that with our Square-10M dataset, a model with 8B parameters and normal-size image resolution can achieve extraordinary performance on text-centric VQA, surpassing most available MLLMs and even closed-source SOTA models.
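The three-stage schedule can be summarized as a small config plus a freeze/unfreeze helper; the component attribute names follow the architecture sketch above, and the logic is a simplified illustration rather than the actual training code.

```python
# Three-stage SFT schedule (resolutions from the text; freeze/unfreeze logic simplified).
SFT_STAGES = [
    {"resolution": 490, "trainable": ["vision_encoder", "projector", "llm"]},  # stage 1: full model
    {"resolution": 700, "trainable": ["vision_encoder"]},                      # stage 2: adapt ViT to 700px
    {"resolution": 700, "trainable": ["vision_encoder", "projector", "llm"]},  # stage 3: full fine-tuning
]

def configure_stage(model, stage: dict) -> None:
    """Freeze everything, then unfreeze only the components trained in this stage."""
    for param in model.parameters():
        param.requires_grad = False
    for name in stage["trainable"]:
        for param in getattr(model, name).parameters():
            param.requires_grad = True
```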