paperswithlove / papers-we-read

LLaVA-o1: Let Vision Language Models Reason Step-by-Step #53

Open JihoonJ opened 3 days ago

JihoonJ commented 3 days ago

Introduction

Links

Project Page
- https://github.com/PKU-YuanGroup/LLaVA-o1
Paper
- https://arxiv.org/pdf/2411.10440

Summary and Insight

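A novel design for autonomous multistage reasoning, as opposed to plain CoT.
Releases a dataset for this purpose (LLaVA-o1-100k) and proposes an inference-time scaling technique.
Time scaling (self-improvement) also helps when it is turned into a dataset and trained on explicitly.
Stage Beam Search, which applies self-evaluation to the answer candidates, is also effective.
Multimodal benchmark performance improved, but the only use of multimodality is asking for an image-related description in the Caption Stage.
- Feels DPO-like. Multimodality could probably be exploited further, but for now an approach similar to text-only LLMs seems to work without major problems.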

Contents

Enhancing Reasoning Capability through Structured Thinking

QA proceeds in four stages:
1. Summary Stage: What's the problem? What should I do?
  - A brief outline in which the model summarizes the forthcoming task.
2. Caption Stage: What can I know about the image?
  - A description of the relevant parts of an image (if present), focusing on elements related to the question.
3. Reasoning Stage: How to solve the problem step-by-step?
  - A detailed analysis in which the model systematically considers the question.
4. Conclusion Stage: What is the final answer?
  - A concise summary of the answer, providing a final response based on the preceding reasoning.
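
The four stages above can be illustrated with a small parser that splits a tagged response into its parts. This is only a sketch: the tag names (`<SUMMARY>`, `<CAPTION>`, `<REASONING>`, `<CONCLUSION>`) are assumed from the stage names in this note, and `parse_stages` and the example response are hypothetical, not taken from the LLaVA-o1 code.

```python
import re

# Assumed tag names, derived from the four stage names in the notes above.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def parse_stages(response: str) -> dict[str, str]:
    """Split a tagged response into its stages; missing stages are omitted."""
    parsed = {}
    for stage in STAGES:
        # Non-greedy match so each stage stops at its own closing tag.
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, flags=re.DOTALL)
        if match:
            parsed[stage] = match.group(1).strip()
    return parsed


# Hypothetical model output, written by hand for illustration.
example = (
    "<SUMMARY>Count the apples in the picture.</SUMMARY>"
    "<CAPTION>A table with two red apples and one green apple.</CAPTION>"
    "<REASONING>Two red plus one green gives three apples.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)

stages = parse_stages(example)
print(stages["CONCLUSION"])
```

Only the `CONCLUSION` entry would normally be shown to the user; the other three stages are intermediate reasoning.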

Effective Inference Time Scaling using Stage-level Beam Search

Proposes Stage-level Beam Search: generate several candidate responses at each stage, select the best one, and then move on to the next stage.
This gives better results than greedy search.
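
The control flow described above can be sketched roughly as follows. This is a simplified sketch, not the authors' implementation: `stage_beam_search`, `generate`, and `prefer` are hypothetical names, the model calls are replaced by toy stand-ins, and picking the winner through successive pairwise comparisons is an assumption about how the self-evaluation might be applied.

```python
from typing import Callable, Sequence


def stage_beam_search(
    question: str,
    stages: Sequence[str],
    generate: Callable[[str, str, int], list[str]],
    prefer: Callable[[str, str, str], str],
    num_candidates: int = 2,
) -> str:
    """Build a response stage by stage, keeping one winner per stage."""
    context = question
    for stage in stages:
        # Sample several candidate continuations for the current stage.
        candidates = generate(context, stage, num_candidates)
        # Self-evaluation (assumed form): compare pairwise, keep the preferred one.
        best = candidates[0]
        for challenger in candidates[1:]:
            best = prefer(context, best, challenger)
        # Commit the winner before moving on to the next stage.
        context = context + "\n" + best
    return context


# Toy stand-ins for the model calls (hypothetical, for illustration only).
def toy_generate(context: str, stage: str, n: int) -> list[str]:
    return [f"[{stage}] " + "x" * (i + 1) for i in range(n)]


def toy_prefer(context: str, a: str, b: str) -> str:
    # Pretend the longer candidate is the better one.
    return a if len(a) >= len(b) else b


result = stage_beam_search("Q", ["SUMMARY", "CAPTION"], toy_generate, toy_prefer, 3)
print(result)
```

In a real setup, `generate` would sample from the VLM and `prefer` would ask the same model which of two candidates is better. Raising `num_candidates` trades extra inference compute for a better chance of a good candidate at each stage.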

Evaluation

Overall assessment

Training on the LLaVA-o1-100k dataset, which supports multistage training, improves performance.
Giving each stage an explicit structure with tags improves performance.
Stage Beam Search improves performance (more beams work better).