haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

I think this method looks like MM-CoT. What's the difference with it, other than the LLM used? #12

Closed dongxiaolong closed 1 year ago

dongxiaolong commented 1 year ago

Absolutely awesome work. First, thanks for the instruction-tuning data contributed to the community. I'm just confused about the question mentioned in the title.

152334H commented 1 year ago

The other important part is the synthetic GPT-4-based dataset.

ChunyuanLI commented 1 year ago

The fine-tuning on ScienceQA is similar to MM-CoT in terms of architecture and data organization. One major finding in MM-CoT is that prediction order matters: reason-then-answer, which is why it is called "CoT". In our study, we find that the "CoT" claim is not very important. See the evidence in our paper:

Chain-of-thoughts. To decide the order between the answer and reasoning process in the model prediction, we run both variants and observe that answer-first reports the best number 89.77% accuracy in 12 epochs, while reasoning-first can quickly reach 89.77% accuracy in 6 epochs, but no further improvement with more training (89.96%). Training the model for 24 epochs does not improve the performance. We conclude that CoT-like reasoning-first strategy can largely improve convergence speed, but contributes relatively little to the final performance.
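
To make the ordering difference above concrete, here is a minimal sketch of how answer-first and reasoning-first targets could be serialized for ScienceQA-style fine-tuning. This is not LLaVA's actual data code; the field names and response templates are assumptions for illustration.

```python
# Hypothetical helpers contrasting the two prediction orders discussed above.

def answer_first_target(sample: dict) -> str:
    # Answer-first: the model is trained to emit the answer, then the rationale.
    return f"The answer is {sample['answer']}. Because {sample['reasoning']}"

def reasoning_first_target(sample: dict) -> str:
    # Reasoning-first (CoT-style): rationale first, answer last.
    return f"{sample['reasoning']} So the answer is {sample['answer']}."

sample = {
    "question": "Which property do these objects have in common?",
    "reasoning": "All three objects are hard to bend or break.",  # made-up example
    "answer": "(B) rigid",
}
print(answer_first_target(sample))
print(reasoning_first_target(sample))
```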

Now, for both papers, most of the performance gain comes from the use of a vision encoder and end-to-end training on ScienceQA. This dataset is relatively small compared with VQA 2.0, so it is easy to reach high performance by training a large model on it. I hope readers note this when drawing conclusions. Further, there are implementation differences between the two papers that lead us to different conclusions: (1) the choice of LLM; (2) we have a pre-training stage to connect the two modalities, which yields a 5% improvement compared with training from scratch, while MM-CoT does not have this pre-training stage. We hope this can be re-considered in future development; a sketch of the two-stage idea follows.
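
For readers unfamiliar with the two-stage recipe mentioned in (2), here is a minimal PyTorch-style sketch. The placeholder modules and dimensions are assumptions, not LLaVA's actual implementation: stage 1 trains only the connector between a frozen vision encoder and a frozen LLM, and stage 2 unfreezes the LLM for end-to-end instruction tuning.

```python
import torch
import torch.nn as nn

vision_encoder = nn.Linear(1024, 1024)   # stand-in for a CLIP-like image tower
llm = nn.Linear(4096, 4096)              # stand-in for the language model
projection = nn.Linear(1024, 4096)       # connector between the two modalities

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: feature-alignment pre-training; only the projection is updated.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projection, True)
opt_stage1 = torch.optim.AdamW(projection.parameters(), lr=2e-3)

# Stage 2: instruction tuning; the projection and the LLM are both updated.
set_trainable(llm, True)
stage2_params = list(projection.parameters()) + list(llm.parameters())
opt_stage2 = torch.optim.AdamW(stage2_params, lr=2e-5)
```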


I'd like to clarify that ScienceQA helped us quantitatively ablate our design choices in the early stage of the project, but ScienceQA is not the single main focus of this project. We aim to help the community produce multimodal GPT-4-level capability with minimal effort: (1) a focus shift from model-centric to data-centric AI: the multimodal instruction-following data is the key, and most of our time is spent on it; (2) achieving multimodal chat with capabilities such as detailed description, OCR, and complex reasoning. The current demo has preliminary capabilities on this. We hope the community is inspired to scale up this approach to reach better performance.
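
As a hedged illustration of point (1), a single multimodal instruction-following sample could look roughly like the following. The schema, file path, and conversation content below are made up for illustration and are not the repo's exact format.

```python
# Illustrative sample: an image reference plus a GPT-4-style generated dialogue.
sample = {
    "image": "coco/train2017/EXAMPLE.jpg",  # hypothetical image path
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt",
         "value": "A man is ironing clothes on a board attached to the roof of "
                  "a moving taxi, which is an unusual place to iron."},
    ],
}
```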

152334H commented 1 year ago

In our study, we find that the "CoT" claim is not very important. See the evidence in our paper:

Wow, thanks for the work!