haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Some Questions about the datasets used and the model. #326

duchenzhuang closed this issue 1 year ago

duchenzhuang commented 1 year ago

Question

1. In my understanding, the first pretraining stage uses either the CC-3M Concept-balanced 595K dataset or the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset. The second stage uses the coco-2017-train and llava_instruct_150k.json datasets. Is that correct?

2. Could you advise which dataset would be better to use for the first stage: CC-3M Concept-balanced 595K or LAION/CC/SBU BLIP-Caption Concept-balanced 558K?

3. Also, what's the difference between llava_instruct_150k.json and llava_instruct_70k.json in the second stage? It seems that the sizes of these two files are the same.

4. Furthermore, in the model_zoo, why is LLaMA-2-13B-Chat noticeably worse than Vicuna-13B-v1.3? What do you think might have caused this?

5. My last question: what are the files conversation_58k.json, detail_23k.json, and complex_reasoning_77k.json used for? It seems that neither the first nor the second stage of training requires them?

haotian-liu commented 1 year ago

Hi, thank you for your interest in our work, and these are great questions.

  1. Correct.
  2. In our initial release with the paper, we use CC-595K. In our later experiments (especially the Lightning checkpoints), we switched to LCS-558K. Although both are capable of alignment and of recognizing concepts beyond the training data, we empirically verify on LLaVA Bench that LCS-558K is slightly better.
  3. The 150K set is used for our paper, and the 80K set is used for the Lightning checkpoints for faster convergence. You can check out the documentation here.
  4. The main reason is the image resolution: as you can see, vicuna-13b-v1.3 uses 336px, while llama-2-13b-chat currently uses 224px. We'll release both resolutions for both base LLMs very soon. You can expect the performance at the same resolution to be similar, with some behavioral differences (a short sketch after this list illustrates how resolution affects the number of visual tokens).
  5. They are the three types of instructions that make up LLaVA-Instruct-158K (add these numbers together and you get 158K). We release all three subsets for complete transparency, so people can create their own subsets according to their specific needs; a minimal merging sketch follows after this list. You can refer to our paper for more details as well.
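
As a rough illustration of point 4 (this is not code from the LLaVA repo): with a ViT-L/14 vision tower such as CLIP, a higher input resolution produces more image patches and therefore more visual tokens for the LLM to attend to, which is one concrete way the 336px and 224px setups differ.

```python
# Rough sketch, not from the LLaVA codebase: how input resolution changes the
# number of visual tokens produced by a ViT-L/14 vision tower (patch size 14).
def num_visual_tokens(image_size: int, patch_size: int = 14) -> int:
    """Patch tokens a ViT produces for a square image (excluding the [CLS] token)."""
    return (image_size // patch_size) ** 2

print(num_visual_tokens(224))  # 256 tokens, e.g. openai/clip-vit-large-patch14
print(num_visual_tokens(336))  # 576 tokens, e.g. openai/clip-vit-large-patch14-336
```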
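
For point 5, here is a minimal sketch (not an official LLaVA script) of stitching the three released subsets back into the full 158K instruction mixture, assuming each subset file is a JSON list of samples as the llava_instruct files are; the output filename is just a placeholder.

```python
# Minimal sketch (not an official LLaVA script): merge the three instruction
# subsets into the full 158K mixture. The output filename is a placeholder.
import json

subset_files = [
    "conversation_58k.json",
    "detail_23k.json",
    "complex_reasoning_77k.json",
]

merged = []
for path in subset_files:
    with open(path) as f:
        merged.extend(json.load(f))  # assumes each file is a JSON list of samples

print(f"{len(merged)} samples")  # 58K + 23K + 77K = ~158K

with open("llava_instruct_158k_merged.json", "w") as f:
    json.dump(merged, f)
```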