haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

The composition of the visual instruction tuning datasets #702

Closed: Caizifen closed this issue 1 year ago

Caizifen commented 1 year ago

Question

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2

The above is the structure of the fine-tuning dataset given in the README. After downloading the data as the README describes, I count only 608k images in total, not 665k. Did I miss anything?

| coco | gqa | ocr_vqa | textvqa | VG_100K | VG_100K_2 | Total |
| --- | --- | --- | --- | --- | --- | --- |
| 118287 | 148854 | 207572 | 25119 | 64346 | 43903 | 608081 |

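
For anyone who wants to reproduce per-folder counts like the ones above, here is a minimal sketch. It assumes the folder layout from the tree in this question; the root path `./playground/data` and the helper name `count_files` are assumptions for illustration, not something this thread specifies.

```python
from pathlib import Path

# Folder layout taken from the tree in the question; paths are relative to the dataset root.
IMAGE_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def count_files(root, subdirs=IMAGE_DIRS):
    """Return {subdir: number of regular files directly inside that folder}."""
    root = Path(root)
    counts = {}
    for sub in subdirs:
        folder = root / sub
        if folder.is_dir():
            counts[sub] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            # A missing folder is reported as 0 instead of raising an error.
            counts[sub] = 0
    return counts


if __name__ == "__main__":
    # "./playground/data" is an assumed location; adjust it to your own setup.
    counts = count_files("./playground/data")
    for name, n in counts.items():
        print(f"{name:24s} {n:>8d}")
    print(f"{'Total':24s} {sum(counts.values()):>8d}")
```

Note that this counts files on disk, which is not necessarily the same quantity as the number of training samples.
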
jiaxiangc commented 1 year ago

How do I download ocr_vqa? What is pdb?

haotian-liu commented 1 year ago

Apologies for the confusion. I just re-calculated the exact sample counts in the dataset mixture, and we will update the paper to correct the counts for RefCOCO and A-OKVQA. Note that the released dataset is correct; only the numbers reported in the paper's table are off for these two datasets.

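| Dataset | Actual | Paper |
| --- | --- | --- |
| LLaVA | 157712 | 158K |
| SG40k | 40688 | 40K |
| VQA-v2 | 82783 | 83K |
| GQA | 72140 | 72K |
| OKVQA | 8998 | 9K |
| OCRVQA | 80000 | 80K |
| A-OKVQA | 66160 | 50K (corrected to 66K) |
| TextCaps | 21953 | 22K |
| RefCOCO | 48447 | 30K (corrected to 48K) |
| VG | 86417 | 86K |
| Total | 665298 | 665K |
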
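
As a quick arithmetic check on the table above, the sketch below sums the per-dataset counts. The numbers are copied from the table; the variable names are only illustrative.

```python
# Per-source sample counts, copied from the "Actual" column of the table above.
ACTUAL = {
    "LLaVA": 157712,
    "SG40k": 40688,
    "VQA-v2": 82783,
    "GQA": 72140,
    "OKVQA": 8998,
    "OCRVQA": 80000,
    "A-OKVQA": 66160,
    "TextCaps": 21953,
    "RefCOCO": 48447,
    "VG": 86417,
}

# "Paper" column in thousands, before and after the two corrections.
PAPER_ORIGINAL_K = [158, 40, 83, 72, 9, 80, 50, 22, 30, 86]
PAPER_CORRECTED_K = [158, 40, 83, 72, 9, 80, 66, 22, 48, 86]

actual_total = sum(ACTUAL.values())
original_total_k = sum(PAPER_ORIGINAL_K)
corrected_total_k = sum(PAPER_CORRECTED_K)

print(actual_total)       # 665298
print(original_total_k)   # 630
print(corrected_total_k)  # 664
```

The exact counts add up to the stated total of 665298. The originally reported per-dataset figures add up to only 630K, while the corrected ones give 664K, which is within rounding of 665K.
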
Cooperx521 commented 8 months ago

@Caizifen Hello, I'm also confused about the difference between the README and the table posted by @haotian-liu. Have you figured it out? Does the data listed in the README contain all the data in the table?
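
One way to check this yourself is to count samples in the released annotation file rather than image files on disk. The two can differ: several samples may point to the same image, and a sample may have no image at all. The sketch below is a hedged example. It assumes the annotation file is a JSON list of records, that each record may carry an `image` field holding a relative path such as `coco/train2017/<name>.jpg`, and that the file is named `llava_v1_5_mix665k.json`. These are assumptions about the released data, not something confirmed in this thread, and `summarize` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize(samples):
    """Count samples per top-level image folder and the number of distinct images.

    Samples without an "image" field are counted under "(no image)".
    """
    per_folder = Counter()
    unique_images = set()
    for sample in samples:
        image = sample.get("image")
        if image is None:
            per_folder["(no image)"] += 1
            continue
        # The first path component is treated as the source folder, e.g. "coco".
        per_folder[image.split("/", 1)[0]] += 1
        unique_images.add(image)
    return per_folder, len(unique_images)


if __name__ == "__main__":
    # Assumed file name and location; adjust to wherever you saved the annotations.
    with open("llava_v1_5_mix665k.json") as f:
        samples = json.load(f)
    per_folder, n_unique = summarize(samples)
    print("samples:", len(samples))
    print("distinct images referenced:", n_unique)
    for folder, n in per_folder.most_common():
        print(f"{folder:12s} {n}")
```

If every folder that shows up in this summary exists in your download, and the sample count matches the table's total, the README data covers the mixture even though the number of image files on disk is a different figure.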

421zuoduan commented 6 months ago

> How to download ocr_vqa? what is pdb?

Hi, have you solved this problem? I have run into the same issue.