The composition of the visual instruction tuning datasets

haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

https://llava.hliu.cc

Apache License 2.0

The composition of the visual instruction tuning datasets #702

Closed: Caizifen closed this issue 1 year ago

Caizifen commented 1 year ago

Question

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2

The above is the directory structure of the fine-tuning dataset given in the README. After downloading the data as the README describes, I count only about 608k items in total, not 665k. Did I miss anything?

| coco | gqa | ocr_vqa | textvqa | VG_100K | VG_100K_2 | Total |
| --- | --- | --- | --- | --- | --- | --- |
| 118287 | 148854 | 207572 | 25119 | 64346 | 43903 | 608081 |
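
A per-folder count like the one above can be reproduced with a short script. This is a minimal sketch, assuming the images sit under a local data root laid out as in the directory tree above and that every regular file in each folder is an image. The `./playground/data` path and the `count_files` helper are hypothetical placeholders, not something stated in this thread.

```python
from pathlib import Path

# Image folders from the directory tree above, relative to the data root.
IMAGE_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def count_files(root, subdirs=IMAGE_DIRS):
    """Return {subdir: number of regular files directly inside root/subdir}."""
    counts = {}
    for sub in subdirs:
        folder = Path(root) / sub
        if folder.is_dir():
            counts[sub] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            # A missing folder counts as 0 so gaps in the download stand out.
            counts[sub] = 0
    return counts


if __name__ == "__main__":
    # Hypothetical data root; change it to wherever the images were downloaded.
    counts = count_files("./playground/data")
    for sub, n in counts.items():
        print(f"{sub}\t{n}")
    print(f"Total\t{sum(counts.values())}")
```

Note that this counts files on disk, which is not necessarily the same quantity as the number of training samples.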

jiaxiangc commented 1 year ago

How can I download ocr_vqa? And what is pdb?

haotian-liu commented 1 year ago

Apologies for the confusion. I just recounted the exact number of samples in the dataset mixture, and we will update the paper to correct the sample counts for RefCOCO and A-OKVQA. Note that the released dataset is correct; only the numbers reported in the paper's table are off for these two datasets.

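| Dataset | Actual | Paper |
| --- | --- | --- |
| LLaVA | 157712 | 158K |
| SG40k | 40688 | 40K |
| VQA-v2 | 82783 | 83K |
| GQA | 72140 | 72K |
| OKVQA | 8998 | 9K |
| OCRVQA | 80000 | 80K |
| A-OKVQA | 66160 | ~~50K~~ 66K |
| TextCaps | 21953 | 22K |
| RefCOCO | 48447 | ~~30K~~ 48K |
| VG | 86417 | 86K |
| Total | 665298 | 665K |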
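
The arithmetic in this table can be checked directly. The sketch below uses only the numbers from the table; the variable names are illustrative. It sums the "Actual" column and compares the per-dataset paper figures (in thousands) before and after the two corrections.

```python
# Per-dataset sample counts copied from the "Actual" column of the table.
actual = {
    "LLaVA": 157712,
    "SG40k": 40688,
    "VQA-v2": 82783,
    "GQA": 72140,
    "OKVQA": 8998,
    "OCRVQA": 80000,
    "A-OKVQA": 66160,
    "TextCaps": 21953,
    "RefCOCO": 48447,
    "VG": 86417,
}

# "Paper" column in thousands, using the struck-through (old) values.
paper_before_k = {
    "LLaVA": 158, "SG40k": 40, "VQA-v2": 83, "GQA": 72, "OKVQA": 9,
    "OCRVQA": 80, "A-OKVQA": 50, "TextCaps": 22, "RefCOCO": 30, "VG": 86,
}
# Same column with the two corrected values applied.
paper_after_k = {**paper_before_k, "A-OKVQA": 66, "RefCOCO": 48}

total_actual = sum(actual.values())
total_before_k = sum(paper_before_k.values())
total_after_k = sum(paper_after_k.values())

print(total_actual)    # 665298
print(total_before_k)  # 630
print(total_after_k)   # 664
```

The old per-dataset figures add up to 630K, well short of the stated 665K total, while the corrected ones add up to 664K; the remaining 1K gap comes from rounding each row to the nearest thousand.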

Cooperx521 commented 8 months ago

@Caizifen Hello, I'm also confused about the difference between the README and the table posted by @haotian-liu. Have you figured it out? Does the data listed in the README cover all the datasets in the table?
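
One way to relate the image folders from the README to the per-dataset sample counts is to group annotation records by the top-level folder of their image path. This is a minimal sketch under stated assumptions: that the annotations are a JSON list of records with an optional `image` field holding a relative path such as `coco/train2017/...`. That schema, the `no_image` label, and both helper names are assumptions for illustration and are not confirmed anywhere in this thread.

```python
import json
from collections import Counter


def samples_per_source(records):
    """Count records by the first component of their 'image' path.

    Assumes each record is a dict. Records without an 'image' value are
    grouped under 'no_image' (an assumed label for image-free samples).
    """
    counts = Counter()
    for rec in records:
        image = rec.get("image")
        source = image.split("/", 1)[0] if image else "no_image"
        counts[source] += 1
    return counts


def load_records(path):
    """Load a JSON file that is assumed to contain a list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Tiny in-memory example with made-up records (not real dataset entries).
example = [
    {"id": "1", "image": "coco/train2017/000000000001.jpg"},
    {"id": "2", "image": "coco/train2017/000000000001.jpg"},
    {"id": "3", "image": "vg/VG_100K/1.jpg"},
    {"id": "4"},
]
print(samples_per_source(example))
```

In the made-up example, two records share one image and one record has no image, so under these assumptions the number of samples and the number of image files on disk need not match.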

421zuoduan commented 6 months ago

> How can I download ocr_vqa? And what is pdb?

Hi, have you solved this problem? I have run into the same issue.