Hi, thanks for your great project!
I am wondering how many training dataset instances you used for datasets such as COCO, OCR-VQA, and A-OKVQA. Did you just transform the original datasets with the template, so the numbers stay consistent with the original datasets?
I see the paper mentions that 5k COCO caption-image pairs and 512 OCR-VQA and A-OKVQA pairs are used.
So if I am correct, excluding the LLaVA and MiniGPT-4 data, there are about 6k instances?