Open Richar-Du opened 1 year ago
Same question. Just check Table 8 (no OK-VQA training data); I think the performance can be attributed to:
However, similar training data construction actually exists in the community. @zzhanghub @kq-chen Do you have any other intuitive ideas about this question?
BTW, I think the ablation study of the training data is also important. Thx.
Could the authors please answer this question :) @zzhanghub @kq-chen
Thanks for your awesome work! Shikra opens a way to effectively represent the coordinates in the image.
I have a question about the result in Table 6: the performance of Shikra on the OK-VQA dataset is quite surprising. Did you fine-tune Shikra on OK-VQA, or does the instruction-tuning data include OK-VQA?