I noticed a large performance boost of Video-LaVIT over LaVIT on benchmarks such as VQAv2 (from 66.0 to 80.2), GQA (from 46.8 to 63.6), and VizWiz (from 38.5 to 54.0).
But there seems to be no explanation in the Video-LaVIT paper regarding this. (Sorry if I accidentally missed this part.)
Could you please explain how you achieved this performance boost? Thanks in advance.
For LaVIT, we report zero-shot performance.
For Video-LaVIT, we report performance after supervised fine-tuning (SFT), using the same instruction dataset and base model as LLaVA-1.5.
Thanks for your great work.