Question regarding video question answering task (TGIF)

pkunlp-icler / FastV

[ECCV 2024] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Question regarding video question answering task (TGIF) #28

Open MarsJacobs opened 1 week ago

MarsJacobs commented 1 week ago

Hi, thank you for your excellent work! I noticed that in Table 6 the TGIF baseline accuracy (FLOPs = 100%) is much lower than the figure reported in the original Video-LLaVA paper:

Video-LLaVA reports TGIF accuracy / score: 70.0 / 4.0
FastV reports TGIF accuracy / score: 20.0 / 2.6

Could you please clarify why the baseline accuracy differs so much? I'd also like to know whether FastV uses the same inference code as the Video-LLaVA repository.

Thank you in advance!

chenllliang commented 2 days ago

Hi, we did use the same inference code as the Video-LLaVA repository. We'll look into it soon, probably after the ICLR deadline.