
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
https://mbzuai-oryx.github.io/Video-LLaVA

Could you release the evaluation scripts with the Vicuna model early? #1

Open KerolosAtef opened 11 months ago

avinash31d commented 11 months ago

+1

shehanmunasinghe commented 10 months ago

@KerolosAtef @avinash31d , Thank you for your interest in our work. Please find the details about the Vicuna-based quantitative evaluation benchmark here: https://github.com/mbzuai-oryx/Video-LLaVA/tree/main/quantitative_evaluation.
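Below is a minimal, hypothetical sketch of what one step of such a Vicuna-based LLM-as-judge evaluation might look like, following the Video-ChatGPT-style protocol. The checkpoint name, prompt wording, and `judge_answer` helper are illustrative assumptions, not the repository's actual code; see the linked `quantitative_evaluation` directory for the real scripts.

```python
# Hypothetical sketch of one LLM-as-judge evaluation step with a Vicuna checkpoint.
# Model name, prompt, and helper are assumptions, not the repository's actual code.
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_MODEL = "lmsys/vicuna-7b-v1.5"  # assumed judge checkpoint

tokenizer = AutoTokenizer.from_pretrained(JUDGE_MODEL)
model = AutoModelForCausalLM.from_pretrained(JUDGE_MODEL, device_map="auto")

def judge_answer(question: str, ground_truth: str, prediction: str) -> str:
    """Ask the Vicuna judge whether the prediction matches the ground-truth answer."""
    prompt = (
        "Evaluate whether the predicted answer is correct.\n"
        f"Question: {question}\n"
        f"Correct answer: {ground_truth}\n"
        f"Predicted answer: {prediction}\n"
        "Reply with 'yes' or 'no' followed by a score from 0 to 5."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```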

KerolosAtef commented 10 months ago

Thank you very much. However, the Vicuna model doesn't produce the same results on each run.

I have tried to reproduce some of the Video-ChatGPT results, and these are the numbers I get:

- ActivityNet: Acc 36.13 instead of 40.8
- TGIF: Acc 63.07 instead of 66.5
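For reference, here is a minimal sketch of how accuracy numbers like these are typically derived in the Video-ChatGPT-style protocol, assuming each judged sample yields a yes/no verdict and a 0-5 rating. The field layout and `summarize` helper are illustrative assumptions, not the repository's code.

```python
# Sketch: accuracy is the fraction of samples the judge marks correct,
# and the reported score is the mean of the 0-5 ratings. Illustrative only.
from typing import Iterable, Tuple

def summarize(results: Iterable[Tuple[bool, float]]) -> Tuple[float, float]:
    """Return (accuracy in %, mean score) over judged (matched, score) pairs."""
    results = list(results)
    accuracy = 100.0 * sum(matched for matched, _ in results) / len(results)
    mean_score = sum(score for _, score in results) / len(results)
    return accuracy, mean_score

# Example: three samples, two judged correct -> roughly (66.7, 3.3).
print(summarize([(True, 4.0), (False, 1.0), (True, 5.0)]))
```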

shehanmunasinghe commented 10 months ago

@KerolosAtef We attribute this to the randomness introduced by the temperature parameter in both the tested model and the LLM used for evaluation. This will be addressed in our future work.
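One way to reduce this run-to-run variance, as a hedged sketch rather than the repository's recommended procedure, is to fix the random seeds and use greedy decoding so the temperature parameter no longer injects randomness. The parameter names below follow Hugging Face transformers; the actual evaluation scripts may expose these settings differently.

```python
# Sketch: seed the RNGs and prefer greedy decoding for more repeatable runs.
import random

import numpy as np
import torch

def set_deterministic(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs so repeated runs start from the same state."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_deterministic()

# Greedy decoding removes the sampling randomness that temperature introduces.
# With a Hugging Face model and tokenized inputs, that looks like:
#   output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
```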

KerolosAtef commented 10 months ago

Okay, good. I want to make sure of something: for the zero-shot datasets (MSVD, MSR-VTT, ActivityNet, TGIF), did you use the test data or the validation data?

shehanmunasinghe commented 10 months ago

We follow the same approach as Video-ChatGPT, i.e., we use the test splits.