Question about consistency evaluation

mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

https://mbzuai-oryx.github.io/Video-ChatGPT

Creative Commons Attribution 4.0 International

1.23k stars 108 forks source link

Question about consistency evaluation #94

Closed Eniac-Xie closed 8 months ago

Eniac-Xie commented 8 months ago

Hi, thanks for your great work!

I have a question about the consistency evaluation, for the code here: https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/quantitative_evaluation/evaluate_benchmark_5_consistency.py#L121 you assign question2 = sample['Q1'] instead of question2 = sample['Q2'] why?

mmaaz60 commented 8 months ago

Hi @Eniac-Xie,

Thank you for pointing out, it is typo that happen during the code release. I have now updated it. Thanks

Eniac-Xie commented 8 months ago

@mmaaz60 Thank you!

Another question: I find the scores provided by ChatGPT is not stable, i.e., when I evaluate the same QA results (only 10 QA pairs) twice, I get different scores? I think it is due to the sampling diversity of ChatGPT. So how can we obtain a stable score? By introducing more testing data?

mmaaz60 commented 8 months ago

Hi @Eniac-Xie,

Yes, adding more data might help, further I would recommend to run the evaluation multiple times and report the average stats.

Further, you can set the temperature to zero and additionally set the seed as well to a constant value. This will definitely help in achieving consistent results across runs. Please have a look at the docs at https://platform.openai.com/docs/api-reference/chat/create for details.