Closed · Eniac-Xie closed this 8 months ago
Hi @Eniac-Xie,
Thank you for pointing this out. It is a typo that happened during the code release. I have now updated it. Thanks!
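For reference, a minimal sketch of the corrected assignment (the `sample` dict below is illustrative; the consistency benchmark compares answers to two paraphrased questions, so the second question must come from the `Q2` key):

```python
# Illustrative stand-in for one entry of the consistency benchmark data.
sample = {"Q1": "first phrasing", "Q2": "second phrasing"}

question1 = sample["Q1"]
question2 = sample["Q2"]  # previously mistyped as sample["Q1"]

print(question2)  # → second phrasing
```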
@mmaaz60 Thank you!
Another question: I find that the scores provided by ChatGPT are not stable, i.e., when I evaluate the same QA results (only 10 QA pairs) twice, I get different scores. I think this is due to the sampling diversity of ChatGPT. So how can we obtain a stable score? By introducing more testing data?
Hi @Eniac-Xie,
Yes, adding more data might help. Further, I would recommend running the evaluation multiple times and reporting the average stats.
Additionally, you can set the temperature to zero and set the seed to a constant value; this will help in achieving consistent results across runs. Please have a look at the docs at https://platform.openai.com/docs/api-reference/chat/create for details.
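The multiple-runs-and-average suggestion can be sketched as below. The `run_fn` callable is a hypothetical stand-in for one full evaluation pass (e.g. one that queries ChatGPT with `temperature=0` and a fixed `seed`); it is not part of the repo:

```python
import statistics

def stable_eval_stats(run_fn, n_runs=5):
    """Run the ChatGPT evaluation several times and aggregate.

    `run_fn` is a hypothetical callable that performs one full
    evaluation pass (ideally with temperature=0 and a constant seed)
    and returns the list of per-QA scores for that run.
    """
    means = [statistics.mean(run_fn()) for _ in range(n_runs)]
    # Report the average score and its spread across runs; a small
    # spread indicates the evaluation has stabilized.
    return statistics.mean(means), statistics.pstdev(means)

# Example with a deterministic stand-in for the real evaluation:
avg, spread = stable_eval_stats(lambda: [3.0, 4.0, 5.0], n_runs=3)
print(avg, spread)  # → 4.0 0.0
```

With a real (stochastic) scorer, the reported spread quantifies exactly the run-to-run instability described above.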
Hi, thanks for your great work!
I have a question about the consistency evaluation. In the code here: https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/quantitative_evaluation/evaluate_benchmark_5_consistency.py#L121 you assign
question2 = sample['Q1']
instead of
question2 = sample['Q2']
Why is that?