mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International

Why is OpenAI used for Activitynet-QA Evaluation? #83

Closed mcouts01 closed 7 months ago

mcouts01 commented 10 months ago

I noticed that the evaluation file for the ActivityNet-QA dataset in this repository uses OpenAI's ChatGPT to generate the quantitative scores. This differs from the example provided in the official ActivityNet-QA repository.

Why was this done? Thanks

mmaaz60 commented 10 months ago

Hi @mcouts01

Thank you for your interest in our work.

The key distinction between our proposed approach and the standard ActivityNet evaluation lies in the nature of the responses generated by Video-ChatGPT. As an LLM-powered video-conversational model, Video-ChatGPT produces human-like, free-form textual responses. Because of their unstructured nature, these responses cannot be directly compared with the structured ground-truth answers typically used in ActivityNet.

To address this, we use OpenAI's GPT-3.5 for evaluation. It compares Video-ChatGPT's responses with the ground truth and rates them in a more qualitative manner. This aligns better with the conversational, free-flowing format of Video-ChatGPT's responses and provides a more realistic assessment of its performance in real-world scenarios.
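To make the idea concrete, here is a minimal sketch of GPT-based judging, assuming a simple JSON verdict format; the prompt wording, JSON fields, and model name are illustrative assumptions, not the repository's exact evaluation script:

```python
# Sketch of GPT-as-judge evaluation for free-form answers.
# The prompt and verdict schema below are assumptions for illustration.
import json

def build_judge_prompt(question, ground_truth, prediction):
    """Ask the judge model to compare a free-form prediction with the ground truth."""
    return (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Predicted answer: {prediction}\n"
        'Reply with a JSON object like {"pred": "yes", "score": 4}, where '
        '"pred" says whether the prediction matches and "score" is 0-5.'
    )

def parse_judge_reply(reply):
    """Parse the judge's JSON verdict into (is_correct, score)."""
    verdict = json.loads(reply)
    return verdict["pred"].lower() == "yes", int(verdict["score"])

# With the OpenAI client, the call would look roughly like this (assumption):
# completion = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": build_judge_prompt(q, gt, pred)}],
# )
# is_correct, score = parse_judge_reply(completion.choices[0].message.content)
```

Accuracy is then the fraction of "yes" verdicts, and the average score gives a finer-grained rating of answer quality.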

I hope this clarifies your concern. Thanks

mcouts01 commented 10 months ago

Thank you for the response!

This approach is new to me. It seems the scores would be subject to the understanding and responses of GPT-3.5.

Have you noticed any changes in the evaluation scores throughout multiple evaluations of the same version of the model?

mmaaz60 commented 10 months ago

Hi @mcouts01,

Apologies for the delayed response. You are right, there may be minor variations in the numbers, mainly due to the randomness of generation inherent in the GPT-based subjective comparison. Because of this, I would highly recommend reporting the standard deviation along with the numbers.
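Reporting the standard deviation over repeated runs can be done in a few lines; the accuracy values below are made-up numbers for illustration:

```python
# Summarize accuracy across repeated GPT-based evaluations of the same model.
from statistics import mean, stdev

def report_accuracy(runs):
    """Return (mean, sample standard deviation) over repeated evaluation runs."""
    return mean(runs), stdev(runs)

# Hypothetical accuracies from three evaluation runs of the same checkpoint:
runs = [64.9, 65.3, 65.1]
mu, sigma = report_accuracy(runs)
print(f"{mu:.1f} ± {sigma:.1f}")  # → 65.1 ± 0.2
```

Running the judge a few times and quoting mean ± std makes the residual randomness of the GPT-based comparison explicit in the reported results.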