Hi @mcouts01
Thank you for your interest in our work.
The key distinction between our proposed approach and the standard ActivityNet evaluation lies in the nature of the responses generated by Video-ChatGPT. As an LLM-powered video-conversational model, Video-ChatGPT produces human-like, free-form textual responses. Because of their unstructured nature, these responses cannot be directly compared with the structured ground-truth answers typically used in ActivityNet.
To address this, we use OpenAI's GPT-3.5 for evaluation: it compares each Video-ChatGPT response with the ground truth and rates it in a more qualitative manner. This aligns better with the conversational, free-flowing format of Video-ChatGPT's responses and provides a more realistic assessment of its performance in real-world scenarios.
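As a rough illustration, a GPT-based comparison of this kind could look like the sketch below. This is not the exact evaluation script in this repository; the prompt wording, the `judge_answer` helper, and the JSON output format are illustrative assumptions.

```python
# Illustrative sketch of GPT-as-judge evaluation (not the repository's exact script).
# Assumes the official `openai` Python client (>= 1.0) and OPENAI_API_KEY set in the environment.
import json
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, ground_truth: str, prediction: str) -> dict:
    """Ask GPT-3.5 whether a free-form prediction matches the ground truth,
    returning a yes/no verdict and a 0-5 quality score."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # reduce run-to-run randomness of the judge
        messages=[
            {
                "role": "system",
                "content": (
                    "You evaluate video question answering. Compare the predicted "
                    "answer with the ground truth and reply only with a JSON object "
                    'such as {"pred": "yes", "score": 4}.'
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Ground truth: {ground_truth}\n"
                    f"Predicted answer: {prediction}"
                ),
            },
        ],
    )
    # A real pipeline would add error handling in case the judge returns invalid JSON.
    return json.loads(response.choices[0].message.content)

# Example usage:
# judge_answer("What is the person doing?", "playing guitar",
#              "The man in the video is strumming an acoustic guitar.")
```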
I hope this clarifies your concern. Thanks
Thank you for the response!
This approach is new to me. It seems as though the scores would be subject to the interpretation and responses of GPT-3.5.
Have you noticed any changes in the evaluation scores throughout multiple evaluations of the same version of the model?
Hi @mcouts01,
Apologies for the delayed response. You are right, there may be minor variations in the numbers, but this is mainly due to the randomness of generation in the GPT-based subjective comparison, and it can be minimized. Following this, I would highly recommend reporting the standard deviation along with the numbers.
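As a minimal sketch of what reporting a standard deviation over repeated evaluations could look like (`run_evaluation` below is a hypothetical placeholder, not the actual evaluation script, and the returned score is simulated purely for illustration):

```python
# Sketch of aggregating scores over repeated GPT-based evaluation runs.
import random
from statistics import mean, stdev

def run_evaluation(seed: int) -> float:
    """Placeholder for one full GPT-based evaluation pass over ActivityNet-QA.
    In practice this would call the GPT-3.5 judge on every question and return
    the overall accuracy; here we only simulate small run-to-run variation."""
    random.seed(seed)
    return 50.0 + random.uniform(-0.5, 0.5)  # arbitrary baseline, illustration only

num_runs = 3
scores = [run_evaluation(seed=i) for i in range(num_runs)]
print(f"accuracy: {mean(scores):.2f} ± {stdev(scores):.2f} over {num_runs} runs")
```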
I noticed that the evaluation file for the activitynet-qa dataset presented in this repository utilizes OpenAI and ChatGPT to generate the quantitative scores. This is different from the example provided in the official activitynet-qa repository.
Why was this done? Thanks