[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
Thank you for your interest in our work, and apologies for the late response. We use the validation set for MSVD and MSRVTT and the test set for TGIF. A copy of the annotation files has been attached via the links.
For the results in the few-shot table, which file did you use for MSVD-QA and MSRVTT-QA evaluation: test_qa or val_qa?
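The split choice from the reply (validation for MSVD and MSRVTT, test for TGIF) can be recorded as a small lookup so an evaluation script always picks the matching annotation file. This is a minimal sketch: the directory layout, the `annotation_file` helper, and the `{split}_qa.json` naming are hypothetical assumptions for illustration, not the repository's actual paths.

```python
# Hypothetical sketch: evaluation split per benchmark, following the reply
# (val for MSVD-QA and MSRVTT-QA, test for TGIF-QA).
# The paths and file names below are assumptions, not the repo's real layout.
from pathlib import Path

EVAL_SPLIT = {
    "MSVD-QA": "val",
    "MSRVTT-QA": "val",
    "TGIF-QA": "test",
}


def annotation_file(dataset: str, root: str = "annotations") -> Path:
    """Return the (hypothetical) QA annotation path used for evaluation."""
    try:
        split = EVAL_SPLIT[dataset]
    except KeyError:
        raise ValueError(f"unknown dataset: {dataset}") from None
    return Path(root) / dataset / f"{split}_qa.json"
```

Keeping the mapping in one place makes it harder to accidentally evaluate MSVD-QA or MSRVTT-QA on a different split than the one the reported numbers used.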