[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
I found many strings like "\nQA_GT_caption_based_noisy" in the JSON file. What are they? Are they just noisy text that hasn't been cleaned out?
"q": "Can you briefly describe the content of the video for me?",
"a": "Of course! The video features various shots of a city followed by a girl playing the violin. Another girl steps in and starts singing, which leads to several others joining in as well. A mime walks in with fireworks while the group still sings, and many around are seen laughing and smiling. Overall, the video features individuals playing the violin in different scenes and locations, making it quite enjoyable to watch.\nQA_GT_caption_based_noisy",
"video_id": "v_GqWH2IYPw6U"
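If the marker turns out to be a leftover tag rather than part of the intended answer, it could be stripped before training. Below is a minimal sketch, not the authors' preprocessing. It assumes the file is a list of records with the `q`, `a`, and `video_id` keys shown above, and that the marker is always a newline followed by a single `QA_`-prefixed token at the very end of the answer. The names `strip_trailing_tag` and `clean_records` are hypothetical.

```python
import re

# Assumption: the marker is a newline plus one "QA_..." token at the end
# of the answer, as in "\nQA_GT_caption_based_noisy".
TRAILING_TAG = re.compile(r"\s*\nQA_[A-Za-z0-9_]+\s*$")


def strip_trailing_tag(answer: str) -> str:
    """Return the answer without a trailing QA_* marker, if one is present."""
    return TRAILING_TAG.sub("", answer)


def clean_records(records):
    """Return copies of the records with the marker removed from each "a" field."""
    return [{**rec, "a": strip_trailing_tag(rec["a"])} for rec in records]


if __name__ == "__main__":
    sample = [{
        "q": "Can you briefly describe the content of the video for me?",
        "a": "Of course! The video features various shots of a city."
             "\nQA_GT_caption_based_noisy",
        "video_id": "v_GqWH2IYPw6U",
    }]
    print(clean_records(sample)[0]["a"])
```

Answers without the marker pass through unchanged, so the function can be applied to the whole file.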