[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
Hi @mmaaz60,
I am wondering why it is critical to include the video start tag in one human conversation input and the video end tag in the subsequent human input. I am currently creating my own dataset, so I would like to understand why these tags are needed and why the start and end tags are split across two consecutive inputs. The relevant lines are here:
https://github.com/mbzuai-oryx/Video-ChatGPT/blob/2e2fcf0e561b34ecc9271c3a8da49d63e891bd40/scripts/convert_instruction_json_to_training_format.py#L46C1-L50C111
Thanks in advance for your response.
Quest2GM