mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International
1.05k stars 92 forks source link

Why is the <video> tag is needed in training json? #107

Closed Quest2GM closed 2 weeks ago

Quest2GM commented 1 month ago

Hi @mmaaz60,

I am wondering why it is critical to include the video start tag start in one human conversation input and then the video end tag in the subsequent human input? I am currently in the process of creating my own dataset, so I was just wondering why these tags are needed and why the start and end tags are in two subsequent inputs?

https://github.com/mbzuai-oryx/Video-ChatGPT/blob/2e2fcf0e561b34ecc9271c3a8da49d63e891bd40/scripts/convert_instruction_json_to_training_format.py#L46C1-L50C111

Thanks for your response.

Quest2GM

mmaaz60 commented 2 weeks ago

Hi @Quest2GM,

I appreciate your interest in our work. The discussion at https://github.com/mbzuai-oryx/Video-ChatGPT/issues/10 would be helpful address your question.