Why is the <video> tag needed in the training JSON?

mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

Creative Commons Attribution 4.0 International

1.05k stars, 92 forks

Hi @mmaaz60,

I am wondering why it is critical to include the video start tag in one human conversation input and the video end tag in the subsequent human input. I am currently creating my own dataset, so I would like to understand why these tags are needed and why the start and end tags are placed in two consecutive inputs.

https://github.com/mbzuai-oryx/Video-ChatGPT/blob/2e2fcf0e561b34ecc9271c3a8da49d63e891bd40/scripts/convert_instruction_json_to_training_format.py#L46C1-L50C111

Thanks for your response.

Quest2GM

Why is the <video> tag needed in the training JSON? #107