mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International

Why is it important to change the order of content['q'] and <video>? #10

Closed shufangxun closed 1 year ago

shufangxun commented 1 year ago

As the comment says: https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/scripts/convert_instruction_json_to_training_format.py#L28

mmaaz60 commented 1 year ago

Hi @shufangxun,

Thank you for your interest in our work. The comment is referring to the importance of adding the <video> token to the annotation, which is replaced with the actual video content during training. It is not referring to the order of content['q'] and <video>. However, during training, this alternating order can be considered a form of regularization.
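To make the role of the placeholder concrete, here is a minimal sketch of how a single <video> token in an annotation could be expanded into repeated visual patch tokens before tokenization. The names (`VIDEO_TOKEN`, `PATCH_TOKEN`, `NUM_VIDEO_TOKENS`, `expand_video_token`) are illustrative assumptions, not the repo's actual identifiers, and the token count depends on the visual encoder:

```python
# Illustrative sketch (names are hypothetical, not from the repo):
# a single "<video>" placeholder in the annotation is swapped for a run of
# per-patch tokens whose embeddings are later filled with video features.

VIDEO_TOKEN = "<video>"
PATCH_TOKEN = "<vid_patch>"
NUM_VIDEO_TOKENS = 356  # assumed count; depends on the spatiotemporal encoder

def expand_video_token(prompt: str) -> str:
    """Replace the <video> placeholder with repeated patch tokens."""
    return prompt.replace(VIDEO_TOKEN, PATCH_TOKEN * NUM_VIDEO_TOKENS)

expanded = expand_video_token("<video>\nWhat is happening in the video?")
```

During the forward pass, the embeddings at the patch-token positions would be overwritten with the encoder's video features, which is why the annotation must contain the placeholder in the first place.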

Please let me know if this clarifies the confusion. Thank you.

shufangxun commented 1 year ago

Thanks for your reply! By the way, how large is the performance difference between these two orderings?

mmaaz60 commented 1 year ago

Hi @shufangxun,

In our training, we kept the text prompt after the video tokens in half of the data samples and before the video tokens in the other half. However, we did not conduct an ablation to determine which ordering is more effective. The performance difference should be negligible, as we are not updating the LLM's parameters.
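The half-and-half split described above can be sketched as a simple alternation on the sample index. This is a hypothetical illustration (the function name and exact formatting are assumptions, not the repo's code):

```python
# Illustrative sketch: alternate the position of the <video> token across
# samples so half the prompts are "text then video" and half "video then text".

VIDEO_TOKEN = "<video>"

def format_question(question: str, sample_idx: int) -> str:
    """Place the question before or after the video token by index parity."""
    if sample_idx % 2 == 0:
        return f"{question}\n{VIDEO_TOKEN}"  # text prompt first, video after
    return f"{VIDEO_TOKEN}\n{question}"      # video first, text prompt after
```

Alternating on index parity gives an exact 50/50 split without any extra bookkeeping, which matches the mild-regularization intent: the model sees both orderings equally often and cannot latch onto a fixed position for the visual tokens.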

Please let me know if you have any questions, or share any insights from your experiments.