mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International

Alternative choices for linear layer #20

Closed: Kratos-Wen closed this issue 12 months ago

Kratos-Wen commented 1 year ago

Thank you for the excellent work! In your paper, you mention that you experimented with more complex network models in addition to linear layers. Will you publish the details and evaluation results of those other attempts?

Thanks in advance!

mmaaz60 commented 12 months ago

Hi @Kratos-Wen,

Thanks for your interest in our project and for bringing up this question.

In our experiments, we found that a linear layer met our performance objectives effectively. Although we noted the potential of more complex designs, we did not experiment with them in depth, so we do not have specific numbers for them.

An important point to consider is that these more complex models may not necessarily outperform our chosen architecture. The major reason is that they cannot be initialized from the pretrained LLaVA weights.
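To illustrate the initialization point, here is a minimal sketch, not the project's actual code. It assumes a pretrained checkpoint that stores a single linear projection, and it compares that against a hypothetical two-layer MLP adapter. The parameter names, the helper functions, and the dimensions (1024, 2048, 4096) are all illustrative assumptions, and parameters are represented only by their shapes so the example runs without any deep learning library.

```python
# Illustrative dimensions only (assumptions, not taken from the repository).
VISUAL_DIM, HIDDEN_DIM, LLM_DIM = 1024, 2048, 4096


def linear_projector_shapes(in_dim, out_dim):
    """Parameter name -> shape for a single linear projection layer."""
    return {"weight": (out_dim, in_dim), "bias": (out_dim,)}


def mlp_projector_shapes(in_dim, hidden_dim, out_dim):
    """Parameter name -> shape for a hypothetical two-layer MLP adapter."""
    return {
        "0.weight": (hidden_dim, in_dim),
        "0.bias": (hidden_dim,),
        "2.weight": (out_dim, hidden_dim),
        "2.bias": (out_dim,),
    }


def loadable_params(checkpoint, model):
    """Names in `model` whose name and shape both match an entry in `checkpoint`."""
    return sorted(
        name for name, shape in model.items() if checkpoint.get(name) == shape
    )


# Stand-in for a pretrained checkpoint that contains one linear projection.
pretrained = linear_projector_shapes(VISUAL_DIM, LLM_DIM)

linear = linear_projector_shapes(VISUAL_DIM, LLM_DIM)
mlp = mlp_projector_shapes(VISUAL_DIM, HIDDEN_DIM, LLM_DIM)

print(loadable_params(pretrained, linear))  # ['bias', 'weight']
print(loadable_params(pretrained, mlp))     # []
```

Under these assumptions, every parameter of the linear projector can start from the pretrained values, while none of the MLP's parameters has a matching entry, so the MLP adapter would have to be trained from a fresh initialization.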

Hope this clarifies things a bit. Thank you.

Kratos-Wen commented 12 months ago

Thanks, that pretty much answers my question! Congratulations again on your results!