mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International

Can I apply your weights to vicuna 13b or 33b? #36

Closed: msarkeshi closed this issue 9 months ago

hanoonaR commented 11 months ago

Hi @msarkeshi,

The Vicuna 13B and 33B models have different hidden dimensions compared to the Vicuna 7B. As a result, our linear layer projections, which have been tuned specifically for the 7B model, will not be directly compatible with these models.
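To make the mismatch concrete, here is a minimal sketch of the shape check involved. The hidden sizes are assumptions taken from the public LLaMA/Vicuna model configs, the 1024-dimensional visual feature size assumes a CLIP ViT-L/14 encoder, and the helper names are hypothetical, not part of this repository.

```python
# Assumed hidden sizes (from the public LLaMA/Vicuna configs, not from this repo).
LLM_HIDDEN_SIZE = {"vicuna-7b": 4096, "vicuna-13b": 5120, "vicuna-33b": 6656}

# Assumed visual feature size (CLIP ViT-L/14 style encoder).
VISUAL_FEATURE_DIM = 1024


def projection_shape(llm_name):
    """Weight shape (out_features, in_features) of a linear layer that maps
    visual features into the embedding space of the given LLM."""
    return (LLM_HIDDEN_SIZE[llm_name], VISUAL_FEATURE_DIM)


def is_compatible(trained_for, target):
    """A projection trained for one LLM can be reused only if the shapes match."""
    return projection_shape(trained_for) == projection_shape(target)


for target in LLM_HIDDEN_SIZE:
    print(target, projection_shape(target), is_compatible("vicuna-7b", target))
```

Under these assumptions, only the 7B target matches. Using a larger LLM would mean re-training the projection layer with the new output dimension.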

Thank you.

mehdisarkeshi commented 11 months ago

Thanks. This is great work. I am seeing some hallucinations with some of the videos I try. For example, it sees people who don't exist, or objects (like a cell phone) that are not in the video. Is upgrading to a bigger LLM a solution to this? Is there any parameter (other than temperature) that I can tune to minimize hallucinations?

mmaaz60 commented 9 months ago

Hi @mehdisarkeshi,

One of the quickest solutions could be to explicitly instruct the model to be brief. However, a better solution could be to use a bigger model and cleaner data.
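As a rough illustration of the first suggestion, the brevity instruction can simply be appended to each question before it is sent to the model. This is a sketch only: `make_brief_prompt` and the instruction wording are hypothetical and are not part of the Video-ChatGPT code.

```python
# Hypothetical wording; adjust to taste.
BREVITY_INSTRUCTION = "Answer briefly, in one or two sentences."


def make_brief_prompt(question: str) -> str:
    """Append an explicit brevity instruction to the user's question."""
    return f"{question.strip()} {BREVITY_INSTRUCTION}"


print(make_brief_prompt("What is happening in the video?"))
```

Shorter answers leave the model less room to add details that are not in the video, though this does not remove hallucinations entirely.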