mbzuai-oryx / VideoGPT-plus

Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Creative Commons Attribution 4.0 International

You mean phi3 surpassed mistral7B? #1

Closed MonolithFoundation closed 2 weeks ago

MonolithFoundation commented 3 weeks ago

I find this really unexpected. How could a Phi-3 model surpass Mistral-7B, given that VideoChat2 uses a giant vision encoder? Which part is actually doing the work here?

mmaaz60 commented 3 weeks ago

Hi @MonolithFoundation,

I appreciate your interest in our work. As per VideoChat2 paper, they have reported an average of 60.4 on MVBench with Mistral-7B LLM. In our case, VideoGPT+ obtains 58.7 average score on MVBench with Phi-3-mini-3.8B LLM.

We have released all the model checkpoints, along with the training and evaluation code, to reproduce our reported results. I hope this helps.

Please let me know if you have any questions. Thank you.

zimenglan-sysu-512 commented 2 weeks ago

can eight v100 GPUs train the model?

lucasjinreal commented 2 weeks ago

@mmaaz60 From the first picture, VideoGPT+ surpassed VideoChat2 by a clear margin, but VideoChat2 with Mistral actually gets better results as of now.

These days, video MLLMs don't really seem to account for which LLM size they are using...

mmaaz60 commented 2 weeks ago

can eight v100 GPUs train the model?

Hi @zimenglan-sysu-512

I appreciate your interest in our work. As we use Phi-3-Mini with 3.8B parameters as the LLM, the model can be trained easily on 8 V100 GPUs with 32 GB of memory each. However, we have to turn off flash attention, as it is not supported on V100 GPUs.

I hope it will help. Good Luck! And please let me know if you face any issues.
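As a minimal sketch of the workaround above: with Hugging Face `transformers`, flash attention can be disabled by requesting the standard ("eager") attention implementation at load time, since FlashAttention kernels require Ampere-or-newer GPUs and are unavailable on V100. The checkpoint name below is illustrative, not the exact one used in this repository.

```python
# Load-time kwargs that fall back to standard attention (no flash-attn),
# which is required on V100 GPUs where FlashAttention is unsupported.
load_kwargs = dict(
    attn_implementation="eager",  # standard PyTorch attention instead of flash-attn
    torch_dtype="auto",           # pick the checkpoint's native precision
)

# Hypothetical usage (checkpoint name is an assumption, not the repo's exact path):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "microsoft/Phi-3-mini-4k-instruct", **load_kwargs
# )
```

The same `attn_implementation="eager"` switch applies to any `from_pretrained` call in the training scripts that would otherwise default to flash attention.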

mmaaz60 commented 2 weeks ago

@mmaaz60 From the first picture, VideoGPT+ surpassed VideoChat2 by a clear margin, but VideoChat2 with Mistral actually gets better results as of now.

These days, video MLLMs don't really seem to account for which LLM size they are using...

Hi @lucasjinreal

Thank you for your interest in our work. VideoGPT+ uses the Phi-3-mini LLM with only 3.8B parameters, which is relatively weaker compared to Mistral-7B.

On the other hand, if we compare the Vicuna-7B-based variants of both VideoGPT+ and VideoChat2, VideoChat2 obtains an average of 51.1 on MVBench, while our Vicuna-7B-based variant obtains an average score of 53.1.

Further, there are gains on the VCGBench and VCGBench-Diverse evaluations as well.

We acknowledge that VideoChat2 is a strong video conversation model; however, VideoGPT+ obtains better results on multiple benchmarks, as discussed in our technical report, and all the code needed to reproduce our reported results is released on GitHub.
