rese1f / MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
https://rese1f.github.io/MovieChat/
BSD 3-Clause "New" or "Revised" License
527 stars · 41 forks

Weird outputs #76

Open ssantos97 opened 2 months ago

ssantos97 commented 2 months ago

Why do some outputs look like this:

Moviepy - Done !                                                                                                                    
Moviepy - video ready src/video_fragment/output.mp4
It is a cake. 
What is the purpose of the video? 
It is a demonstration of how to make a cake. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What is the cake made of? 
It is made of flour. 
What

I'm using LLaMA 2 with VL-LLaMA 2, not LLaMA 2 Chat. Could that be the reason?
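(Note for readers hitting the same looping output: at generation time this is usually mitigated with a repetition penalty or stop strings in the decoding config, but a cheap post-hoc workaround is to truncate the text once the same line starts repeating. The sketch below is a generic helper, not part of MovieChat, and the threshold of 3 repeats is an arbitrary choice.)

```python
def truncate_repetition(text: str, min_repeats: int = 3) -> str:
    """Cut generated text at the point where a line has already
    appeared `min_repeats` times, a cheap post-hoc fix for
    degenerate looping output like the repeated Q/A above."""
    lines = text.splitlines()
    seen = {}
    for i, line in enumerate(lines):
        key = line.strip()
        if not key:
            continue
        seen[key] = seen.get(key, 0) + 1
        if seen[key] >= min_repeats:
            # keep everything before the line that tips over the threshold
            return "\n".join(lines[:i]).rstrip()
    return text
```

Applied to the output above, this would keep the first couple of question/answer turns and drop the rest of the loop.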

Espere-1119-Song commented 2 months ago

We recommend trying Llama2 Chat for your use case. If you run into any more issues or have other questions, feel free to reach out to us.

ssantos97 commented 2 months ago

Another question: what is the formula for the accuracy metric in your paper, and which paper does it come from?

Espere-1119-Song commented 2 months ago

We use the evaluation metric proposed in Video-ChatGPT: https://github.com/mbzuai-oryx/Video-ChatGPT
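(For context: in Video-ChatGPT-style evaluation, an LLM judge is prompted with the question, the ground-truth answer, and the model's prediction, and returns a yes/no correctness verdict plus a 0-5 score; accuracy is the fraction of "yes" verdicts and the score is averaged. A minimal aggregation sketch, assuming the judge outputs have already been parsed into `{"pred": "yes"|"no", "score": int}` dicts:)

```python
def summarize_gpt_eval(judgments):
    """Aggregate per-question judge verdicts into the accuracy and
    average score reported in Video-ChatGPT-style evaluation.
    `judgments` is assumed to be a list of parsed judge outputs,
    e.g. {"pred": "yes", "score": 4} -- not raw API responses."""
    n = len(judgments)
    if n == 0:
        return {"accuracy": 0.0, "avg_score": 0.0}
    yes = sum(1 for j in judgments if j["pred"].lower() == "yes")
    total_score = sum(j["score"] for j in judgments)
    return {"accuracy": yes / n, "avg_score": total_score / n}
```

See the Video-ChatGPT repo linked above for the actual judge prompts and parsing.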

ssantos97 commented 2 months ago

Just one more thing: could you provide example evaluation code for MovieChat-1K?

Espere-1119-Song commented 2 months ago

Please see https://github.com/rese1f/MovieChat/tree/main/eval_code for details

ssantos97 commented 2 months ago

Thanks. Another thing: you instantiate the LLaMA 2 7B Chat model and then load the checkpoint from the fine-tuned Video-LLaMA 2 7B. But the latter includes weights for the visual encoder and the Q-Former. How are they compatible? Additionally, what is the purpose of using LLaMA 2 7B Chat if we then load the fine-tuned version of Video-LLaMA?

Espere-1119-Song commented 2 months ago

The Q-Former weights of Video-LLaMA are compatible with both LLaMA and LLaMA 2.

ssantos97 commented 2 months ago

But where are the Q-Former weights used in LLaMA 2 7B Chat, if the instantiated model does not contain a Q-Former in its architecture? Could you explain?

Espere-1119-Song commented 2 months ago

The Q-Former is part of Video-LLaMA, not of LLaMA 2 itself. You can refer to the code of MovieChat and Video-LLaMA for details.

ssantos97 commented 2 months ago

Which setup do you use for your experiments: llama_model: "ckpt/llama2/llama-2-7b-chat-hf" or llama_model: "ckpt/moviechat_llama7b", and ckpt: "ckpt/VL_LLaMA_2_7B_Finetuned.pth" or ckpt: "ckpt/finetune-vicuna7b-v2.pth"? I ask because I can't reproduce results similar to yours with LLaMA 2 7B Chat and VL_LLaMA. If you use Vicuna, how do you obtain the original LLaMA weights? They are not available anymore.

ssantos97 commented 2 months ago

And what is moviechat_llama7b?

Espere-1119-Song commented 2 months ago

We use llama_model: "ckpt/moviechat_llama7b" and ckpt: "ckpt/finetune-vicuna7b-v2.pth". moviechat_llama7b is the Vicuna model used in MovieChat.

ssantos97 commented 2 months ago

OK, I merge the original LLaMA 1 weights with vicuna-7b-delta-v0 using the apply_delta function, which gives me a Vicuna/7B folder that I point llama_model at (ckpt/Vicuna/7B). For ckpt I use finetune-vicuna7b-v2.pth, but I'm still getting weird outputs. What am I doing wrong? Thank you
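(For readers unfamiliar with the delta step: FastChat's apply_delta adds the published Vicuna delta tensors to the base LLaMA tensors, entry by entry in the state dict, to reconstruct the full Vicuna weights. A purely conceptual sketch with plain Python lists standing in for tensors; the real tool also handles tokenizer files and runs on actual checkpoints:)

```python
def apply_delta(base, delta):
    """Conceptual sketch of Vicuna delta merging: each delta value is
    added elementwise to the matching base-LLaMA value. `base` and
    `delta` map parameter names to lists of floats (toy stand-ins
    for tensors in a real state dict)."""
    merged = {}
    for name, base_w in base.items():
        merged[name] = [b + d for b, d in zip(base_w, delta[name])]
    return merged
```

In practice this is done with FastChat's CLI (roughly `python -m fastchat.model.apply_delta --base-model-path ... --target-model-path ... --delta-path lmsys/vicuna-7b-delta-v0`; check the FastChat docs for the exact flags for your version).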

Espere-1119-Song commented 2 months ago

Did you try the moviechat_llama7b we provide on HuggingFace? It is already the apply_delta version.

ssantos97 commented 2 months ago

You mean this link: https://huggingface.co/Enxin/MovieChat-vicuna?

Espere-1119-Song commented 2 months ago

sure

ssantos97 commented 2 months ago

I keep getting the same weird outputs. It's strange, because with Llama-2-7b-chat-hf and VL_LLaMA_2_7B_Finetuned.pth it works.

ssantos97 commented 2 months ago

SOLVED. For future reference: llama_model must also be changed in MovieChat/configs/models/moviechat.yaml to point at the apply_delta weights you provide.
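(Summarizing the fix for future readers: both places that name the language model must point at the same merged Vicuna weights. A sketch of the relevant fields, using only the key names mentioned in this thread; check the actual config files in the repo for the exact layout:)

```yaml
# MovieChat/configs/models/moviechat.yaml
llama_model: "ckpt/moviechat_llama7b"      # the apply_delta Vicuna weights from HuggingFace

# inference/eval config
ckpt: "ckpt/finetune-vicuna7b-v2.pth"      # MovieChat checkpoint matching those weights
```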

ssantos97 commented 1 month ago

Could you provide the code for evaluating the consistency metric, including how you use two different questions in the judge prompt and which questions they are? It would be super helpful, as I want to do a fair comparison with your method.

Thanks
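(Background on the metric being asked about: in Video-ChatGPT-style consistency evaluation, the model answers two differently phrased questions about the same fact, and an LLM judge decides whether the two answers agree with each other and with the ground truth. The sketch below only illustrates how such a judge prompt might be assembled; it is NOT the exact prompt from Video-ChatGPT or MovieChat, whose eval scripts should be consulted for the real wording:)

```python
def build_consistency_prompt(q1, q2, a1, a2, ground_truth):
    """Illustrative judge prompt for a consistency check over two
    paraphrased questions. The wording here is a placeholder, not
    the official evaluation prompt."""
    return (
        "You are evaluating the consistency of a video QA model.\n"
        f"Question 1: {q1}\nAnswer 1: {a1}\n"
        f"Question 2: {q2}\nAnswer 2: {a2}\n"
        f"Correct answer: {ground_truth}\n"
        "Reply 'yes' if both answers are consistent with each other "
        "and with the correct answer, otherwise reply 'no'."
    )
```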