rese1f / MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
https://rese1f.github.io/MovieChat/
BSD 3-Clause "New" or "Revised" License
531 stars, 41 forks

Running inference.py #84

Closed: swethakrishn closed this issue 4 weeks ago

swethakrishn commented 1 month ago

Thanks for your work on MovieChat and the dataset.

I ran inference following the instructions, this time in global mode on a video from the MovieChat-1K test set:

!python inference.py --cfg-path eval_configs/MovieChat.yaml --gpu-id 0 --num-beams 1 --temperature 1.0 --text-query "Where does the man on the boat finally go?" --video-path src/examples/1.mp4 --fragment-video-path src/video_fragment/output.mp4 --cur-min 0 --cur-sec 0 --middle-video 0

I set the MovieChat.yaml parameters as:

llama_model: "Enxin/MovieChat-vicuna"
llama_proj_model: "dldweights/pretrained_minigpt4.pth"
ckpt: "dldweights/finetune-vicuna7b-v2.pth"

The output I get is:

...
Moviepy - Done !
Moviepy - video ready src/video_fragment/output.mp4
127
Moviepy - Building video src/video_fragment/output.mp4.
MoviePy - Writing audio in outputTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video src/video_fragment/output.mp4

Moviepy - Done !
Moviepy - video ready src/video_fragment/output.mp4
2024-10-22 13:20:49.162817: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-22 13:20:49.181480: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-22 13:20:49.186942: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-22 13:20:50.529819: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Answer: In the first scene, the man is standing on a boat, looking at the ocean. In the second scene, he is in the water on the boat. In the third scene, he is walking on the steps of the boat. In the fourth scene, the man is standing on a boat with a woman on the dock. In the fifth scene, he is standing on a boat looking at the ocean. In the sixth scene, he is sitting on a boat with a woman on the dock. In the seventh scene, he is standing on a boat looking at the ocean. In the eighth scene, he is sitting on the boat looking at the ocean. In the ninth scene, he is standing on a boat looking at the ocean. In the tenth scene, the man is standing on a boat with a woman on the dock. In the eleventh scene, he is standing on a boat looking at the ocean. Finally, in the twelfth scene, he is standing on a boat with a woman on the dock. Throughout the video, we can see the man wearing a tie, a green tie, a brown shirt, and a black turtleneck. The scenes are also filled with boats, water, and the ocean.

The ground truth as per the dataset is:

{
  "answer": "Into a door.",
  "question": "Where does the man on the boat finally go?"
}

FYI, using the cooking cake video gave this answer:

Question: What is he doing?

Answer: In the first scene, a man is standing in a kitchen, holding a knife and cutting a pie in a glass on a white surface. The next scene shows a man using a metal spoon to mix dough in a clear glass on a white kitchen surface. He then pours the dough into a glass, and a man is seen in the kitchen with a pie on a plate. The video then shows a man pouring a liquid into a glass on a white countertop, and he then makes dough in a glass on a white surface. Finally, a man is seen in the kitchen with a pie on a plate on a white countertop. He is then shown using a spoon to mix dough in a glass on a white surface. Throughout the video, different types of food and kitchen items can be seen, such as eggs, pizza, bread, doughnuts, cookies, pastries, plates, bowls, pots, spoon, knife, metal, and water.

Clearly there's something off. What am I doing wrong?

Espere-1119-Song commented 1 month ago

Thanks for your question. Just to clarify, are you expecting a shorter and more concise answer for this query?

swethakrishn commented 1 month ago

No, but the answer for the MovieChat-1K video does not answer the question.

Espere-1119-Song commented 1 month ago

You can try reducing the maximum number of new tokens. We also acknowledge that since MovieChat sends more visual tokens to the LLM without additional training, this might weaken the model’s instruction-following capability to some extent.

swethakrishn commented 1 month ago

Reduced max_new_tokens, but the output still does not answer the question; there's no mention of the word 'door'.
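
For reference, the check described above (whether the generated answer mentions the ground-truth word at all) can be sketched in a few lines of Python. This is only a rough keyword test, not the repository's evaluation method; the function name `mentions_ground_truth` and the small stop-word list are assumptions made for illustration.

```python
import re

# Hypothetical stop-word list: words too generic to count as evidence.
STOP_WORDS = {"a", "an", "the", "into", "in", "to", "of", "on"}


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def mentions_ground_truth(answer: str, ground_truth: str) -> bool:
    """Return True if every non-stop-word of the ground truth occurs in the answer."""
    keywords = tokenize(ground_truth) - STOP_WORDS
    return bool(keywords) and keywords <= tokenize(answer)


# Ground-truth entry as quoted from the dataset in this thread.
qa = {
    "answer": "Into a door.",
    "question": "Where does the man on the boat finally go?",
}

# First sentence of the answer produced by inference.py.
model_answer = (
    "In the first scene, the man is standing on a boat, looking at the ocean."
)

print(mentions_ground_truth(model_answer, qa["answer"]))  # False
```

A check like this only flags answers that never mention the expected word; a match does not mean the answer is correct.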

swethakrishn commented 4 weeks ago

Resolved; using eval_code/result_prepare/run_inference_qa_moviechat.py gives expected results.