Closed by swethakrishn 4 weeks ago
Thanks for your question. Just to clarify, are you expecting a shorter and more concise answer for this query?
No, but the output for the MovieChat-1K video does not answer the question.
You can try reducing the number of max new tokens. We also acknowledge that since MovieChat sends more visual tokens to the LLM without additional training, it might affect the model’s instruction-following capability to some extent.
Reduced max_new_tokens, but the output still does not answer the question; there's no mention of the word 'door'.
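For anyone who wants to repeat this kind of check across several runs, here is a minimal sketch of a whole-word keyword test. The helper name `mentions_keyword` is hypothetical and not part of the MovieChat code; the keyword 'door' is taken from the comment above, and the sample string is a shortened excerpt of the output quoted in this thread.

```python
import re


def mentions_keyword(answer: str, keyword: str) -> bool:
    """Return True if `keyword` occurs in `answer` as a whole word, ignoring case."""
    # \b anchors keep 'door' from matching inside words such as 'outdoors'.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None


# Shortened excerpt of the model output quoted in this thread.
answer = (
    "Finally, in the twelfth scene, he is standing on a boat "
    "with a woman on the dock."
)

print(mentions_keyword(answer, "door"))
print(mentions_keyword(answer, "dock"))
```

A keyword match is only a rough signal: an answer can be correct without using the exact word, so this is a sanity check rather than an evaluation metric.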
Resolved: running eval_code/result_prepare/run_inference_qa_moviechat.py gives the expected results.
Thanks for your work on MovieChat and the dataset.
I tried running inference as per the instructions, this time in global mode with a video from the MovieChat-1K test set:
!python inference.py --cfg-path eval_configs/MovieChat.yaml --gpu-id 0 --num-beams 1 --temperature 1.0 --text-query "Where does the man on the boat finally go?" --video-path src/examples/1.mp4 --fragment-video-path src/video_fragment/output.mp4 --cur-min 0 --cur-sec 0 --middle-video 0
I set the MovieChat.yaml parameters as:
The output I get is:
Answer: In the first scene, the man is standing on a boat, looking at the ocean. In the second scene, he is in the water on the boat. In the third scene, he is walking on the steps of the boat. In the fourth scene, the man is standing on a boat with a woman on the dock. In the fifth scene, he is standing on a boat looking at the ocean. In the sixth scene, he is sitting on a boat with a woman on the dock. In the seventh scene, he is standing on a boat looking at the ocean. In the eighth scene, he is sitting on the boat looking at the ocean. In the ninth scene, he is standing on a boat looking at the ocean. In the tenth scene, the man is standing on a boat with a woman on the dock. In the eleventh scene, he is standing on a boat looking at the ocean. Finally, in the twelfth scene, he is standing on a boat with a woman on the dock. Throughout the video, we can see the man wearing a tie, a green tie, a brown shirt, and a black turtleneck. The scenes are also filled with boats, water, and the ocean.
The ground truth as per the dataset is:
Fyi, using the cooking cake video gave this answer: Question: What is he doing? Answer: In the first scene, a man is standing in a kitchen, holding a knife and cutting a pie in a glass on a white surface. The next scene shows a man using a metal spoon to mix dough in a clear glass on a white kitchen surface. He then pours the dough into a glass, and a man is seen in the kitchen with a pie on a plate. The video then shows a man pouring a liquid into a glass on a white countertop, and he then makes dough in a glass on a white surface. Finally, a man is seen in the kitchen with a pie on a plate on a white countertop. He is then shown using a spoon to mix dough in a glass on a white surface. Throughout the video, different types of food and kitchen items can be seen, such as eggs, pizza, bread, doughnuts, cookies, pastries, plates, bowls, pots, spoon, knife, metal, and water.
Clearly there's something off. What am I doing wrong?