rese1f / MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
https://rese1f.github.io/MovieChat/
BSD 3-Clause "New" or "Revised" License

Frame-aware? #40

Closed by jayavanth 9 months ago

jayavanth commented 9 months ago

Hello! Is this model frame-aware? Can I ask questions like "when does the person wearing yellow jacket appear in this video?" Based on the demos on Hugging Face, models like VideoChat don't seem able to do this, but in the paper I saw a figure where Video-LLaMA could say which frame it occurred in. Is MovieChat able to do that?

Espere-1119-Song commented 9 months ago

Thank you for your interest.

MovieChat uses a video Q-Former to encode temporal information. If you ask questions like "when does the person wearing yellow jacket appear in this video?", the answer you will probably get is "It appears after ... (something happens)". In our experiments, we found that although MovieChat may give frame-aware answers (see Figure F2 in the paper for details), they are not accurate enough. This is because no explicit time information is involved during training.
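For context on what a frame-aware answer would require: a frame index only becomes a time once the frame rate is known. The snippet below is a minimal sketch of that conversion, not MovieChat code. The function name `frame_to_timestamp` and the `fps` parameter are hypothetical, and it assumes a constant frame rate with zero-based frame indices.

```python
def frame_to_timestamp(frame_index: int, fps: float) -> str:
    """Convert a zero-based frame index to an HH:MM:SS.mmm string.

    Assumes a constant frame rate (hypothetical helper, not part of MovieChat).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")

    # Work in whole milliseconds to avoid floating-point drift when formatting.
    total_ms = round(frame_index * 1000 / fps)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


if __name__ == "__main__":
    # Frame 250 of a 25 fps video sits 10 seconds into the clip.
    print(frame_to_timestamp(250, 25.0))
```

A model trained without timestamps or frame indices has no signal tying its visual tokens to values like these, which is consistent with the inaccurate frame-level answers described above.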

In addition, MovieChat uses Video-LLaMA as its base model, but I checked the Video-LLaMA paper and did not find the frame-aware example you mentioned. Could you share a screenshot with me?

jayavanth commented 9 months ago

Thanks, Enxin! This is the example where Video-LLaMA references the frame number:

[Image: Screenshot 2024-02-06 at 10 23 02 AM]

Espere-1119-Song commented 9 months ago

Thanks for your support. Given the structure of Video-LLaMA, I think it produced that answer just by chance.

jayavanth commented 9 months ago

I see. Thank you!