rese1f / MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
https://rese1f.github.io/MovieChat/
BSD 3-Clause "New" or "Revised" License

Some question about timestamp. #82

Open Leon1207 opened 2 days ago

Leon1207 commented 2 days ago

Hi, thanks for your work. What does the "time" field mean for breakpoint mode in your JSON file? Also, when you evaluate LLM-based video understanding models such as VideoChat, does breakpoint mode act more like single-image perception than long-video understanding? Thanks!

Espere-1119-Song commented 2 days ago

In breakpoint mode, "time" refers to the moment when the video is paused during evaluation. You can think of it as the point where the model needs to analyze and respond based on the content leading up to that specific timestamp. Most questions in breakpoint mode are related to the context from the recent video segments before the “time.”

To evaluate LLM-based video understanding models in breakpoint mode, we simulate long-video understanding by sampling frames evenly from the video clips leading up to the “time.”
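
For illustration, even sampling up to a breakpoint could look like the sketch below. This is not the MovieChat evaluation code: the function name `sample_frame_indices`, the `num_frames` default, and the choice of frame 0 as the default start are all assumptions made for the example.

```python
def sample_frame_indices(end_frame, num_frames=8, start_frame=0):
    """Return evenly spaced frame indices in [start_frame, end_frame], inclusive.

    Hypothetical helper: `end_frame` plays the role of the breakpoint "time";
    `num_frames` and `start_frame` are assumptions, not benchmark settings.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if end_frame < start_frame:
        raise ValueError("end_frame must not precede start_frame")
    if num_frames == 1:
        # With a single frame, take the one at the breakpoint itself.
        return [end_frame]
    step = (end_frame - start_frame) / (num_frames - 1)
    indices = [start_frame + round(i * step) for i in range(num_frames)]
    # Short clips can yield repeated indices; drop repeats, keep order.
    return list(dict.fromkeys(indices))


# Example: six frames spread over everything up to time=750.
print(sample_frame_indices(750, num_frames=6))
```

Both endpoints are included, so the frame at the breakpoint itself is always part of the sample.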

Leon1207 commented 2 days ago

Thanks for your reply! So does time=750 mean that the model needs to consider the video clip from frame 0 to frame 750?

Espere-1119-Song commented 2 days ago

We don't specify a start time that the model needs to consider; time=750 only represents the ending frame.

Leon1207 commented 2 days ago

If I want to evaluate my own LLM-based video understanding model in breakpoint mode, can I uniformly sample frames between frame 0 and frame 750?

Espere-1119-Song commented 2 days ago

It depends on your strategy
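
As a rough sketch of two possible strategies (neither is prescribed by the benchmark), the clip can either cover everything before the breakpoint or only a recent window ending at it. The helper name `clip_bounds` and the `window` parameter are hypothetical, and the window size of 256 frames is an arbitrary example value.

```python
def clip_bounds(time, window=None):
    """Return the (start, end) frame range for a breakpoint at `time`.

    Hypothetical helper. `window=None` uses the whole prefix from frame 0;
    otherwise only the last `window` frames before the breakpoint are used.
    """
    if time < 0:
        raise ValueError("time must be non-negative")
    if window is None:
        return 0, time
    if window < 0:
        raise ValueError("window must be non-negative")
    # Clamp at frame 0 when the breakpoint is earlier than the window length.
    return max(0, time - window), time


print(clip_bounds(750))              # whole prefix
print(clip_bounds(750, window=256))  # recent window only
```

Since only the ending frame is fixed, the choice of start is part of the evaluation strategy of the model being tested.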

Leon1207 commented 2 days ago

Thanks for your explanation~