Open Leon1207 opened 1 month ago
In breakpoint mode, "time" refers to the moment when the video is paused during evaluation. You can think of it as the point where the model needs to analyze and respond based on the content leading up to that specific timestamp. Most questions in breakpoint mode relate to the context in the recent video segments before the “time.”
When evaluating LLM-based video understanding models in breakpoint mode, we simulate long-video understanding by sampling frames evenly from the video leading up to the “time.”
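As a rough illustration of that even sampling, here is a minimal sketch (not the benchmark's actual evaluation code). It assumes "time" is a frame index, as the rest of this thread suggests. The function name `sample_frame_indices` and the parameters `num_frames` and `start_frame` are hypothetical and do not come from the repository.

```python
def sample_frame_indices(end_frame, num_frames, start_frame=0):
    """Return num_frames indices spread evenly over [start_frame, end_frame].

    end_frame plays the role of the breakpoint "time" (assumed to be a
    frame index). start_frame is a hypothetical knob: the benchmark does
    not specify where the context should begin.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if end_frame < start_frame:
        raise ValueError("end_frame must be >= start_frame")
    if num_frames == 1:
        # With a single frame, take the one at the breakpoint itself.
        return [end_frame]
    step = (end_frame - start_frame) / (num_frames - 1)
    # Rounding can yield repeated indices when the span is shorter
    # than num_frames; callers may deduplicate if that matters.
    return [round(start_frame + i * step) for i in range(num_frames)]


# Whole video up to the breakpoint at frame 750
print(sample_frame_indices(750, 6))                   # [0, 150, 300, 450, 600, 750]
# Only the most recent 250 frames before the breakpoint
print(sample_frame_indices(750, 6, start_frame=500))  # [500, 550, 600, 650, 700, 750]
```

Since most breakpoint questions concern recent context, a later `start_frame` trades long-range coverage for denser sampling near the breakpoint; which to prefer depends on the model and its frame budget.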
Thanks for your reply! So time=750 means the model needs to consider the video clip from frame 0 to frame 750?
We don't specify a start time that the model needs to consider; time=750 only represents the ending frame.
If I want to evaluate my LLM-based video understanding model, can I uniformly sample the frames between frame 0 and frame 750 in breakpoint mode?
It depends on your strategy.
Thanks for your explanation~
Hi, thanks for your work. What does the "time" field in breakpoint mode mean in your JSON file? And when you evaluate LLM-based video understanding models like VideoChat, does breakpoint mode act more like single-image perception than long-video understanding? Thanks!