rese1f / MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
https://rese1f.github.io/MovieChat/
BSD 3-Clause "New" or "Revised" License

About quantitative evaluation #31

Closed HJYao00 closed 11 months ago

HJYao00 commented 11 months ago

Hi, thanks for your work. I am curious to know what you did to improve the quantitative evaluation results from v1 to v2. Thanks.

Espere-1119-Song commented 11 months ago

Thank you for your interest. The quantitative evaluation results improved due to the hyperparameter settings. As you can see in the hyperparameter ablation part of the paper (Figure 5), the performance of MovieChat degrades when all four hyperparameters are changed significantly. In paper v1, we tried only one group of hyperparameter settings to demonstrate the effectiveness of our approach. In paper v2, we experimented with a large number of hyperparameter settings and selected the best-performing set.

HJYao00 commented 11 months ago

Thanks for your quick reply.

Espere-1119-Song commented 11 months ago

:)

HJYao00 commented 10 months ago

Hi, I have two questions.

  1. I notice that you use a QFormer before the features enter short-term memory (https://github.com/rese1f/MovieChat/blob/main/MovieChat/models/moviechat.py#L277), but after long-term memory you apply a QFormer again (https://github.com/rese1f/MovieChat/blob/main/MovieChat/models/moviechat.py#L407). So, do you use the QFormer twice? And is the frame_hidden_state on line 407 the output of the first QFormer?
  2. Are both QFormers from BLIP-2?

Espere-1119-Song commented 10 months ago

We use the QFormer twice, as described in the paper. Line 407 is the definition of video_query_output, and it uses the frame_hidden_state obtained from the first QFormer. For detailed information on the QFormer, please refer to our paper or VideoLLaMA.
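For readers skimming this thread, the two-pass pattern described above can be sketched roughly as follows. This is a toy illustration with made-up module names and dimensions, not the actual MovieChat code (the real model uses the BLIP-2 QFormer and a memory consolidation step between the two passes): learnable query tokens cross-attend once to per-frame patch features before memory, and a second time over the memory to produce a video-level representation.

```python
import torch
import torch.nn as nn


class TinyQFormer(nn.Module):
    """Toy stand-in for a BLIP-2-style QFormer: learnable query tokens
    cross-attend to input features and return a fixed-size summary."""

    def __init__(self, num_queries: int, dim: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_tokens, dim)
        q = self.queries.expand(features.size(0), -1, -1)
        out, _ = self.cross_attn(q, features, features)
        return out  # (batch, num_queries, dim)


dim = 64
frame_qformer = TinyQFormer(num_queries=32, dim=dim)  # first pass, per frame
video_qformer = TinyQFormer(num_queries=8, dim=dim)   # second pass, over memory

# First QFormer: compress each frame's patch tokens before they enter
# short-term memory (analogous to the call near moviechat.py#L277).
patch_tokens = torch.randn(16, 256, dim)          # 16 frames x 256 patches
frame_hidden_state = frame_qformer(patch_tokens)  # (16, 32, dim)

# Memory consolidation would happen here in the real model; for this
# sketch we simply flatten the per-frame states along the time axis.
memory = frame_hidden_state.reshape(1, -1, dim)   # (1, 16*32, dim)

# Second QFormer: aggregate the memory into a video-level representation
# (analogous to video_query_output near moviechat.py#L407).
video_query_output = video_qformer(memory)        # (1, 8, dim)
```

The key point the maintainers confirm is only the wiring: the output of the first QFormer is what flows, via memory, into the second one.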