rese1f / MovieChat

[CVPR 2024] 🎬💭 chat with over 10K frames of video!
https://rese1f.github.io/MovieChat/
BSD 3-Clause "New" or "Revised" License

Questions about extracted features #65

Open Hou9612 opened 1 month ago

Hou9612 commented 1 month ago

Hello,

Thank you for this wonderful work!

The shape of the provided features is [64, 257, 1408]. I have the following questions about them:

(1) What do 257 and 1408 mean? Is 257 the number of tokens per frame and 1408 the feature dimension?

(2) Can I use only the CLS token feature of each frame when training the model and evaluating its performance? The complete feature set is about 16 TB, and I don't have enough storage space to hold it.
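To make question (2) concrete, here is a minimal sketch of what keeping only the CLS token would look like. It rests on assumptions not confirmed in this thread: that the three axes are (frames, tokens per frame, feature dim), that the CLS token sits at index 0 of the token axis (the usual ViT layout, where 257 = 1 CLS + 16 x 16 patch tokens), and that the features load as a NumPy array. The helper name `keep_cls_only` and the float16 dtype are hypothetical and used only for illustration.

```python
import numpy as np

# Assumed layout (not confirmed by the authors): (frames, tokens, feature_dim),
# with the CLS token stored at token index 0.
NUM_FRAMES, NUM_TOKENS, FEATURE_DIM = 64, 257, 1408


def keep_cls_only(features: np.ndarray, cls_index: int = 0) -> np.ndarray:
    """Return one vector per frame, shape (frames, feature_dim).

    Hypothetical helper: it selects a single token along axis 1 and copies
    the result so the large input array can be freed afterwards.
    """
    if features.ndim != 3:
        raise ValueError(f"expected a 3-D array, got shape {features.shape}")
    return np.ascontiguousarray(features[:, cls_index, :])


# Placeholder array standing in for one video's features (dtype is an assumption).
full = np.zeros((NUM_FRAMES, NUM_TOKENS, FEATURE_DIM), dtype=np.float16)
cls_only = keep_cls_only(full)

print(cls_only.shape)                      # (64, 1408)
print(full.nbytes // cls_only.nbytes)      # 257
```

Under these assumptions the CLS-only features take 1/257 of the original space regardless of dtype, so a set of about 16 TB would shrink to roughly 60 GB. Whether the model still trains and evaluates well on CLS-only features is a separate question that this sketch does not answer.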