Hello,
Thank you for this wonderful work!
The shape of the provided features is [64, 257, 1408]. I have the following questions about them:
(1) What do 257 and 1408 mean? Does 257 indicate the number of tokens per frame, and 1408 the feature dimension?
(2) Can I use only the CLS-token feature of each frame when training the model and evaluating its performance? The complete feature set is about 16 TB, and I don't have enough storage space to store all of it.
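
To make question (2) concrete, here is a minimal sketch of what I have in mind. It assumes the layout is [frames, tokens, dim] and that the CLS token sits at index 0 of the token axis; both are my guesses and not confirmed for these features. The helper name `extract_cls` and the dummy array are only for illustration.

```python
import numpy as np

def extract_cls(features: np.ndarray, cls_index: int = 0) -> np.ndarray:
    """Keep one token per frame: [frames, tokens, dim] -> [frames, dim].

    cls_index=0 is an assumption about where the CLS token is stored.
    """
    if features.ndim != 3:
        raise ValueError(f"expected [frames, tokens, dim], got shape {features.shape}")
    # Copy so the result does not keep the full array alive in memory.
    return np.ascontiguousarray(features[:, cls_index, :])

# Dummy array with the shape from the question; it stands in for one loaded feature file.
video = np.zeros((64, 257, 1408), dtype=np.float16)
cls_only = extract_cls(video)

print(cls_only.shape)                    # (64, 1408)
print(video.nbytes // cls_only.nbytes)   # 257
```

If this is valid, keeping one token out of 257 would cut the storage from about 16 TB to roughly 16 TB / 257, i.e. on the order of 60 GB.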