huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

VideoMAE missing CLS tokens in embedding #21016

Closed z5163449 closed 1 year ago

z5163449 commented 1 year ago

System Info

I'm not sure if I've missed something in the code, but I can't seem to find where the CLS token is added. I have input data of shape (64, 45, 2, 32, 32) with tubelet_size = 5 and patch_size = 4. This results in a sequence length of 576, which, from my understanding, is the total number of tubelets. I see that after the data is passed through the embedding layer, the final embedding shape is (64, 576, 768), where 768 is the hidden size. However, shouldn't the dimensions be (64, 577, 768), since we should be adding a CLS token to the sequence?
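
For reference, this is how I arrive at 576 from those values (a quick sanity-check sketch, variable names are just illustrative):

# sequence length implied by the config values above
num_frames, tubelet_size = 45, 5
image_size, patch_size = 32, 4

num_tubelets_in_time = num_frames // tubelet_size          # 45 // 5 = 9
num_patches_per_frame = (image_size // patch_size) ** 2    # (32 // 4) ** 2 = 64
seq_length = num_tubelets_in_time * num_patches_per_frame  # 9 * 64 = 576
print(seq_length)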

It would be great to hear back soon, because I'm not sure if I'm wrong or if there is something wrong with the code.

Thanks! @NielsRogge

Reproduction

import torch
from transformers import VideoMAEConfig, VideoMAEModel

pixel_values = torch.randn(1, 45, 2, 32, 32)

config = VideoMAEConfig()
config.num_frames = 45
config.image_size = 32
config.patch_size = 4
config.tubelet_size = 5
config.num_channels = 2

# expected number of tokens: (num_frames // tubelet_size) * (image_size // patch_size) ** 2
num_patches_per_frame = (config.image_size // config.patch_size) ** 2
seq_length = (config.num_frames // config.tubelet_size) * num_patches_per_frame
print(seq_length)

videomae = VideoMAEModel(config)
output = videomae(pixel_values, output_hidden_states=True)
sequence_output = output[0]
print(sequence_output.shape)

Expected behavior

seq_length = 576
sequence_output.shape = (1, 577, 768)

The embedding sequence length should be the total number of tubelets + 1.

NielsRogge commented 1 year ago

Hi,

VideoMAE doesn't use a CLS token, so this can be fixed in the docstring. The number of tokens sent through the Transformer equals (number of frames // tubelet_size) * (height // patch_size) * (width // patch_size).

For video classification, the authors average pool the final hidden states of the tokens before applying a final classification head.
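
In other words, the classification part looks roughly like this (a minimal sketch of mean pooling plus a linear head, not the exact layer names used in the library; num_labels is just an example):

import torch
import torch.nn as nn

batch_size, seq_length, hidden_size = 1, 576, 768
num_labels = 400  # hypothetical number of classes, e.g. Kinetics-400

# final hidden states coming out of the Transformer encoder
sequence_output = torch.randn(batch_size, seq_length, hidden_size)

pooled = sequence_output.mean(dim=1)             # average-pool over all tokens -> (batch, hidden_size)
classifier = nn.Linear(hidden_size, num_labels)  # final classification head
logits = classifier(pooled)                      # (batch, num_labels)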

Do you mind opening a PR to fix this docstring?

avocardio commented 1 year ago

@NielsRogge Hi, sorry for coming back to this, and this may be a more general question, but why would the authors use the final hidden states of the model (which would more closely resemble the inputs again) instead of an intermediate state? I know the shapes are fixed and it's not a compressing autoencoder, but why the last hidden state?

NielsRogge commented 1 year ago

People typically use the last hidden states of Transformer-based models as features for classification layers. One of the first papers that did this was BERT.
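
For example, BERT takes the last hidden state of the [CLS] token as the sequence-level feature, whereas VideoMAE mean-pools all tokens. A rough sketch with the transformers API (the checkpoint name is just an example):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a video of a cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_feature = outputs.last_hidden_state[:, 0]  # last hidden state of the [CLS] token, shape (1, 768)
# a classification layer (e.g. nn.Linear) would then be applied on top of this feature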