Closed: oooolga closed this issue 1 week ago
Hi Olga,
Yes! The provided code, by default, uses the checkpoint fine-tuned on ssv2, but it should be able to load any VideoMAE-v2 checkpoint. They do have a pre-trained VideoMAE model, which you can find here. Hope this helps!
Thanks for the swift reply!
Hi Songwei,
I've downloaded the vit_g_hybrid_pt_1200e.pth model from here. However, when I try to load it with your model loader using the following lines:
from cdfvd.third_party.VideoMAEv2.utils import load_videomae_model
self.model = load_videomae_model(torch.device(device), 'vit_g_hybrid_pt_1200e.pth')
I've received the following error:
Error(s) in loading state_dict for VisionTransformer: Missing key(s) in state_dict: "patch_embed.proj.weight", "patch_embed.proj.bias", "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.attn.q_bias", "blocks.0.attn.v_bias", "blocks.0.attn.qkv.weight", "blocks.0.attn.proj.weight", "blocks.0.attn.proj.bias", "blocks.0.norm2.weight", "blocks.0.norm2.bias", "blocks.0.mlp.fc1.weight", "blocks.0.mlp.fc1.bias", "blocks.0.mlp.fc2.weight", "blocks.0.mlp.fc2.bias", ...
Can I use the load_videomae_model function to load the vit_g_hybrid_pt_1200e.pth model, or is it only compatible with the SSv2 fine-tuned checkpoints?
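For context, a missing-keys error like the one above typically means the checkpoint's key names don't line up with the model's parameter names: a VideoMAE-v2 pretraining checkpoint stores the full encoder-decoder weights, whereas the fine-tuned VisionTransformer expects bare keys such as patch_embed.proj.weight. A minimal sketch of the kind of key remapping that reconciles the two, assuming the pretraining checkpoint prefixes encoder weights with "encoder." (an assumption about the checkpoint layout, not confirmed from the cdfvd code):

```python
def strip_encoder_prefix(state_dict):
    """Remap 'encoder.'-prefixed checkpoint keys onto the bare keys a
    fine-tuned VisionTransformer expects. Decoder weights are dropped,
    since only the encoder is needed for feature extraction.
    NOTE: the 'encoder.' prefix convention is an assumption here."""
    prefix = "encoder."
    return {k[len(prefix):]: v
            for k, v in state_dict.items()
            if k.startswith(prefix)}

# Toy example (strings stand in for weight tensors):
ckpt = {"encoder.patch_embed.proj.weight": "w", "decoder.head.weight": "d"}
print(strip_encoder_prefix(ckpt))  # {'patch_embed.proj.weight': 'w'}
```

In practice one would pass the remapped dict to model.load_state_dict; importing the matching pretraining model class (as done below) avoids the remapping entirely.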
Hi Songwei,
I've managed to resolve the loading issue with the pretrained model by modifying the load_videomae_model function and importing the pretrain_videomae_giant_patch14_224 model from cdfvd.third_party.VideoMAEv2.videomaev2_pretrain.
However, I'm now curious about the feature extraction process using the pretrained model, as described in your paper. Specifically, I'd like to know if the feature extraction was performed similarly to the following code snippet:
self.model.encoder.forward_features(videos*255, mask=...)
If so, could you please clarify what value was used for the mask parameter in this context? Was it a tensor of ones (unmasking all patches)?
Thank you for your time and assistance!
Olga
Updated question: In your paper, you mentioned that features are extracted from the pretrained VideoMAE encoder-decoder architecture by taking the output of the prelogit layer in the encoder and averaging across all patches.
Based on this description, I'm wondering if the feature extraction code for the pretrained model is similar to the following:
self.model.encoder.forward_features(videos*255, torch.zeros(videos.shape[0],2048,1408).to(torch.bool)).mean(dim=1)
Could you please confirm if this is the correct interpretation of the feature extraction process described in your paper? Or if there's any discrepancy, can you please provide the correct code snippet for feature extraction using the pretrained VideoMAE model? Thank you!
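As a sanity check on the mask shape used in the snippet above, here is a hedged back-of-the-envelope calculation showing that [batch, 2048, 1408] is consistent with ViT-giant/patch14 at 224x224 input, 16 frames, and tubelet size 2 (these hyperparameters are inferred from the checkpoint name, not taken from the repo config):

```python
# Assumed ViT-giant/patch14 hyperparameters; verify against the actual config.
img_size, patch_size = 224, 14
frames, tubelet_size = 16, 2

spatial_patches = (img_size // patch_size) ** 2   # 16 * 16 = 256 per tubelet
temporal_slots = frames // tubelet_size           # 8 tubelets
num_tokens = spatial_patches * temporal_slots     # 256 * 8 = 2048
embed_dim = 1408                                  # ViT-giant hidden size

print(num_tokens, embed_dim)  # 2048 1408
```

Under this reading, a mask of all-False (torch.zeros(...).to(torch.bool)) marks every token as unmasked, so the encoder sees the full video.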
Hi Olga, this is what I did before:
mask = torch.zeros([16, 2048, 1408]).to(torch.bool).cuda() if 'vit_g_hybrid_pt_1200e.pth' in ckpt_path else None
features = model.encoder.forward_features(input_data, mask=mask).mean(1)
stats.append_torch(features, num_gpus=1, rank=0)
It seems that the only difference is that the input range should be [0, 1] for model.encoder.forward_features?
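A minimal illustration of the input-range fix being discussed, assuming frames were decoded as uint8 values in [0, 255] (the function name here is illustrative, not from the cdfvd codebase):

```python
def to_unit_range(pixels_255):
    """Rescale 0-255 pixel values to the [0, 1] range that
    model.encoder.forward_features reportedly expects."""
    return [p / 255.0 for p in pixels_255]

print(to_unit_range([0, 128, 255]))  # [0.0, 0.50196..., 1.0]
```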
Thanks for the clarification. Super helpful and will definitely check my input range. 😀
@songweige Thanks a ton, Songwei! You've saved us from a major bug in our code. I was under the impression that the input to the VideoMAE network was 0-255 when I saw lines 133 and 155, but your comment made me realize that you had rescaled it to 0-1 here - that was a huge catch!
I have a follow-up question regarding preprocessing, related to this issue: issue link. It appears that you didn't normalize the features using the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. Am I correct?
Thanks again for all your help. We appreciate it and will definitely acknowledge your help in our project!
Hi Olga, thank you for your kind words; I think you are correct. I mainly followed this function to extract features from the VideoMAE models and hadn't checked their training code before.
It looks like they applied normalization as part of the augmentation during both pre-training and fine-tuning. It would be good to hear from the authors what the proper way to do preprocessing during inference is!
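For reference, the per-channel ImageNet normalization discussed above looks like this on a single pixel: a sketch of the standard transform applied to [0, 1] values, not a claim about what the cdfvd pipeline should do.

```python
# Standard ImageNet channel statistics, as quoted in the thread.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_rgb(rgb):
    """Apply per-channel ImageNet normalization to one (r, g, b)
    pixel already scaled to [0, 1]."""
    return tuple((v - m) / s
                 for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

print(normalize_rgb(IMAGENET_MEAN))  # (0.0, 0.0, 0.0)
```

Whether to apply this at inference time depends on what the checkpoint saw during training, which is exactly the open question above.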
Hi,
Can you confirm that the model provided in the code is the VideoMAE v2 model fine-tuned on the SSv2 dataset? Additionally, is a pre-trained (not fine-tuned) VideoMAE model available, and if so, can you provide the link?
Thank you for your help!