songweige / content-debiased-fvd

[CVPR 2024] On the Content Bias in Fréchet Video Distance
https://content-debiased-fvd.github.io/
MIT License

Request for Download Link for VideoMAEv2 Pretraining Model Checkpoint #8

Closed oooolga closed 1 week ago

oooolga commented 2 weeks ago

Hi,

Can you confirm that the model provided in the code is the VideoMAE v2 model fine-tuned on the SSv2 dataset? Additionally, is a pre-trained (not fine-tuned) VideoMAE model available, and if so, can you provide the link?

Thank you for your help!

songweige commented 2 weeks ago

Hi Olga,

Yes! The provided code uses the SSv2 fine-tuned checkpoint by default, but it should be able to load any VideoMAE-v2 checkpoint. They do have a pre-trained VideoMAE model, which you can find here. Hope this helps!
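
For reference, a minimal call with a custom checkpoint might look like the sketch below (the checkpoint path is a placeholder; only the two-argument load_videomae_model call is taken from this thread):

import torch
from cdfvd.third_party.VideoMAEv2.utils import load_videomae_model

# Placeholder path: per the comment above, any VideoMAE-v2 checkpoint should be loadable here.
model = load_videomae_model(torch.device('cuda'), 'path/to/videomae_v2_checkpoint.pth')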

oooolga commented 2 weeks ago

Thanks for the swift reply!

oooolga commented 1 week ago

Hi Songwei,

I've downloaded the vit_g_hybrid_pt_1200e.pth checkpoint from here. However, when I try to load it with your model loader using the following lines:

import torch  # needed for torch.device below
from cdfvd.third_party.VideoMAEv2.utils import load_videomae_model

self.model = load_videomae_model(torch.device(device), 'vit_g_hybrid_pt_1200e.pth')

I received the following error:

Error(s) in loading state_dict for VisionTransformer: Missing key(s) in state_dict: "patch_embed.proj.weight", "patch_embed.proj.bias", "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.attn.q_bias", "blocks.0.attn.v_bias", "blocks.0.attn.qkv.weight", "blocks.0.attn.proj.weight", "blocks.0.attn.proj.bias", "blocks.0.norm2.weight", "blocks.0.norm2.bias", "blocks.0.mlp.fc1.weight", "blocks.0.mlp.fc1.bias", "blocks.0.mlp.fc2.weight", "blocks.0.mlp.fc2.bias", ...

Can I use the load_videomae_model function to load the vit_g_hybrid_pt_1200e.pth model, or is it only compatible with SSv2 fine-tuned checkpoints?

oooolga commented 1 week ago

Hi Songwei,

I've managed to resolve the loading issue of the pretrained models by modifying the load_videomae_model function and importing the pretrain_videomae_giant_patch14_224 model from cdfvd.third_party.VideoMAEv2.videomaev2_pretrain.
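
Roughly, the modified loader looks like the sketch below (simplified; whether the builder takes arguments and whether the checkpoint nests its weights under a 'model' key are assumptions here):

import torch
from cdfvd.third_party.VideoMAEv2.videomaev2_pretrain import pretrain_videomae_giant_patch14_224

def load_videomae_pretrain(device, ckpt_path='vit_g_hybrid_pt_1200e.pth'):
    # Build the pre-training (encoder-decoder) architecture instead of the fine-tuned VisionTransformer.
    model = pretrain_videomae_giant_patch14_224()
    ckpt = torch.load(ckpt_path, map_location='cpu')
    state_dict = ckpt.get('model', ckpt)  # assumption: weights may be nested under a 'model' key
    model.load_state_dict(state_dict)
    return model.eval().to(device)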

However, I'm now curious about the feature extraction process using the pretrained model, as described in your paper. Specifically, I'd like to know if the feature extraction was performed similarly to the following code snippet: self.model.encoder.forward_features(videos*255, mask=...)

If so, could you please clarify what value was used for the mask parameter in this context? Was it a tensor of ones (unmasking all patches)?

Thank you for your time and assistance!

Olga

oooolga commented 1 week ago

Updated question: In your paper, you mentioned that features are extracted from the pretrained VideoMAE encoder-decoder architecture by taking the output of the prelogit layer in the encoder and averaging across all patches.

Based on this description, I'm wondering if the feature extraction code for the pretrained model is similar to the following: self.model.encoder.forward_features(videos*255, torch.zeros(videos.shape[0],2048,1408).to(torch.bool)).mean(dim=1)

Could you please confirm if this is the correct interpretation of the feature extraction process described in your paper? Or if there's any discrepancy, can you please provide the correct code snippet for feature extraction using the pretrained VideoMAE model? Thank you!

songweige commented 1 week ago

Hi Olga, this is what I did before:

# Use an all-False mask (no patches masked) only for the pre-trained checkpoint; the fine-tuned model takes no mask.
mask = torch.zeros([16, 2048, 1408]).to(torch.bool).cuda() if 'vit_g_hybrid_pt_1200e.pth' in ckpt_path else None
# Average the encoder features over all patch tokens before accumulating the FVD statistics.
features = model.encoder.forward_features(input_data, mask=mask).mean(1)
stats.append_torch(features, num_gpus=1, rank=0)

It seems that the only difference is that the input range should be [0, 1] for the function model.encoder.forward_features?
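
Putting that together, a corrected call for the pre-trained checkpoint would look roughly like this (a sketch, assuming videos is a float tensor already scaled to [0, 1] and on the GPU):

# videos: float tensor in [0, 1]; no *255 rescaling before forward_features.
mask = torch.zeros([videos.shape[0], 2048, 1408]).to(torch.bool).cuda()
features = model.encoder.forward_features(videos, mask=mask).mean(1)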

oooolga commented 1 week ago

Thanks for the clarification. Super helpful and will definitely check my input range. 😀

oooolga commented 1 week ago

@songweige Thanks a ton, Songwei! You've saved us from a major bug in our code. Seeing lines 133 and 155, I was under the impression that the input to the VideoMAE network was 0-255. However, your comment made me realize that you had rescaled it to 0-1 here - that was a huge catch!

I have a follow-up question regarding preprocessing, related to this issue. It appears that you didn't normalize the inputs with the ImageNet mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. Am I correct?
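
For reference, the normalization in question would look roughly like this (a sketch, assuming videos is a float tensor in [0, 1] with shape [B, C, T, H, W]):

import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1, 1)
videos_normalized = (videos - mean) / std  # per-channel ImageNet-style normalization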

Thanks again for all your help. We appreciate it and will definitely acknowledge your help in our project!

songweige commented 1 week ago

Hi Olga, thank you for your kind words, and I think you are correct. I mainly followed this function to extract the features from the VideoMAE models and didn't check their training code before.

It looks like they did normalization as part of the augmentation during both training and fine-tuning. It would be good to hear from the authors what the proper preprocessing is at inference time!