showlab / EgoVLP

[NeurIPS2022] Egocentric Video-Language Pretraining
https://arxiv.org/pdf/2206.01670.pdf

number of frames per clip #4

Closed fmu2 closed 1 year ago

fmu2 commented 1 year ago

Thanks for the great work!

I am confused by how "num_frames" is set in video_params in the config files. If I understand correctly, the pre-trained Frozen model has num_frames=16, whereas only four frames are given as input to the model at training and inference time. In Table 4 of the paper, there are two entries for Frozen+EgoNCE with #frames equal to 4 and 16, respectively. I am wondering what the difference is here, and which one corresponds to the pre-trained model weights (EgoVLP_PT_BEST) available in the repository. May I still provide 16 frames instead of four to the provided model for feature extraction? Thank you!

QinghongLin commented 1 year ago

Hi @fmu2 , thanks for your interest.

In the pretraining phase, the Frozen model built with num_frames=16 can accept up to 16 frames as input, but we only feed it 4 frames (num_frames=4) because of the computation cost.
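For intuition, here is a minimal sketch (my own illustration with hypothetical names, not code from this repository) of how a temporal position table sized for 16 frames behaves when only 4-frame clips are fed in:

```python
import torch
import torch.nn as nn

# A model built with num_frames=16 keeps a 16-row temporal position table,
# but only the rows indexed by the actual clip length are used (and thus
# receive gradients) during pretraining.
class TemporalPosEmbed(nn.Module):
    def __init__(self, max_frames: int = 16, dim: int = 768):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_frames, dim))

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, T, dim) with T <= max_frames
        t = frame_tokens.size(1)
        return frame_tokens + self.pos[:, :t]

embed = TemporalPosEmbed()
clip = torch.randn(2, 4, 768)   # 4-frame clips, as used in pretraining
out = embed(clip)               # only self.pos[:, :4] is ever updated
print(out.shape)                # torch.Size([2, 4, 768])
```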

For the downstream tasks in Tab. 4, we start from the same pretrained weights, EgoVLP_PT_BEST (pretrained with 4 frames), and try two variants: fine-tuning with num_frames=4 (same as pretraining), and fine-tuning with num_frames=16 (even though the temporal positions of the extra 12 frames were not learned during pretraining). The latter gives better results.
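Concretely, the only difference between the two Tab. 4 variants is the frame count in the downstream config. Sketched as a plain dict (only the video_params/num_frames field mentioned in the question; everything else omitted):

```python
# Illustration only: the surrounding config keys are omitted.
video_params_pretrain = {"num_frames": 4}    # matches how EgoVLP_PT_BEST was pretrained
video_params_finetune = {"num_frames": 16}   # the 16-frame fine-tuning variant in Tab. 4
```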

EgoVLP_PT_BEST corresponds to Frozen + EgoNCE pretrained with 4 frames.

For offline feature extraction (e.g., NLQ), I do not recommend 16 frames, since only 4 frame positions are learned during pretraining. But if you fine-tune on a downstream task (e.g., Charades-STA or EPIC-Kitchens), 16 frames is the better choice.
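To match the 4-frame setting when extracting features, one simple way to pick frames uniformly over a clip looks like this (a hypothetical helper, not part of this codebase):

```python
import numpy as np

# Hypothetical helper (not part of the EgoVLP codebase): pick `num_frames`
# indices spread uniformly over a clip, matching the 4-frame setup used
# during pretraining when extracting offline features (e.g., for NLQ).
def uniform_frame_indices(num_total_frames: int, num_frames: int = 4) -> np.ndarray:
    # Take the center frame of `num_frames` equal-length segments.
    edges = np.linspace(0, num_total_frames, num_frames + 1)
    centers = ((edges[:-1] + edges[1:]) / 2).astype(int)
    return np.clip(centers, 0, num_total_frames - 1)

print(uniform_frame_indices(64))   # [ 8 24 40 56]
```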

Please reach out if you have other issues.

Kevin

fmu2 commented 1 year ago

Thanks for your reply!