sunilhoho / EVEREST

Official PyTorch implementation of EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [ICML 2024].
https://arxiv.org/abs/2211.10636

Subsampling tokens during inference #1

Closed by wren93 1 year ago

wren93 commented 1 year ago

Hi, thanks for sharing this very interesting work. I have a question about fine-tuning on downstream datasets. The paper mentions that you use 60% of the tokens for fine-tuning, selected based on the motion heatmap. Do you also process only 60% of the tokens during inference, or do you use all the tokens at inference time? Thanks in advance.

sunilhoho commented 1 year ago

Hi! Thanks for your interest in our work! For video action recognition, we use only 60% of the tokens during both fine-tuning and inference.
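
For readers who want to see what keeping 60% of the tokens amounts to, here is a minimal, framework-free sketch. It is not the repository's code: the function name `select_tokens`, the per-token `scores` list (a stand-in for whatever importance the motion heatmap assigns), and the rounding rule are all assumptions made for illustration. The point is only that the same `keep_ratio` is applied at fine-tuning and at inference.

```python
def select_tokens(scores, keep_ratio=0.6):
    """Return the indices of the highest-scoring tokens, in original order.

    scores: one importance value per token (hypothetical stand-in for
            the motion heatmap described in the paper).
    keep_ratio: fraction of tokens to keep; 0.6 matches the thread above.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Number of tokens to keep; rounding to the nearest integer is an assumption.
    num_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order for the kept subset.
    return sorted(ranked[:num_keep])


# Toy example with 10 tokens: 6 of them are kept.
scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6, 0.4, 0.05]
kept = select_tokens(scores)
print(kept)
```

In a real model the kept indices would be used to gather the corresponding token embeddings before they enter the encoder, so the encoder processes 60% of the sequence in both phases.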