m-bain / frozen-in-time

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [ICCV'21]
https://arxiv.org/abs/2104.00650
MIT License

What does zero-pad mean? #43

Closed Andy1621 closed 2 years ago

Andy1621 commented 2 years ago

This question may have been missed in issue #42. When increasing the number of input frames from 4 to 8, how does zero-pad work? Do the center 4 frames use the original temporal embedding while the other frames use zero padding?

m-bain commented 2 years ago

Zero-pad means the first 4 frames use the original temporal embedding, and then the remaining frames are given new temporal embeddings initialised to zero. See this function in the model.py file for the different methods. https://github.com/m-bain/frozen-in-time/blob/873c4967258eeabd88b6c0fc448e8882f95d0736/model/model.py#L115
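
To make the description above concrete, here is a minimal sketch of zero-pad inflation. It is not the repository's code: the function name zero_pad_temporal_embed and the toy 2-dimensional embeddings are hypothetical, and plain Python lists stand in for the tensors the real model uses. See the linked function in model.py for the actual implementation.

```python
def zero_pad_temporal_embed(old_embed, new_num_frames):
    """Extend a per-frame temporal embedding table with zero rows.

    old_embed: list of per-frame embedding vectors (lists of floats),
               one row per frame the checkpoint was trained with.
    new_num_frames: number of frames the new model should accept.

    Hypothetical helper for illustration only.
    """
    old_num_frames = len(old_embed)
    if old_num_frames == 0:
        raise ValueError("old_embed must contain at least one frame")
    if new_num_frames < old_num_frames:
        raise ValueError("this sketch only handles increasing the frame count")

    dim = len(old_embed[0])
    # The first old_num_frames positions keep their pretrained embeddings.
    kept = [list(vec) for vec in old_embed]
    # The remaining positions get new embeddings initialised to zero.
    padding = [[0.0] * dim for _ in range(new_num_frames - old_num_frames)]
    return kept + padding


# Toy example: a 4-frame checkpoint inflated to 8 frames.
pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
inflated = zero_pad_temporal_embed(pretrained, 8)
```

In this sketch, frames 0 to 3 keep the pretrained rows and frames 4 to 7 start at zero, so the original embeddings sit at the start of the sequence rather than in the center.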

One thing to note is that in the paper we found that the choice of temporal inflation method has little effect on performance. This is because for most tasks, text-video matching in particular, the positional embeddings barely make a difference. So unless your task really requires temporal positions, I wouldn't worry too much about the positional embeddings.