Closed Andy1621 closed 2 years ago
Zero-pad means the first 4 frames keep the original temporal embeddings, and the remaining frames are given new temporal embeddings initialised to zero. See this function in model.py for the different methods:
https://github.com/m-bain/frozen-in-time/blob/873c4967258eeabd88b6c0fc448e8882f95d0736/model/model.py#L115
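As a rough illustration of the zero-pad behaviour described above, here is a minimal sketch. It is not the repository's code: the function name `zero_pad_temporal_embed` and the assumed embedding shape `(1, num_frames, embed_dim)` are hypothetical and chosen only for this example, so check the linked function for the actual implementation.

```python
import torch


def zero_pad_temporal_embed(old_embed: torch.Tensor, new_num_frames: int) -> torch.Tensor:
    """Hypothetical helper: extend a temporal embedding by zero-padding.

    Assumes old_embed has shape (1, old_num_frames, embed_dim).
    """
    _, old_num_frames, embed_dim = old_embed.shape
    if new_num_frames < old_num_frames:
        raise ValueError("new_num_frames must be >= the number of pretrained frames")

    # Start from all zeros for every frame position in the new model.
    new_embed = torch.zeros(1, new_num_frames, embed_dim, dtype=old_embed.dtype)
    # The FIRST old_num_frames positions (not the center ones) reuse the pretrained values.
    new_embed[:, :old_num_frames] = old_embed
    return new_embed


# Going from 4 to 8 frames: positions 0-3 are copied, positions 4-7 start at zero.
pretrained = torch.randn(1, 4, 768)
inflated = zero_pad_temporal_embed(pretrained, 8)
```

In this sketch the padded positions are only zero at initialisation; if the embedding is a trainable parameter, they would be updated during fine-tuning like any other weight.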
One thing to note: in the paper we found that the choice of temporal inflation method had little effect on performance. This is because for most tasks (text-video matching in particular) the positional embeddings barely make a difference. So unless your task really requires temporal positions, I wouldn't worry too much about the positional embeddings.
This question may have been missed in issue #42: when increasing the number of input frames from 4 to 8, how does zero-pad work? Do the center 4 frames use the original temporal embedding, while the other frames are zero-padded?