sming256 / OpenTAD

OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.
Apache License 2.0

question about `scale_factor` in AdaTAD #7

Closed: izikgo closed this issue 2 months ago

izikgo commented 2 months ago

I noticed that for ActivityNet you use scale_factor = 4 to account for the ViT backbone downsampling, but scale_factor = 1 for THUMOS, although it uses the same backbone. Can you please explain the logic?

sming256 commented 2 months ago

This scale factor relates to the number of frames used in the ViT backbone. Concretely, the actual frame number = window size * scale_factor.

Here, the window size is the number of features you will have after the backbone, which are then sent into the detector head. On ActivityNet with ActionFormer, that is 192 features.

To get these 192 features, you can of course sample 192 frames from the video and get 192 features (assuming there is no temporal downsampling in the backbone). However, experience with offline feature extraction shows that more frames yield stronger video features. Therefore, you can instead sample 192 * 4 = 768 frames from the same video, get 768 features, and then resize them to 192 features, leading to stronger performance. This 4 is the scale factor: it says how many times more frames you are sampling.
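A minimal PyTorch sketch of this sample-more-then-resize idea (the backbone is faked with random features, and the variable names follow this discussion rather than OpenTAD's actual code):

```python
import torch
import torch.nn.functional as F

window_size = 192   # features fed to the detector head (ActivityNet example)
scale_factor = 4    # how many times more frames to sample than window_size

num_frames = window_size * scale_factor  # 768 frames sampled from the video

# Stand-in for the ViT backbone output: one C-dim feature per frame,
# assuming no temporal downsampling inside the backbone.
C = 512
features = torch.randn(1, C, num_frames)  # (B, C, T) = (1, 512, 768)

# Resize the temporal dimension back to the window size before the head.
features = F.interpolate(features, size=window_size, mode="linear", align_corners=False)
print(features.shape)  # torch.Size([1, 512, 192])
```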

Based on the above, you can see that the scale_factor has an effect similar to changing the input video fps: the larger the scale_factor, the higher the effective fps, and hence the more frames for the same duration.

On THUMOS, the window size is 768, so we have 768 frames when the scale_factor is 1. If you want 1536 frames, you can change the scale_factor to 2, which leads to better performance, but of course the memory usage will also be larger.
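As a quick check of the frame-count arithmetic, using the numbers from this thread:

```python
# actual frames fed to the backbone = window_size * scale_factor
settings = {
    "ActivityNet":  {"window_size": 192, "scale_factor": 4},  # 768 frames
    "THUMOS":       {"window_size": 768, "scale_factor": 1},  # 768 frames
    "THUMOS (x2)":  {"window_size": 768, "scale_factor": 2},  # 1536 frames, more memory
}
for name, cfg in settings.items():
    print(name, cfg["window_size"] * cfg["scale_factor"])
```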

izikgo commented 2 months ago

Thank you for the explanation. When you say "resize to 192 features", what exactly does that mean? Is the interpolation done at the feature-vector level? Also, in the vision transformer's PatchEmbed layer, the stride in the temporal dimension is 2 (the tubelet size). Isn't this effectively downsampling by a factor of 2 in the backbone?

sming256 commented 2 months ago

Resize means we use F.interpolate to resize the feature's T dimension from 768 to 192. Check here.

The above explanation is an example assuming there is no temporal downsampling in the backbone. In VideoMAE, the temporal downsampling stride is 2. Therefore, given 768 frames, the backbone outputs a feature of shape B x C x 384, and we resize the T dimension to 192.
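A sketch of that resize step under the stride-2 assumption (F.interpolate is the call mentioned above; the channel count C = 768 is chosen arbitrarily for illustration):

```python
import torch
import torch.nn.functional as F

frames = 768          # sampled frames (window_size 192 * scale_factor 4)
temporal_stride = 2   # VideoMAE halves T via its tubelet embedding
B, C = 1, 768

# Backbone output: the temporal dimension is already downsampled by 2.
feat = torch.randn(B, C, frames // temporal_stride)  # (1, 768, 384)

# Resize the T dimension to the detector's window size.
feat = F.interpolate(feat, size=192, mode="linear", align_corners=False)
print(feat.shape)  # torch.Size([1, 768, 192])
```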

izikgo commented 2 months ago

Understood. Thank you!