Closed liangying-Ke closed 1 year ago
Hello,
We feed 3-dimensional feature tensors of shape (L, B, C) (temporal length, batch, channels) to the classifier. The Linear modules act only on the last dimension, as if the input were 2-dimensional with shape (L*B, C). It is like an implicit .view(L*B, C), so the output is still 3-dimensional. This is why we permute to (B, C, L) and then apply pool1d over the temporal dimension L, followed by an additional softmax pooling when we are in eval mode with long videos (as opposed to short training clips).
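To make the shapes concrete, here is a minimal sketch of that sequence in PyTorch. The sizes, the single `nn.Linear` classifier, and pooling over the whole temporal length with `kernel_size=L` are assumptions for illustration only, not the repository's actual configuration; the eval-mode softmax pooling is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes, chosen only for this illustration.
L, B, C_in, n_classes = 8, 2, 16, 5

classifier = nn.Linear(C_in, n_classes)
feats = torch.randn(L, B, C_in)  # (L, B, C)

# nn.Linear only transforms the last dimension, so a 3-D input stays 3-D.
out3d = classifier(feats)  # (L, B, n_classes)

# Same result as flattening to (L*B, C) explicitly and reshaping back.
out2d = classifier(feats.view(L * B, C_in)).view(L, B, n_classes)
same = torch.allclose(out3d, out2d, atol=1e-6)

# Move time to the last axis, then max-pool over it.
scores = out3d.permute(1, 2, 0)  # (B, n_classes, L)
pooled = F.max_pool1d(scores, kernel_size=L).squeeze(-1)  # (B, n_classes)

print(out3d.shape, scores.shape, pooled.shape, same)
```

Because the Linear output keeps its leading (L, B) dimensions, the permute is valid, and the pooling is what collapses the per-timestep scores into one score vector per clip.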
Do you get an actual error when running the model, or did you notice this just from reading the code? If you get an error, can you check the dimensions of your inputs to the vgg model? I commented the code with the feature dimensions at each step of the model, from the (B, C, L, H, W) input to the (B, C) output scores. L is the temporal dimension; H x W are the spatial dimensions.
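If it helps with checking dimensions, one possible way to print the shapes at each step is a forward hook. This is only a sketch on a hypothetical toy model (`toy`, a single `Conv3d` plus `ReLU`), not the vgg model from this repository; the same hooks could be registered on the real modules.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real network, for illustration only.
toy = nn.Sequential(nn.Conv3d(3, 4, kernel_size=3, padding=1), nn.ReLU())

shapes = []

def record_shape(module, inputs, output):
    # inputs is a tuple of positional arguments; record the first one.
    shapes.append(
        (module.__class__.__name__, tuple(inputs[0].shape), tuple(output.shape))
    )

handles = [m.register_forward_hook(record_shape) for m in toy]

x = torch.randn(2, 3, 4, 8, 8)  # (B, C, L, H, W)
with torch.no_grad():
    toy(x)

# Remove the hooks once the shapes have been collected.
for h in handles:
    h.remove()

for name, in_shape, out_shape in shapes:
    print(name, in_shape, "->", out_shape)
```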
Hi, thanks for your implementation. However, there seems to be a bug in the forward function of the model: because the features have only two dimensions after the Linear module, you cannot use permute to rearrange the dimensions.
Is the max_pool1d necessary? Maybe the output of the Linear module is already almost the scores?