I understand that in the original paper, the authors apply the double attention block to video data. From reading the paper, I understand how to apply the double attention block between 2D conv layers, such that higher-level features are weighted and combined with lower-level features.
I can't figure out how this would apply to a 5D temporal input -- (Batch, Time, Height, Width, Channels). I understand that the first step, feature gathering, involves a dimension reduction via 1x1 convolutions, a softmax, and bilinear pooling. Should the data be reshaped to (B, H, W, C*T)? That's my reading of the paper -- "where each b is a dhw-dimensional row vector" -- it seems the output of the gathering stage has size d x h x w and doesn't incorporate the input channel count, since the conv is 1x1x1.
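To make my question concrete, here is a rough NumPy sketch of how I currently read the gather/distribute steps for a 5D input: the 1x1x1 convs collapse to per-location channel projections, and all T*H*W spatio-temporal positions are flattened into a single location axis rather than folding T into the channels. The weight shapes (w_theta, w_phi, w_g, w_out) and the descriptor counts m and k are my assumptions, not from the paper.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def double_attention_5d(x, w_theta, w_phi, w_g, w_out):
    """Sketch of a double attention block on a 5D video tensor.

    x: (B, C, T, H, W). The 1x1x1 convolutions are expressed as
    channel projections (einsum over the channel axis), applied
    identically at every spatio-temporal position, so the block
    treats all T*H*W positions as one flattened location axis.
    Weight shapes (assumed): w_theta (m, C), w_phi (k, C),
    w_g (k, C), w_out (C, m).
    """
    B, C, T, H, W = x.shape
    n = T * H * W
    flat = x.reshape(B, C, n)                     # flatten locations: (B, C, THW)
    A  = np.einsum('mc,bcn->bmn', w_theta, flat)  # feature maps: (B, m, THW)
    Bm = np.einsum('kc,bcn->bkn', w_phi, flat)    # attention maps: (B, k, THW)
    Bm = softmax(Bm, axis=-1)                     # softmax over locations
    G  = np.einsum('bmn,bkn->bmk', A, Bm)         # gather (bilinear pooling): (B, m, k) global descriptors
    V  = np.einsum('kc,bcn->bkn', w_g, flat)      # distribution weights: (B, k, THW)
    V  = softmax(V, axis=1)                       # softmax over the k descriptors
    Z  = np.einsum('bmk,bkn->bmn', G, V)          # distribute: (B, m, THW)
    out = np.einsum('cm,bmn->bcn', w_out, Z)      # project back to C channels
    return x + out.reshape(B, C, T, H, W)         # residual connection
```

Under this reading, the gathered descriptors G have shape (B, m, k) regardless of T, H, or W, which is what I'd expect if time is just another location dimension -- but I'm not sure that matches the paper's intent.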
Thoughts?