007invictus closed this issue 1 year ago
Hi, it combines the spatial dimension with the batch-size dimension and performs self-attention over the temporal dimension in the following self-attention layer.
Why not `(b n) t d`?
Because the self-attention is applied along the first dimension.
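For readers who land here later, here is a minimal sketch of the point being made. It assumes the tokens arrive as `(b t) n d` (frames stacked on the batch axis) and that the attention module is PyTorch's `nn.MultiheadAttention`, which defaults to sequence-first input `(seq_len, batch, dim)`; the shape names `b, t, n, d` are illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn
from einops import rearrange

b, t, n, d = 2, 8, 197, 768      # batch, frames, spatial tokens, channels
x = torch.randn(b * t, n, d)     # per-frame ViT tokens, frames stacked on batch

# Fold the spatial dimension n into the batch dimension and put time first:
# each of the (b * n) "sequences" now runs over the t frames.
xt = rearrange(x, '(b t) n d -> t (b n) d', b=b, t=t)

attn = nn.MultiheadAttention(embed_dim=d, num_heads=12)  # batch_first=False by default
out, _ = attn(xt, xt, xt)        # self-attention mixes along dim 0, i.e. the t frames

# Restore the original layout afterwards.
out = rearrange(out, 't (b n) d -> (b t) n d', b=b, n=n)
```

With the `(b n) t d` layout instead, a sequence-first attention module would treat `(b n)` as the sequence axis, so attention would mix across batch and spatial tokens rather than across time, which is why the temporal axis has to come first here.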
Thank you for your reply! It turns out I was careless when reading the API definition.
https://github.com/taoyang1122/adapt-image-models/blob/4da311e4fbe51131190bde64d8f51c2105fc95fd/mmaction/models/backbones/vit_clip.py#L80