taoyang1122 / adapt-image-models

[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition
Apache License 2.0

Why is the second dimension n*b? #21

Closed 007invictus closed 1 year ago

007invictus commented 1 year ago

https://github.com/taoyang1122/adapt-image-models/blob/4da311e4fbe51131190bde64d8f51c2105fc95fd/mmaction/models/backbones/vit_clip.py#L80

taoyang1122 commented 1 year ago

Hi, it combines the spatial dimension with the batch-size dimension, so that the self-attention layer that follows performs attention over the temporal dimension.

007invictus commented 1 year ago

Why not (b n) t d?

taoyang1122 commented 1 year ago

Because the self-attention is applied to the first dimension, so the temporal dimension t has to come first and (b n) is treated as the batch.
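
The layout discussed above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the attention layer behaves like PyTorch's `nn.MultiheadAttention` with its default `batch_first=False` (input shaped as sequence, batch, embedding), and the sizes `b`, `t`, `n`, `d` are hypothetical values chosen only for the example. The reshape/permute below is meant to be equivalent to an einops pattern of the form 'n (b t) d -> t (b n) d'.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration: batch, frames, tokens per frame, embed dim.
b, t, n, d = 2, 4, 5, 8

torch.manual_seed(0)
# Spatial layout 'n (b t) d': tokens come first, frames are folded into the batch.
x = torch.randn(n, b * t, d)

# Temporal layout 't (b n) d': frames come first, spatial tokens are folded into the batch.
# Equivalent to einops.rearrange(x, 'n (b t) d -> t (b n) d', t=t).
xt = x.reshape(n, b, t, d).permute(2, 1, 0, 3).reshape(t, b * n, d)

# Assumption: a sequence-first attention layer (batch_first=False is the default),
# so attention runs along dim 0 and dim 1 is treated as independent batch entries.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=2)
with torch.no_grad():
    out, weights = attn(xt, xt, xt)

print(out.shape)      # (t, b*n, d): same layout as the input
print(weights.shape)  # (b*n, t, t): one t-by-t attention map per (video, spatial token)
```

With this layout each of the `b*n` batch entries is one spatial token position of one video, and its `t` frames attend to each other. Passing a tensor shaped `(b n) t d` to the same sequence-first layer would instead make it attend across the `b*n` entries, mixing videos and spatial positions, unless the layer were built with `batch_first=True`.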

007invictus commented 1 year ago

Thank you for your reply! It turns out I was careless in looking up the API definition.