taoyang1122 / adapt-image-models

[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition
Apache License 2.0

The dimension of vit_clip and vit_imagenet #38

Open wlsrick opened 1 year ago

wlsrick commented 1 year ago

Hello, I want to ask about the input dimension before the multi-head attention in vit_clip.py versus vit_imagenet.py. In vit_clip.py, the input shape before T-MSA is t (b n) d, but in vit_imagenet.py it is (b n) t d. The paper describes it as (N+1) x T x D. Which one is correct? Thanks a lot.

taoyang1122 commented 1 year ago

Hi, they are the same. The difference is that the self-attention implementation differs between the CLIP model code and the ViT code, but both perform self-attention over the T dimension. You may check their implementation details.
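To illustrate why the two layouts are equivalent: a sketch in plain PyTorch (hypothetical sizes; the einops patterns in the comments mirror the shapes in the question). CLIP's attention expects sequence-first input like PyTorch's default `nn.MultiheadAttention`, while timm-style ViT attention is batch-first, so the same tensor is just arranged differently before T-MSA:

```python
import torch

# Hypothetical sizes: batch b, spatial tokens n (incl. class token), frames t, embed dim d
b, n, t, d = 2, 4, 8, 16
x = torch.randn(b, n, t, d)

# CLIP-style layout 't (b n) d': sequence-first, as nn.MultiheadAttention
# expects by default (seq_len, batch, embed). T is the attention axis.
x_clip = x.permute(2, 0, 1, 3).reshape(t, b * n, d)

# timm-style layout '(b n) t d': batch-first, as timm's Attention expects
# (batch, seq_len, embed). T is again the attention axis.
x_timm = x.reshape(b * n, t, d)

# The two layouts differ only by a transpose of the batch and sequence axes,
# so attention over T sees the same frames in either case.
assert torch.equal(x_clip.transpose(0, 1), x_timm)
```

In both layouts, the (b n) axis is treated as the batch, so attention mixes information only across the t frames, which is exactly the temporal adaptation described in the paper.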

wlsrick commented 1 year ago

OK. Got it. Thanks a lot~