Open wlsrick opened 1 year ago
Hi, they are the same. The difference is only that the self-attention implementation differs between the CLIP model code and the ViT code, but both perform self-attention along the T dimension. You may check their implementation details.
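For completeness, here is a minimal sketch (not the repo's actual code; it assumes `torch.nn.MultiheadAttention` and `einops`) of why the two layouts are equivalent: in both cases the spatial tokens are folded into the batch dimension and attention runs along T, so `t (b n) d` fed to a sequence-first attention and `(b n) t d` fed to a batch-first one give the same result.

```python
import torch
import torch.nn as nn
from einops import rearrange

torch.manual_seed(0)
B, T, N, D = 2, 8, 197, 768                     # batch, frames, tokens (N+1), embed dim
x = torch.randn(B, T, N, D)
attn = nn.MultiheadAttention(D, num_heads=12)   # expects (seq, batch, dim)

# vit_clip.py-style layout: T is the sequence axis, tokens folded into batch.
x_clip = rearrange(x, 'b t n d -> t (b n) d')
out_clip, _ = attn(x_clip, x_clip, x_clip)
out_clip = rearrange(out_clip, 't (b n) d -> b t n d', b=B)

# vit_imagenet.py-style layout: same tensor arranged batch-first as (b n) t d.
# A batch-first attention would consume it directly; here we just transpose.
x_vit = rearrange(x, 'b t n d -> (b n) t d')
q = x_vit.transpose(0, 1)
out_vit, _ = attn(q, q, q)
out_vit = rearrange(out_vit.transpose(0, 1), '(b n) t d -> b t n d', b=B)

print(torch.allclose(out_clip, out_vit, atol=1e-6))  # True: same temporal attention
```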
OK. Got it. Thanks a lot~
Hello, I want to ask about the input dimension before multi-head attention in vit_clip.py versus vit_imagenet.py. In vit_clip.py the input shape before T-MSA is t (b n) d, but in vit_imagenet.py it is (b n) t d. The paper describes it as (N+1) x T x D. So which one is correct? Thanks a lot.