openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

Dimension Discrepancy in VisionTransformer? #420

Open Shiran-Yuan opened 7 months ago

Shiran-Yuan commented 7 months ago

In line 227 of clip/model.py there is:

x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1) # shape = [*, grid ** 2 + 1, width].

But this seems impossible, because self.class_embedding is a parameter vector (1-D), while torch.zeros(x.shape[0], 1, x.shape[-1]) is an order-3 tensor...

Am I missing something here?
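For reference, here is a minimal standalone snippet reproducing the shapes involved, so the behavior can be checked directly. The batch size, grid, and width values below are just illustrative (roughly matching ViT-B/32); they are not taken from the issue.

import torch

# Illustrative values only; for ViT-B/32, width = 768 and grid = 224 // 32 = 7.
batch, grid, width = 2, 7, 768

# x as it looks just before line 227: shape [batch, grid ** 2, width]
x = torch.randn(batch, grid ** 2, width)

# class_embedding is a 1-D parameter of shape [width]
class_embedding = torch.nn.Parameter(torch.randn(width))

# The expression from line 227: a 1-D [width] parameter added to a
# [batch, 1, width] zeros tensor, then concatenated with x along dim=1.
cls = class_embedding.to(x.dtype) + torch.zeros(
    x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
)
x = torch.cat([cls, x], dim=1)

print(x.shape)  # torch.Size([2, 50, 768]) -> [batch, grid ** 2 + 1, width]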