lukemelas / PyTorch-Pretrained-ViT

Vision Transformer (ViT) in PyTorch

Questions about your code compared with the original code #15

Open elk-april opened 3 years ago

elk-april commented 3 years ago

Hi, I noticed a difference between the two implementations. Your code:

x = self.positional_embedding(x)  # b,gh*gw+1,d 
x = self.transformer(x)  # b,gh*gw+1,d

Vision Transformer (from https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py):

x += self.pos_embedding[:, :(n + 1)]
x = self.dropout(x)
x = self.transformer(x)

Actually, there are two differences (a minimal side-by-side sketch follows this list):

  1. you do not apply dropout after the positional embedding
  2. in the original code, the positional embedding is not applied to the classification token
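For reference, a minimal sketch of the two variants as I understand them; the module names (AddPosEmbedWithDropout, AddPosEmbedNoDropout) and the dropout rate are my own placeholders, not names taken from either repository:

import torch
import torch.nn as nn

class AddPosEmbedWithDropout(nn.Module):
    # variant as in lucidrains/vit-pytorch: add the positional embedding, then dropout
    def __init__(self, seq_len, dim, dropout=0.1):
        super().__init__()
        self.pos_embedding = nn.Parameter(torch.randn(1, seq_len, dim))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                       # x: (b, n+1, d), class token already prepended
        n = x.shape[1]
        x = x + self.pos_embedding[:, :n]       # positional embedding added to all n tokens
        return self.dropout(x)                  # dropout applied right after the addition

class AddPosEmbedNoDropout(nn.Module):
    # variant as in this repo: learned positional embedding added directly, no dropout
    def __init__(self, seq_len, dim):
        super().__init__()
        self.pos_embedding = nn.Parameter(torch.zeros(1, seq_len, dim))

    def forward(self, x):                       # x: (b, gh*gw+1, d)
        return x + self.pos_embedding           # b, gh*gw+1, d

Both modules would be called on the token sequence right before the transformer blocks; the only behavioural differences are the extra dropout and how many positions the embedding is sliced to cover.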

Could you please tell me the reasons for these changes? Looking forward to your reply, thanks very much.