jeonsworld / ViT-pytorch

Pytorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
MIT License

Not able to load ViT-H_14 #35

Open abhijay9 opened 3 years ago

abhijay9 commented 3 years ago

I was testing with the provided visualize_attention_map.ipynb.

ViT-B_16-224 loads fine, but when I downloaded ViT-H_14 and tried to load it, I got the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-36-0b02f0ab326a> in <module>
      2 config = CONFIGS["ViT-H_14"]
      3 model = VisionTransformer(config, num_classes=1000, zero_head=False, img_size=224, vis=True)
----> 4 model.load_from(np.load("imagenet21k_ViT-H_14.npz"))
      5 model.eval()

~/Documents/clones/ViT-pytorch/models/modeling.py in load_from(self, weights)
    287                 nn.init.zeros_(self.head.bias)
    288             else:
--> 289                 self.head.weight.copy_(np2th(weights["head/kernel"]).t())
    290                 self.head.bias.copy_(np2th(weights["head/bias"]).t())
    291 

RuntimeError: The size of tensor a (1000) must match the size of tensor b (21843) at non-singleton dimension 0

What do you think might be causing this error?

sajjad2014 commented 2 years ago

Hello. This error occurs because you are loading the imagenet21k weights rather than a fine-tuned checkpoint, so you need to set num_classes to 21843. Note that a fine-tuned version of the ViT-H model is not listed among the available models, so if you need a model fine-tuned on ImageNet-1k, use one of the models listed on the front page of this repository.
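To make the shape mismatch concrete, here is a minimal, self-contained sketch (using a toy hidden size, not the repo's actual model code) of why `copy_` fails when the head sizes differ and succeeds when they match. For the notebook itself, the fix is simply to pass `num_classes=21843` to `VisionTransformer` when loading the imagenet21k `.npz` checkpoint.

```python
import torch

hidden = 8  # toy hidden size for illustration only

# Checkpoint head weights shaped like the imagenet21k classifier (21843 classes).
ckpt_kernel = torch.randn(21843, hidden)

# A 1000-class head, as created with num_classes=1000: copy_ raises the
# "size of tensor a (1000) must match ... tensor b (21843)" RuntimeError.
head_1k = torch.nn.Linear(hidden, 1000)
mismatch_raised = False
try:
    with torch.no_grad():
        head_1k.weight.copy_(ckpt_kernel)
except RuntimeError:
    mismatch_raised = True

# A 21843-class head, as created with num_classes=21843: copy_ succeeds.
head_21k = torch.nn.Linear(hidden, 21843)
with torch.no_grad():
    head_21k.weight.copy_(ckpt_kernel)
```

The same logic applies inside `load_from`: the checkpoint's `head/kernel` has 21843 output rows, so the model's head must be built with a matching `num_classes`.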