lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
MIT License

The Total params and Params size (MB) printed by summary differ from the vit_base model in the timm library. Theoretically, the same settings should give the same counts. What is the reason? #329

Open lucker26 opened 2 months ago

lucker26 commented 2 months ago

```python
import torch
import timm
from torchsummary import summary
from vit_pytorch import ViT

v = ViT(
    image_size = 224,
    patch_size = 16,
    num_classes = 1000,
    dim = 768,
    depth = 12,
    heads = 12,
    mlp_dim = 3072,
    dropout = 0.1,
    emb_dropout = 0.1
)

# Display the model summary; the input shape is (C, H, W)
summary(v, input_size=(3, 224, 224), device='cpu')

# Load the ViT-B/16 model
model = timm.create_model('vit_base_patch16_224', pretrained=False)

# Print the model summary
summary(model, input_size=(3, 224, 224))
```
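For reference, the parameter count of a standard ViT-B/16 can be worked out by hand. The sketch below tallies the budget under the usual assumptions (a class token, a fused qkv projection with bias, two LayerNorms per block, a final LayerNorm, and a biased classification head); under those assumptions it reproduces the 86.6M figure commonly reported for timm's `vit_base_patch16_224`. Any implementation that deviates from these structural choices will report a different total.

```python
# Parameter budget for a standard ViT-B/16 (assumptions noted in comments).
dim, depth, mlp_dim = 768, 12, 3072
patch, channels, num_classes = 16, 3, 1000
num_patches = (224 // patch) ** 2            # 196 patches
seq_len = num_patches + 1                    # 197, assuming a class token

patch_embed = channels * patch * patch * dim + dim   # patch projection + bias
cls_token = dim
pos_embed = seq_len * dim                            # learned positional embedding

qkv   = dim * 3 * dim + 3 * dim              # fused qkv projection, with bias
proj  = dim * dim + dim                      # attention output projection
mlp   = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)  # two-layer MLP
norms = 2 * (2 * dim)                        # two LayerNorms per block
block = qkv + proj + mlp + norms

head = dim * num_classes + num_classes       # biased classification head
final_norm = 2 * dim

total = patch_embed + cls_token + pos_embed + depth * block + final_norm + head
print(total)  # 86567656
```

Comparing each term of this budget against the per-layer rows in the two summary printouts is the quickest way to find exactly which modules account for the discrepancy.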

lucker26 commented 2 months ago

[screenshot] The first summary output is from the model in this repo. [screenshot] The second is from timm's model.
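One common source of small count differences between ViT implementations is whether the fused qkv projection carries a bias term: timm's `vit_base_patch16_224` uses `qkv_bias=True`, while some implementations create the qkv linear layer with `bias=False`. As a sketch (the attribution of this choice to any particular repo is an assumption to verify against its source), the size of that difference for these settings is:

```python
# Hypothetical illustration: how many parameters the qkv bias accounts for.
dim = 768
depth = 12
qkv_bias_per_block = 3 * dim               # one bias vector for the fused qkv Linear
delta = depth * qkv_bias_per_block         # difference summed over all blocks
print(delta)  # 27648
```

Differences of this magnitude (tens of thousands of parameters out of ~86M) point to bias terms or extra LayerNorms rather than a genuinely different architecture.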