Closed: secretu closed this issue 1 year ago
Hi, I noticed that the DeiT models used in your code are the version pretrained without distillation, which reaches 79.8% top-1 accuracy. Why not use the version pretrained with distillation and a distillation token, which reaches 81.2%?

Hi @secretu, thanks for your interest in our work. We use DeiT without distillation because this series of models is widely used in other architecture papers (e.g., Swin, PVT), whereas the distilled version is less frequently used for comparisons. The results should be consistent if you switch to a more powerful version.
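For reference, here is a minimal sketch of how the two kinds of checkpoints can be loaded with the `timm` library. It assumes the DeiT-Base 224 variants (which match the 79.8% / 81.2% top-1 figures above); other sizes follow the same naming pattern.

```python
import timm

# Non-distilled DeiT-Base, the version used in the repo (~79.8% top-1 on ImageNet-1k)
model = timm.create_model("deit_base_patch16_224", pretrained=True)

# Distilled DeiT-Base with the extra distillation token (~81.2% top-1)
model_distilled = timm.create_model("deit_base_distilled_patch16_224", pretrained=True)
```

Note that the distilled variant carries an extra distillation token, so its forward pass and head differ slightly from the plain model, which is one practical reason the non-distilled checkpoints are the more common baseline for comparison.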