raoyongming / DynamicViT

[NeurIPS 2021] [T-PAMI] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
https://dynamicvit.ivg-research.xyz/
MIT License

About distill #12

Closed · hegc closed this issue 2 years ago

hegc commented 2 years ago

Why does LVViT_Teacher return aux_head(x[:, 1:]) instead of the tokens themselves?

LVViT_Teacher:

x = self.norm(x)
x_cls = self.head(x[:, 0])        # class token -> image-level prediction
x_aux = self.aux_head(x[:, 1:])   # patch tokens -> per-token class scores
return x_cls, x_aux

VisionTransformerTeacher:

feature = self.norm(x)
cls = feature[:, 0]        # class token
tokens = feature[:, 1:]    # raw patch-token features
cls = self.pre_logits(cls)
cls = self.head(cls)
return cls, tokens         # returns token features, not per-token scores

Also, can I use LVViT_Teacher to distill deit_small?

raoyongming commented 2 years ago

Since the LVViT model outputs a classification result for each token location, we use those scores (i.e., self.aux_head(x[:, 1:])) to compute the KL-divergence between the teacher and student models. For LVViT models, we find the KL-divergence loss works better than an MSE loss between local features.
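For reference, a minimal sketch of what such a token-level KL-divergence term could look like; the names (token_kl_distill_loss, student_aux_logits, teacher_aux_logits) and the temperature T are illustrative assumptions, not the repository's exact implementation:

import torch
import torch.nn.functional as F

def token_kl_distill_loss(student_aux_logits, teacher_aux_logits, T=1.0):
    # Both tensors have shape (batch, num_tokens, num_classes):
    # the per-token class scores produced by aux_head(x[:, 1:]).
    B, N, C = student_aux_logits.shape
    log_p_student = F.log_softmax(student_aux_logits.reshape(B * N, C) / T, dim=-1)
    p_teacher = F.softmax(teacher_aux_logits.reshape(B * N, C) / T, dim=-1)
    # KL(teacher || student), averaged over every token location.
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (T * T)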

Different from distillation methods that focus on transferring knowledge from stronger models to weaker models, our method is designed to maintain a model's performance after acceleration. It is natural to use the original model to supervise the sparsified model via feature distillation, so we didn't use stronger teachers in our experiments. I think using LVViT_Teacher may be effective for further improving performance, but the comparison may be unfair.
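As a rough illustration of the feature-distillation idea described above (the mask handling and names here, e.g. keep_mask, are assumptions rather than the repository's exact code): the original model's patch-token features supervise the sparsified model only at the locations the student keeps.

import torch

def token_feature_distill_loss(student_tokens, teacher_tokens, keep_mask):
    # student_tokens / teacher_tokens: (batch, num_tokens, dim) patch features
    # from the sparsified model and the frozen original model.
    # keep_mask: (batch, num_tokens), 1 for tokens the student kept, 0 otherwise.
    diff = (student_tokens - teacher_tokens).pow(2).mean(dim=-1)  # per-token MSE
    # Average only over kept tokens so pruned locations do not contribute.
    return (diff * keep_mask).sum() / keep_mask.sum().clamp(min=1)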

hegc commented 2 years ago

OK, thanks for your quick reply.