megvii-research / mdistiller

The official implementation of [CVPR2022] Decoupled Knowledge Distillation https://arxiv.org/abs/2203.08679 and [ICCV2023] DOT: A Distillation-Oriented Trainer https://openaccess.thecvf.com/content/ICCV2023/papers/Zhao_DOT_A_Distillation-Oriented_Trainer_ICCV_2023_paper.pdf

About the implementation of FitNets #29

Open Coinc1dens opened 1 year ago

Coinc1dens commented 1 year ago

Hello, your work on knowledge distillation is great! However, I have a question about the code of FitNets. I found that you just sum the losses and backpropagate them together; specifically, loss_feat and loss_ce are passed to the trainer directly as one sum. But according to the original paper, FitNets is supposed to first train the initial weights of the intermediate layers with the feature (hint) loss, and then train the whole student model with the CE loss. I wonder if I am missing something or misunderstanding the process? Looking forward to your reply.
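For reference, here is a minimal sketch of the two-stage schedule described in the original FitNets paper. This is not the mdistiller code; the tiny student (`guided` layers plus `head`), the `regressor`, dimensions, and optimizer settings are all hypothetical and only illustrate the training split being discussed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny student: a "guided" lower part and a classification head.
guided = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
head = nn.Linear(128, 100)
# Regressor mapping the student hint to the teacher's feature dimension (256 here).
regressor = nn.Linear(128, 256)

# Stage 1: hint-based pre-training. Only the guided layers and the regressor
# are updated, using the L2 distance to the teacher's hint feature.
opt1 = torch.optim.SGD(list(guided.parameters()) + list(regressor.parameters()), lr=0.1)

def stage1_step(x, teacher_hint):
    loss_feat = F.mse_loss(regressor(guided(x)), teacher_hint)
    opt1.zero_grad()
    loss_feat.backward()
    opt1.step()
    return loss_feat.item()

# Stage 2: train the whole student with the classification (or KD) loss only.
opt2 = torch.optim.SGD(list(guided.parameters()) + list(head.parameters()), lr=0.1)

def stage2_step(x, y):
    loss_ce = F.cross_entropy(head(guided(x)), y)
    opt2.zero_grad()
    loss_ce.backward()
    opt2.step()
    return loss_ce.item()
```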

Zzzzz1 commented 1 year ago

Thanks for your attention. We checked the code and the original paper. FitNets is indeed a two-stage distillation method, but our implementation simply combines the feature loss and the logit loss, following CRD's codebase. We will correct it when updating the code.
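For comparison with the sketch above, the combined single-stage scheme the reply refers to looks roughly like the following. Again, this is only an illustrative sketch under the same hypothetical model and names, not the actual mdistiller or CRD code; both losses are summed and backpropagated in one step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same hypothetical student as above.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
head = nn.Linear(128, 100)
regressor = nn.Linear(128, 256)  # maps student feature to the teacher feature dim

params = list(backbone.parameters()) + list(head.parameters()) + list(regressor.parameters())
opt = torch.optim.SGD(params, lr=0.1)

def train_step(x, y, teacher_feat):
    feat = backbone(x)
    loss_ce = F.cross_entropy(head(feat), y)
    loss_feat = F.mse_loss(regressor(feat), teacher_feat)
    loss = loss_ce + loss_feat  # single backward over the summed losses
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```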