Open Coinc1dens opened 1 year ago
Hello, your work on knowledge distillation is great! However, I have a question about the FitNets code. I found that you just sum the losses for the backward pass; specifically, `loss_feat` and `loss_ce` are passed together to the trainer directly. But according to the original paper, the initial weights of the intermediate layers are supposed to be trained with the feature loss first, and only then is the whole student model trained with the CE loss. Am I misunderstanding the code or the process? Looking forward to your reply.
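For reference, here is a minimal PyTorch-style sketch of the two-stage schedule the FitNets paper describes. All names here (`hint_features`, `guided_features`, `regressor`, the loaders, and the hyperparameter defaults) are hypothetical placeholders for illustration, not identifiers from this repo:

```python
import torch
import torch.nn.functional as F

def train_stage1_hints(student, teacher, regressor, loader, epochs=5, lr=1e-3):
    """Stage 1: fit the student up to its guided layer (plus a small
    regressor that matches feature shapes) to the teacher's hint features
    with an L2 loss. No CE/logit loss is used in this stage."""
    teacher.eval()
    params = list(student.parameters()) + list(regressor.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                f_t = teacher.hint_features(x)   # frozen teacher hint features
            # Gradients only reach layers up to the guided layer, since the
            # loss depends on nothing deeper in the student.
            f_s = regressor(student.guided_features(x))
            loss_feat = F.mse_loss(f_s, f_t)
            opt.zero_grad()
            loss_feat.backward()
            opt.step()

def train_stage2_full(student, loader, epochs=100, lr=1e-2):
    """Stage 2: starting from the hint-initialised weights, train the whole
    student on the task loss (the paper also adds a KD term on softened
    logits; plain CE is shown here for brevity)."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            loss_ce = F.cross_entropy(student(x), y)
            opt.zero_grad()
            loss_ce.backward()
            opt.step()
```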
Thanks for your attention. We checked the code and the original paper. FitNets is indeed a two-stage distillation method, yet our implementation simply combines the feature loss and the logit loss, following CRD's codebase. We will correct this when updating the code.
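For comparison, the current single-stage behaviour is roughly the following (a sketch only; the function signature, the `(logits, features)` return convention, and the `alpha`/`beta` weights are assumptions for illustration, not the repo's actual code):

```python
import torch.nn.functional as F

def train_step_combined(student, teacher_feat, x, y, optimizer, regressor,
                        alpha=1.0, beta=100.0):
    """Single optimisation step over the weighted sum of the logit loss and
    the feature loss, as in CRD's codebase. No separate hint pre-training."""
    logits, f_s = student(x)
    loss_ce = F.cross_entropy(logits, y)
    loss_feat = F.mse_loss(regressor(f_s), teacher_feat)
    loss = alpha * loss_ce + beta * loss_feat  # losses simply summed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```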