The official implementation of [CVPR2022] Decoupled Knowledge Distillation https://arxiv.org/abs/2203.08679 and [ICCV2023] DOT: A Distillation-Oriented Trainer https://openaccess.thecvf.com/content/ICCV2023/papers/Zhao_DOT_A_Distillation-Oriented_Trainer_ICCV_2023_paper.pdf
I think the NCKD loss is the KL loss between the teacher and student predictions over the non-target-class probabilities, whose shape should be (N, C-1), as in Algorithm 1, the pseudocode of DKD in your paper.
But why do you compute it the way the code does? Is it equivalent? Could you give a further explanation? Thanks!
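For what it's worth, here is a minimal numerical sketch (not taken from the repository) of why a masking-based computation can be equivalent to the (N, C-1) formulation. Assuming the code builds the NCKD distributions by subtracting a large constant from the target logit before the softmax (an assumption about the implementation, names like `T`, `gt_mask` are illustrative), the target class then gets essentially zero probability, and the remaining mass matches the renormalized non-target distribution of Algorithm 1:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C, T = 4, 10, 4.0                          # batch size, classes, temperature (illustrative)
logits = torch.randn(N, C)
target = torch.randint(0, C, (N,))
gt_mask = F.one_hot(target, C).bool()

# Variant A: mask the target logit with a large negative value, softmax over all C classes.
# The target column becomes numerically zero, so the rest is a softmax over the C-1 others.
p_masked = F.softmax(logits / T - 1000.0 * gt_mask, dim=1)

# Variant B: Algorithm 1 style -- softmax over all classes, drop the target class,
# renormalize among the C-1 non-target classes (the \hat{p} distribution).
p_full = F.softmax(logits / T, dim=1)
p_nontarget = p_full.masked_fill(gt_mask, 0.0)
p_hat = p_nontarget / p_nontarget.sum(dim=1, keepdim=True)

print(torch.allclose(p_masked, p_hat, atol=1e-6))  # True: the two distributions coincide
```

Since the target column carries (numerically) zero probability for both teacher and student under the masked variant, the KL divergence over the full (N, C) tensors reduces to the KL over the (N, C-1) non-target distributions, which would explain the equivalence being asked about.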