Closed wqshmzh closed 1 month ago
Hi @wqshmzh,
The distillation loss is the term $\alpha \mathbb{E}_{(x,z)\sim\mathcal{M}}[\|z - \pi (x)\|^2_2]$ in Equation 3 of our paper, which corresponds to L153-L160 in the code. The current model (i.e., the one learning the current task) is trained with this distillation loss so that its generated logits stay close to those generated by the previous models (i.e., the ones trained on the previous tasks).
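As a minimal sketch of that term (assuming the memory $\mathcal{M}$ stores inputs `x` alongside the logits `z` produced by the previous model, and `pi_x` are the current model's logits for the same inputs; the function name and NumPy usage are illustrative, not the repository's actual code):

```python
import numpy as np

def distillation_loss(z, pi_x, alpha=1.0):
    """alpha * E_{(x,z)~M} [ ||z - pi(x)||_2^2 ], the distillation term of Eq. 3.

    z:    logits stored in the replay memory M, shape (batch, num_outputs)
    pi_x: current model's logits for the same inputs x, same shape
    """
    # Squared L2 distance per sample; the batch mean approximates the expectation.
    sq_dist = np.sum((z - pi_x) ** 2, axis=1)
    return alpha * sq_dist.mean()
```

Minimizing this term pulls `pi_x` toward the stored logits `z`, which is how the current model is kept from drifting away from what the previous models learned.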
OK. Got it. Sorry my bad !
I noticed that the distillation loss in your code is not mentioned in your paper. What is it used for?