sbwww closed this issue 1 year ago
Hi,
Yes, the distillation layer matching process is top-down: we first match the 12th teacher layer to a student layer, then match the 9th teacher layer, and so on. Therefore, in the constrained version, a teacher layer may only be matched to a student layer lower than the ones already matched. Thanks for pointing this out; we will try to make this clearer in our updated version.
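For readers following along, here is a minimal sketch of the top-down constrained matching described above. It is not the code from trainer.py: the function name `match_layers_top_down`, the plain nested-list distance matrix `dist`, the 0-indexed layers, and the `None` fallback when no lower student layer remains are all assumptions made for illustration.

```python
def match_layers_top_down(dist, teacher_layers, num_student_layers):
    """Greedy top-down matching (illustrative sketch, hypothetical helper).

    dist[t][s] is an assumed distance between teacher layer t and
    student layer s (both 0-indexed). Smaller means a better match.
    """
    matches = {}
    # Exclusive upper bound on the student layers still allowed.
    upper_bound = num_student_layers
    # Visit teacher layers from the top of the network to the bottom.
    for t in sorted(teacher_layers, reverse=True):
        candidates = range(upper_bound)
        if not candidates:
            # Assumption: leave the teacher layer unmatched when no
            # lower student layer is left.
            matches[t] = None
            continue
        best = min(candidates, key=lambda s: dist[t][s])
        matches[t] = best
        # The next (lower) teacher layer must match strictly below `best`.
        upper_bound = best
    return matches


# Toy example: 2 teacher layers, 3 student layers.
dist = [
    [5, 1, 0],  # teacher layer 0 would prefer student layer 2 if unconstrained
    [4, 2, 3],  # teacher layer 1 prefers student layer 1
]
print(match_layers_top_down(dist, teacher_layers=[0, 1], num_student_layers=3))
# prints {1: 1, 0: 0}
```

In the toy example, teacher layer 1 is matched first and takes student layer 1, so teacher layer 0 is restricted to student layer 0 even though student layer 2 has the smallest distance. That is the "lower than the previously matched layer" constraint in action.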
Hi, I just noticed a confusing description of the distillation constraint. Intuitively, I (and probably many other readers) would imagine the distillation proceeding from bottom to top, i.e., from layer 1 to layer 12. To handle layer mismatch, we would then expect a higher student layer to be matched with a higher teacher layer. Thus, it is odd to see the constraint stated as "lower than the previous matched layer".
After reading line 601 of trainer.py, I see that the distillation is top-down, which is why the constraint is "lower than the previous matched layer". Still, I think the distillation direction needs to be stated explicitly.