Maybe confusing description of the distillation constraint

princeton-nlp / CoFiPruning

[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408



Hi, I just noticed a confusing description of the distillation constraint. Intuitively, I (and probably many other readers) would imagine the distillation proceeding bottom-up, i.e., from layer 1 to layer 12. To handle the layer mismatch, one would then expect a higher student layer to be matched with a higher teacher layer. So it is odd to see the constraint stated as "lower than the previous matched layer".

After reading trainer.py (line 601), I see that the layer matching runs top-down, which explains why the constraint is "lower than the previous matched layer". Still, I think the distillation direction should be stated explicitly in the description.

for search_index in range(3, -1, -1):
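To make the direction concrete, here is a minimal sketch of how I understand a top-down matching with this constraint. It is not the code from trainer.py: the function name `match_top_down`, the `loss[s][t]` layout, and the `None` fallback when no student layer is left are all my own assumptions for illustration.

```python
from typing import List, Optional


def match_top_down(loss: List[List[float]]) -> List[Optional[int]]:
    """Match each distilled teacher layer to a student layer, top-down.

    loss[s][t] is a hypothetical distance between student layer s and the
    t-th distilled teacher layer (assumed layout, not the repo's tensor).
    """
    num_student = len(loss)
    num_teacher = len(loss[0])
    upper_bound = num_student  # exclusive bound: nothing matched yet
    matched: List[Optional[int]] = []

    # Walk the teacher layers from the top one down to the bottom one,
    # mirroring `range(3, -1, -1)` for four distilled teacher layers.
    for t in range(num_teacher - 1, -1, -1):
        # Constraint: only student layers strictly lower than the
        # previously matched one are still eligible.
        candidates = [s for s in range(num_student) if s < upper_bound]
        if candidates:
            best = min(candidates, key=lambda s: loss[s][t])
            upper_bound = best
            matched.append(best)
        else:
            matched.append(None)  # assumed fallback: no layer left

    matched.reverse()  # report in bottom-to-top teacher order
    return matched


if __name__ == "__main__":
    # Toy loss: teacher layer t is closest to student layer targets[t].
    targets = [1, 2, 4, 5]
    toy_loss = [[abs(s - tgt) for tgt in targets] for s in range(6)]
    print(match_top_down(toy_loss))
```

Because the loop starts at the top teacher layer, each later match has to sit below the previous one, which is exactly why the constraint reads "lower" rather than "higher". A bottom-up loop would need the opposite constraint.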


Maybe confusing description of the distillation constraint #26