Hi, I have a question about the intuition behind the prepruning distillation step. Why are you not initializing the student model from the teacher weights, instead of initializing it from scratch (/pretrained on MLM BERT checkpoint)?
Yes, I think initializing the student from the teacher weights is completely fine; in the limited exploration we did, the two approaches gave empirically very similar results :)
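For illustration, a minimal PyTorch sketch of the two initialization options discussed here (this is a hypothetical toy setup with a stand-in architecture, not the repo's actual code): the student can start from fresh random weights, or it can copy the teacher's weights before prepruning distillation begins.

```python
import torch
import torch.nn as nn

def make_model():
    # Stand-in for a transformer encoder; teacher and student share the
    # same architecture, so a direct state_dict copy is possible.
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

teacher = make_model()

# Option A: student initialized from scratch (fresh random weights).
student_scratch = make_model()

# Option B: student initialized from the teacher's weights, as asked above.
student_from_teacher = make_model()
student_from_teacher.load_state_dict(teacher.state_dict())

# Option B's student starts out functionally identical to the teacher.
x = torch.randn(4, 16)
assert torch.allclose(teacher(x), student_from_teacher(x))
```

Either starting point is then trained with the distillation objective; per the reply above, both were observed to reach similar results.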