Closed slawek-ib closed 1 year ago
Hi,
That's a good point! We fine-tune the untuned student model from scratch for 1 epoch before we start pruning it. Conceptually, this is similar to initializing the student with the teacher's weights. We explored fine-tuning for 1, 2, and 3 epochs before pruning, and the results were similar.
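To make the two strategies concrete, here is a minimal sketch (not the repo's actual code) contrasting teacher-weight initialization with a brief fine-tune of an untuned student, followed by magnitude pruning. Weights are plain Python lists standing in for real network parameters, and `grads` is a hypothetical precomputed gradient, so this only illustrates the flow of the procedure:

```python
def init_from_teacher(teacher_weights):
    # Strategy A: initialize the student by copying the teacher's weights.
    return list(teacher_weights)

def brief_finetune(weights, grads, lr=0.1, epochs=1):
    # Strategy B: start from an untuned student and fine-tune for a small
    # number of epochs (1, 2, or 3 gave similar results) before pruning.
    for _ in range(epochs):
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

def magnitude_prune(weights, keep_ratio=0.5):
    # After either initialization, zero out the smallest-magnitude weights.
    k = int(len(weights) * keep_ratio)
    keep = set(sorted(range(len(weights)),
                      key=lambda i: -abs(weights[i]))[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

student = brief_finetune([0.0, 0.0, 0.0, 0.0], [-30.0, 1.0, -20.0, -0.5])
pruned = magnitude_prune(student, keep_ratio=0.5)
```

Either way, the student ends up near a well-trained starting point before pruning begins, which is why the two initializations behave similarly in practice.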
Hi, I am closing this issue. Feel free to reopen it if you have more questions :)
Hi, thanks for your great work on this project!
I'm curious why the student model starts from an untuned model rather than from the teacher's weights. It seems that reusing them could make training faster. Is that something you've explored?