Closed seokg closed 4 years ago
I have read Appendix 6.1 of the paper, where the authors provide the complete training pipeline.
For the second stage, I am guessing the authors transfer the weights of the teacher to the student network using `load_networks`.
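For illustration, a `load_networks`-style teacher-to-student transfer can be sketched as copying each teacher parameter into the matching student parameter, cropped to the student's (smaller) shape. This is only a hypothetical sketch with NumPy arrays standing in for tensors, not the repository's actual implementation; the function and variable names here are my own.

```python
import numpy as np


def transfer_weights(teacher_state, student_state):
    """Copy teacher parameters into the student where names match,
    cropping each array to the student's (smaller) shape.

    Illustrative sketch only: real channel-selection logic in a
    distillation codebase may be more involved than plain cropping.
    """
    for name, s_param in student_state.items():
        t_param = teacher_state.get(name)
        if t_param is None:
            continue  # student-only parameters keep their initialization
        # Crop the teacher array to the student's dimensions.
        slices = tuple(slice(0, d) for d in s_param.shape)
        student_state[name] = t_param[slices].copy()
    return student_state


# Toy example: a 4x6 teacher weight cropped into a 2x3 student weight.
teacher = {"conv1.weight": np.arange(24.0).reshape(4, 6)}
student = {"conv1.weight": np.zeros((2, 3)), "extra.bias": np.zeros(2)}
student = transfer_weights(teacher, student)
print(student["conv1.weight"])  # top-left 2x3 block of the teacher weight
```

Initializing the student this way is typically not strictly required for training, but starting from teacher weights usually speeds up convergence compared to random initialization.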
Finally, the authors provide Table 5 with training details, such as the number of epochs for training from scratch, distillation, fine-tuning, and once-for-all network training. For pix2pix and CycleGAN, the once-for-all network training uses twice as many epochs as training from scratch, distillation, or fine-tuning.
I guess this answers all the questions I had. Please correct me if I got something wrong.
Yes, you're correct. The once-for-all network training will take no more than 2 days on a single 1080Ti.
Hi, after going through the code I have come up with a few questions regarding the training and distillation.
1. … `resnet_supernet.py` …
2. … `resnet_distiller.py` and transfer the weight to the student supernet?
3. In the `load_networks` function in `resnet_distiller.py`, is it necessary to transfer the weight of the teacher network to the student network, or is it just for faster training and convergence?