idstcv / GPU-Efficient-Networks

Apache License 2.0

Accuracy-reproduction problem for GENet #3

Closed pawopawo closed 4 years ago

pawopawo commented 4 years ago

I saw your GENet and am very interested in reproducing the paper's results, but I found that the training details are not entirely clear. Using batch size 1024, lr 0.5, weight decay 1e-4, 360 epochs, 5 warm-up epochs, cosine learning-rate decay, and no dropout, the GENet-normal architecture only reached 76.1% accuracy.

Could you share the training recipe for the GENet-normal architecture — e.g. lr, batch size, weight decay, dropout rate, epochs, the learning-rate decay schedule, and whether warm-up was used? Looking forward to your help!

MingLin-home commented 4 years ago

We will update our draft this week to include more detailed training parameters. We use cosine lr decay, a 5-epoch warm-up, weight decay 4e-5, lr=0.1, and batch size 256.
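The schedule described above (lr=0.1, cosine decay, 5-epoch linear warm-up) can be sketched as a plain-Python function. The total epoch count below is taken from the question, not from the authors' reply, so treat it as an assumption:

```python
import math

def lr_at_epoch(epoch, total_epochs=360, warmup_epochs=5, base_lr=0.1):
    """Cosine learning-rate decay with linear warm-up.

    base_lr and warmup_epochs follow the values quoted in this thread;
    total_epochs=360 is an assumption borrowed from the question above.
    """
    if epoch < warmup_epochs:
        # Linear warm-up: ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this value would typically be written into each optimizer parameter group at the start of every epoch.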

pawopawo commented 4 years ago

How much did distillation improve the results reported in the paper?

MingLin-home commented 4 years ago

The main purpose of the teacher network is to help the student network escape bad local minima. Without the help of the teacher network, there is about a 1% accuracy drop in the early training stages.

pawopawo commented 4 years ago

> The main purpose of teacher network is to help the student network escape the bad local minima. There is about 1% accuracy drop without the help of teacher network in the early training stages.

So distillation has no effect on the final accuracy? It just makes convergence faster?

MingLin-home commented 4 years ago

> The main purpose of teacher network is to help the student network escape the bad local minima. There is about 1% accuracy drop without the help of teacher network in the early training stages.

> So distillation has no effect on the final accuracy? It just makes convergence faster?

Without the teacher network, training quickly gets stuck around epoch 60. With the teacher network, accuracy keeps increasing the longer you train. It seems that which teacher network you use is not important, which is weird to us too.
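The teacher supervision discussed in this thread is commonly implemented as a Hinton-style distillation term: a KL divergence between temperature-softened teacher and student distributions. A minimal plain-Python sketch (the temperature value and exact loss form are assumptions, not details confirmed in this thread):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions.

    A sketch of the standard distillation term; temperature=4.0 is an
    illustrative assumption. In practice this is mixed with the usual
    cross-entropy loss on the ground-truth labels.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student matches the teacher exactly and positive otherwise, so minimizing it pulls the student's output distribution toward the teacher's.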