What PR are you referring to?
It seems that claims have been made that ResNet SGD can converge in 41 epochs. People are interested in how that is possible; are there HPs that achieve this?
We allowed ResNet SGD + poly because of claims that it works really well. We would like to see the HPs that make it work really well.
AI (NV): see what you can share.
@jonathan-cohen-nvidia https://github.com/mlperf/training/pull/390
Here are HPs that work well for SGD + Poly LR.
BS | LR | Mom | WD | Warmup | Total epochs | Expected convergence
---|---|---|---|---|---|---
~1.6k | 3 | 0.9 | 0.000025 | 3 | 41 | 40
~3.2k | 6 | 0.9 | 0.000025 | 5 | 42 | 41
In official HP speak:
- BS -> global_batch_size
- LR -> sgd_opt_base_learning_rate
- Mom -> sgd_opt_momentum
- WD -> sgd_opt_weight_decay
- Warmup -> opt_learning_rate_warmup_epochs
- Total epochs -> sgd_opt_learning_rate_decay_steps
- sgd_opt_learning_rate_decay_poly_power -> must be set to 2 according to the rules
- sgd_opt_end_learning_rate -> according to the rules it must be 0.0 or 0.0(...)01; it doesn't make any difference
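For concreteness, here is a minimal sketch (plain Python, framework-agnostic) of the schedule these HPs describe: linear warmup followed by power-2 polynomial decay to the end LR. Only the HP names and values come from this thread; the exact way the warmup and decay windows are counted varies between implementations, so treat the shape below as an assumption rather than the reference implementation.

```python
# Minimal sketch of the SGD + poly LR schedule implied by the HPs above
# (BS ~1.6k row). Linear warmup is an assumption based on the usual MLPerf
# ResNet convention; only the HP names/values are taken from this thread.

def poly_lr(epoch: float,
            base_lr: float = 3.0,          # sgd_opt_base_learning_rate
            warmup_epochs: float = 3.0,    # opt_learning_rate_warmup_epochs
            decay_epochs: float = 41.0,    # sgd_opt_learning_rate_decay_steps ("Total epochs")
            power: float = 2.0,            # sgd_opt_learning_rate_decay_poly_power (fixed by the rules)
            end_lr: float = 0.0) -> float: # sgd_opt_end_learning_rate
    """Return the learning rate at a (possibly fractional) epoch."""
    if epoch < warmup_epochs:
        # Linear ramp from 0 to base_lr over the warmup period.
        return base_lr * epoch / warmup_epochs
    # Power-2 polynomial decay from base_lr to end_lr over the remaining epochs.
    progress = min((epoch - warmup_epochs) / (decay_epochs - warmup_epochs), 1.0)
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr

if __name__ == "__main__":
    for e in (0, 1, 3, 10, 20, 41):
        print(f"epoch {e:>2}: lr = {poly_lr(e):.4f}")
```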
Perfect, thank you @jonathan-cohen-nvidia!
SWG:
Reported to be closable.
During HP borrowing, can we borrow the LARS optimizer when we use SGD, or vice versa?
Yes. Since the optimizer is officially an HP, switching to a different optimizer during HP borrowing is allowed. We explicitly discussed this and agreed it should be allowed.
Ok. Let's document this.
Clarify in the rules, if not already there, that Optimizer is an HP which can be borrowed/stolen.
Please share the HPs via PR 390, @jonathan-cohen-nvidia.