What PR are you referring to?
It seems that claims have been made that ResNet SGD can converge in 41 epochs. People are interested in how that is possible; are there HPs that achieve this?
We allowed ResNet SGD + poly because of claims that it works really well. We would like to see the HPs that make it work really well.
AI (NV): see what you can share.
@jonathan-cohen-nvidia https://github.com/mlperf/training/pull/390
Here are HPs that work well for SGD + Poly LR.
BS | LR | Mom | WD | Warmup | Total epochs | Expected convergence
---|---|---|---|---|---|---
~1.6k | 3 | 0.9 | 0.000025 | 3 | 41 | 40
~3.2k | 6 | 0.9 | 0.000025 | 5 | 42 | 41
In official HP speak:
- BS -> global_batch_size
- LR -> sgd_opt_base_learning_rate
- Mom -> sgd_opt_momentum
- WD -> sgd_opt_weight_decay
- Warmup -> opt_learning_rate_warmup_epochs
- Total epochs -> sgd_opt_learning_rate_decay_steps
- sgd_opt_learning_rate_decay_poly_power -> must be set to 2 according to the rules
- sgd_opt_end_learning_rate -> according to the rules it must be 0.0 or 0.0(...)01; it doesn't make any difference
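For concreteness, here is a minimal sketch (plain Python, framework-agnostic) of the schedule these HPs describe: linear warmup followed by power-2 polynomial decay to the end LR. Only the HP names and values come from this thread; the exact way the warmup and decay windows are counted varies between implementations, so treat the shape below as an assumption rather than the reference implementation.

```python
# Minimal sketch of the SGD + poly LR schedule implied by the HPs above
# (BS ~1.6k row). Linear warmup is an assumption based on the usual MLPerf
# ResNet convention; only the HP names/values are taken from this thread.

def poly_lr(epoch: float,
            base_lr: float = 3.0,          # sgd_opt_base_learning_rate
            warmup_epochs: float = 3.0,    # opt_learning_rate_warmup_epochs
            decay_epochs: float = 41.0,    # sgd_opt_learning_rate_decay_steps ("Total epochs")
            power: float = 2.0,            # sgd_opt_learning_rate_decay_poly_power (fixed by the rules)
            end_lr: float = 0.0) -> float: # sgd_opt_end_learning_rate
    """Return the learning rate at a (possibly fractional) epoch."""
    if epoch < warmup_epochs:
        # Linear ramp from 0 to base_lr over the warmup period.
        return base_lr * epoch / warmup_epochs
    # Power-2 polynomial decay from base_lr to end_lr over the remaining epochs.
    progress = min((epoch - warmup_epochs) / (decay_epochs - warmup_epochs), 1.0)
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr

if __name__ == "__main__":
    for e in (0, 1, 3, 10, 20, 41):
        print(f"epoch {e:>2}: lr = {poly_lr(e):.4f}")
```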
Perfect, thank you @jonathan-cohen-nvidia!
SWG:
Reported to be closable.
During HP borrowing, can we borrow the LARS optimizer when we use SGD, or vice versa?
Yes. Since the optimizer is officially an HP, switching to a different optimizer during HP borrowing is allowed. We explicitly discussed this and agreed it should be allowed.
Ok. Let's document this.
Clarify in the rules, if not already there, that Optimizer is an HP which can be borrowed/stolen.
Please share the HPs via PR 390, @jonathan-cohen-nvidia.