Proposal to consider:
We propose the following can be tuned as follows:
- LARS base learning rate: as per MLPerf 0.6 rules
- LARS weight decay: permissible values of 1e-4 and 2e-4
- LARS warmup epochs: as per MLPerf 0.6 rules
- LARS decay steps: tunable to enable SOTA convergence steps for ResNet-50. The MLPerf reference ResNet-50 shall describe the decay steps for the different batch sizes.

TBD: scaling to new batch sizes.
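To make the proposal concrete, here is a minimal sketch of the kind of schedule these knobs would control, assuming the linear-warmup-then-polynomial-decay shape commonly used with LARS for large-batch ResNet-50; the function and parameter names are illustrative, not the MLPerf reference's API:

```python
def lars_learning_rate(step, base_lr, warmup_epochs, decay_steps,
                       steps_per_epoch, end_lr=1e-4, power=2.0):
    """Illustrative LARS LR schedule: linear warmup, then polynomial decay."""
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        # LARS warmup epochs: ramp linearly from 0 up to the LARS base LR.
        return base_lr * step / warmup_steps
    # LARS decay steps: polynomial decay from base_lr down to end_lr.
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr
```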
Tune the momentum hyperparameter.
The ResNet-50 default uses a momentum of 0.9. As shown in the MLPerf v0.6 scaling reference, tuning the momentum hyperparameter in the range [0.9, 0.95) results in faster training and convergence to the target accuracy at larger batch sizes. The Fujitsu submission in MLPerf 0.6 also used a momentum hyperparameter in that range.
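For context on where momentum and weight decay enter the update, here is a simplified per-layer LARS step following the original LARS formulation (You et al.); it is a sketch under that assumption, not the MLPerf reference implementation, and the trust_coef value is illustrative:

```python
import numpy as np

def lars_step(w, grad, v, lr, momentum=0.9, weight_decay=1e-4,
              trust_coef=0.001, eps=1e-9):
    """Simplified single-tensor LARS update (sketch only)."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    # Layer-wise trust ratio rescales the global LR for this layer.
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    # Heavy-ball momentum accumulation: this is where the tunable momentum
    # hyperparameter and the weight decay enter the update.
    v = momentum * v + lr * local_lr * (grad + weight_decay * w)
    return w - v, v
```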
Hyperparameters for batch 64k:
- LARS base LR: 29.0
- LARS warmup epochs: 18
- Total number of steps: 2669 (corresponds to 68 epochs)
- Momentum: 0.93
- Weight decay: 1e-4
Hyperparameters for batch 32k training in 64 epochs:
- LARS base LR: 32.2
- LARS warmup epochs: 28
- Decay steps: 1785 (corresponds to 90 epochs)
- Momentum: 0.94
- Weight decay: 1e-4
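For reference, the "corresponds to N epochs" figures above depend only on the batch size and the training-set size; a minimal conversion sketch, assuming the standard ImageNet-1k training set of roughly 1.28M images:

```python
import math

# Assumption: standard ImageNet-1k training set size.
IMAGENET_TRAIN_IMAGES = 1_281_167

def epochs_from_steps(steps, batch_size):
    return steps * batch_size / IMAGENET_TRAIN_IMAGES

def steps_per_epoch(batch_size):
    return math.ceil(IMAGENET_TRAIN_IMAGES / batch_size)
```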
For completeness, is it possible to put a link to the LARS v0.6 HP table for smaller batch sizes as well?
The description in this issue doesn't quite match the PR: https://github.com/mlperf/training/pull/342/files
We think that the proposal is as follows:
Unclear: is the proposal to allow weight decay to be a tunable HP ∈ {1e-4, 2e-4}? If the proposal is to also allow 2e-4, the request is for data/evidence/argument as to how this is better.
Duplicate of #346 (will be handled by the HP table merge)
We are interested in exploring ResNet at larger scales. This could include methods to reach an 80k batch size, including different HPs and optimizer changes.