Proposal to consider:
We propose the following can be tuned as follows:
- LARS base learning rate: as per MLPerf 0.6 rules
- LARS weight decay: permissible values of 1e-4 and 2e-4
- LARS warmup epochs: as per MLPerf 0.6 rules
- LARS decay steps: tunable to enable SOTA convergence steps for ResNet-50. The MLPerf reference ResNet-50 shall describe the decay steps for the different batch sizes.

TBD: scaling to new batch sizes.
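To make the proposal concrete, here is a minimal sketch of the kind of schedule these knobs would control, assuming the linear-warmup-then-polynomial-decay shape commonly used with LARS for large-batch ResNet-50; the function and parameter names are illustrative, not the MLPerf reference's API:

```python
def lars_learning_rate(step, base_lr, warmup_epochs, decay_steps,
                       steps_per_epoch, end_lr=1e-4, power=2.0):
    """Illustrative LARS LR schedule: linear warmup, then polynomial decay."""
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        # LARS warmup epochs: ramp linearly from 0 up to the LARS base LR.
        return base_lr * step / warmup_steps
    # LARS decay steps: polynomial decay from base_lr down to end_lr.
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr
```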
Tune the momentum hyperparameter.
The ResNet-50 default uses a momentum of 0.9. As shown in the MLPerf v0.6 scaling reference, tuning the momentum hyperparameter in the range [0.9, 0.95) results in faster training and convergence to the target accuracy at larger batch sizes. The Fujitsu submission in MLPerf 0.6 also used a momentum hyperparameter in that range.
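For context on where momentum and weight decay enter the update, here is a simplified per-layer LARS step following the original LARS formulation (You et al.); it is a sketch under that assumption, not the MLPerf reference implementation, and the trust_coef value is illustrative:

```python
import numpy as np

def lars_step(w, grad, v, lr, momentum=0.9, weight_decay=1e-4,
              trust_coef=0.001, eps=1e-9):
    """Simplified single-tensor LARS update (sketch only)."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    # Layer-wise trust ratio rescales the global LR for this layer.
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    # Heavy-ball momentum accumulation: this is where the tunable momentum
    # hyperparameter and the weight decay enter the update.
    v = momentum * v + lr * local_lr * (grad + weight_decay * w)
    return w - v, v
```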
Hyperparameters for batch 64k:
- LARS base LR: 29.0
- LARS warmup epochs: 18
- Total number of steps: 2669 (corresponds to 68 epochs)
- Momentum: 0.93
- Weight decay: 1e-4
Hyperparameters for batch 32k training in 64 epochs:
- LARS base LR: 32.2
- LARS warmup epochs: 28
- Decay steps: 1785 (corresponds to 90 epochs)
- Momentum: 0.94
- Weight decay: 1e-4
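For reference, the "corresponds to N epochs" figures above depend only on the batch size and the training-set size; a minimal conversion sketch, assuming the standard ImageNet-1k training set of roughly 1.28M images:

```python
import math

# Assumption: standard ImageNet-1k training set size.
IMAGENET_TRAIN_IMAGES = 1_281_167

def epochs_from_steps(steps, batch_size):
    return steps * batch_size / IMAGENET_TRAIN_IMAGES

def steps_per_epoch(batch_size):
    return math.ceil(IMAGENET_TRAIN_IMAGES / batch_size)
```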
For completeness, is it possible to put a link to the LARS v0.6 HP table for smaller batch sizes as well?
The description in this issue doesn't quite match the PR: https://github.com/mlperf/training/pull/342/files
We think that the proposal is as follows:
Unclear: is the proposal to allow weight decay to be a tunable HP ∈ {1e-4, 2e-4}? If the proposal is to also allow 2e-4, the request is for data/evidence/argument as to how this is better.
Duplicate of #346 (will be handled by the HP table merge)
We are interested in exploring ResNet at larger scales. This could include methods to reach an 80k batch size, including different HPs and optimizer changes.