mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training

Convergence issue with MXNet ResNet-50 using LARS with 4K batch size #296

Closed · rnaidu02 closed this issue 4 years ago

rnaidu02 commented 4 years ago

MXNet ResNet-50 with LARS at 4K batch size does not converge within 60 epochs. Details follow:

The Intel team used the MXNet built-in mxnet.optimizer.lars and mxnet.lr_scheduler.PolyScheduler, together with the resnet50_v1b model from the MXNet GluonCV model zoo. For hyperparameters, we used the HPs from Google's last submission with 4K batch size (128 × 32 TPU cores): https://github.com/mlperf/training_results_v0.6/blob/master/Google/results/tpu-v3-32/resnet/result_0.txt

We would like to know whether we are missing something needed to reproduce the convergence results that NVIDIA and Google achieve with the same LARS hyperparameter configuration, using the ResNet-50 model from the MXNet GluonCV model zoo.
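For reference, a minimal sketch of how the setup described above could be wired up in MXNet (1.6+, which provides the built-in LARS optimizer). The epoch count and batch size come from the issue; the warmup length, peak learning rate, and weight decay shown here are illustrative assumptions, not the exact values from Google's submission linked above.

```python
import mxnet as mx
from gluoncv.model_zoo import get_model

# Values from the issue (4K global batch, 60-epoch budget); warmup length,
# peak LR, and weight decay below are assumed for illustration only.
batch_size = 4096
num_epochs = 60
warmup_epochs = 5
peak_lr = 10.0                               # assumed peak LR for a 4K batch
steps_per_epoch = 1281167 // batch_size      # ImageNet-1k training images

# Polynomial (power-2) LR decay with linear warmup, as in the MLPerf reference
lr_sched = mx.lr_scheduler.PolyScheduler(
    max_update=num_epochs * steps_per_epoch,     # total optimizer updates
    base_lr=peak_lr,
    pwr=2,
    warmup_steps=warmup_epochs * steps_per_epoch,
)

# Built-in LARS optimizer; trust coefficient and epsilon left at their defaults
opt = mx.optimizer.LARS(
    learning_rate=peak_lr,      # the scheduler drives the per-step LR
    momentum=0.9,
    wd=1e-4,                    # assumed weight decay
    lr_scheduler=lr_sched,
)

# resnet50_v1b from the GluonCV model zoo, trained from scratch
net = get_model('resnet50_v1b')
net.initialize(mx.init.Xavier(), ctx=mx.cpu())   # use mx.gpu(i) on GPU hosts
trainer = mx.gluon.Trainer(net.collect_params(), opt)
```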

bitfort commented 4 years ago

There was some discussion of this during the meeting to try to understand the issue. We can follow up offline based on what was discussed.

bitfort commented 4 years ago

Is there still an issue here? Please reach out over email.

bitfort commented 4 years ago

Action item (Christine): follow up over email.

christ1ne commented 4 years ago

We are OK to close.