You are welcome to do distributed training. We don't have exact documentation on how to do this, but here are two starting points:
1. Keep mathematical equivalence to a single-node run.
2. Check your hyperparameters against the reference implementations, which support gradient accumulation on a single device (see the sketch below).
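A minimal sketch of what that equivalence check could look like, assuming a PyTorch setup; the model, data, and `accum_steps` value are placeholders for illustration, not MLPerf reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 4)                      # toy model, stands in for the real one
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8                               # plays the role of 8 distributed workers

# Dummy micro-batches; together they form one "effective" batch.
data = [(torch.randn(32, 16), torch.randint(0, 4, (32,))) for _ in range(accum_steps)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    # Dividing by accum_steps makes the summed gradients equal the mean over
    # the full effective batch, matching an all-reduce average across workers.
    loss = F.cross_entropy(model(inputs), targets) / accum_steps
    loss.backward()                           # gradients accumulate in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                      # one optimizer step per effective batch
        optimizer.zero_grad()
```

With the same effective batch size, learning rate, and number of optimizer steps, the single-device accumulated run should track the distributed run closely, which makes it a useful reference when validating hyperparameters.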
Is distributed training supported in MLPerf? If yes, is there any documentation?