You are welcome to do distributed training. We don't have exact documentation on how to do this, but here are two starting points:
1. Keep mathematical equivalence to a single-node run.
2. Check your hyperparameters against the reference implementations, which support gradient accumulation on a single device (see the sketch below).
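A minimal sketch of what that equivalence check could look like, assuming a PyTorch setup; the model, data, and `accum_steps` value are placeholders for illustration, not MLPerf reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 4)                      # toy model, stands in for the real one
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8                               # plays the role of 8 distributed workers

# Dummy micro-batches; together they form one "effective" batch.
data = [(torch.randn(32, 16), torch.randint(0, 4, (32,))) for _ in range(accum_steps)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    # Dividing by accum_steps makes the summed gradients equal the mean over
    # the full effective batch, matching an all-reduce average across workers.
    loss = F.cross_entropy(model(inputs), targets) / accum_steps
    loss.backward()                           # gradients accumulate in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                      # one optimizer step per effective batch
        optimizer.zero_grad()
```

With the same effective batch size, learning rate, and number of optimizer steps, the single-device accumulated run should track the distributed run closely, which makes it a useful reference when validating hyperparameters.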
Is distributed training supported in MLPerf? If yes, is there any documentation?