msalvaris opened 5 years ago
I also noticed that NCCL_IB_DISABLE (an env variable) is set to 1 by the pretraining AML environment (or maybe by the Docker image)
NCCL_IB_DISABLE
The NCCL_IB_DISABLE variable disables the IB/RoCE transport that is to be used by NCCL. Instead, NCCL will fall back to using IP sockets.
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html
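For reference, a minimal sketch of toggling the variable from a shell (setting `NCCL_DEBUG=INFO` is a standard way to make NCCL log which transport it actually selects, so you can confirm the effect):

```shell
# Make NCCL log which transport (IB/RoCE vs. sockets) it selects at init.
export NCCL_DEBUG=INFO

# 1 disables the IB/RoCE transport; NCCL falls back to IP sockets.
export NCCL_IB_DISABLE=1

# 0 (or unsetting the variable) allows NCCL to use IB/RoCE again.
export NCCL_IB_DISABLE=0

echo "NCCL_IB_DISABLE=${NCCL_IB_DISABLE}"
```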
Wonder if the authors hit any blocking issues using InfiniBand/RDMA @aashna
When I tried the pretraining on ND24rs (RDMA/infiniband), I got the following error:
RuntimeError: NCCL error in: ... /torch/lib/c10d/ProcessGroupNCCL.cpp:290, unhandled system error
I think NCCL_IB_DISABLE should be set to 0 (or unset), but haven't tried yet.
After checking with AzureML folks, it turned out I have to use Intel MPI as the backend when I use nodes without SR-IOV support.
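To make that concrete, a minimal sketch of the backend choice (`pick_backend` is a hypothetical helper, not an AzureML API; it assumes PyTorch was built with MPI support, e.g. Intel MPI, when the `"mpi"` backend is selected):

```python
def pick_backend(sriov_supported: bool) -> str:
    # With SR-IOV, NCCL can drive the InfiniBand NIC directly from the VM,
    # so the usual "nccl" backend works. Without SR-IOV, fall back to an
    # MPI backend (Intel MPI on AzureML) for inter-node communication.
    return "nccl" if sriov_supported else "mpi"


# The chosen string would then be passed to
# torch.distributed.init_process_group(backend=pick_backend(...)).
print(pick_backend(False))
```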
SR-IOV stands for “single root input/output virtualization” which optimizes sharing of PCI Express devices in a system with virtual machines. In Azure, SR-IOV for InfiniBand enables near bare-metal performance for any MPI library.
Accelerating Distributed Training in Azure Machine Learning service using SR-IOV
If you have access to NCv3 or NDv2, then you can take advantage of the faster GPU interconnect. SR-IOV support should come to NCv2 and NDv1 later in 2020.
Without SR-IOV, for NCCL, we need to set "NCCL_IB_DISABLE": "1" to disable InfiniBand on RDMA-capable VMs (e.g., ND24rs), since NCCL cannot use the InfiniBand NIC there.
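A small sketch of wiring this up in the training script before process-group init (`configure_nccl` is a hypothetical helper; only `os.environ` is real API here):

```python
import os


def configure_nccl(sriov_supported: bool) -> None:
    # Without SR-IOV, NCCL cannot use the virtualized InfiniBand NIC,
    # so disable the IB transport and let NCCL fall back to IP sockets.
    os.environ["NCCL_IB_DISABLE"] = "0" if sriov_supported else "1"
    # Log NCCL's transport selection so the setting can be verified.
    os.environ["NCCL_DEBUG"] = "INFO"


configure_nccl(sriov_supported=False)
# ...then call torch.distributed.init_process_group(backend="nccl") as usual.
```

These environment variables must be set before NCCL initializes (i.e., before `init_process_group`), otherwise they have no effect.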
Is there a reason for using Standard_NC24s_v3 rather than the RDMA capable Standard_NC24rs_v3?