msalvaris opened 5 years ago
I also noticed that NCCL_IB_DISABLE (an env variable) is set to 1 by the pretraining AML environment (or maybe by the Docker image)
NCCL_IB_DISABLE
The NCCL_IB_DISABLE variable disables the IB/RoCE transport that is to be used by NCCL. Instead, NCCL will fall back to using IP sockets.
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html
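For reference, a minimal sketch of toggling the variable from a shell (setting `NCCL_DEBUG=INFO` is a standard way to make NCCL log which transport it actually selects, so you can confirm the effect):

```shell
# Make NCCL log which transport (IB/RoCE vs. sockets) it selects at init.
export NCCL_DEBUG=INFO

# 1 disables the IB/RoCE transport; NCCL falls back to IP sockets.
export NCCL_IB_DISABLE=1

# 0 (or unsetting the variable) allows NCCL to use IB/RoCE again.
export NCCL_IB_DISABLE=0

echo "NCCL_IB_DISABLE=${NCCL_IB_DISABLE}"
```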
Wonder if the authors hit any blocking issues using InfiniBand/RDMA @aashna
When I tried the pretraining on ND24rs (RDMA/infiniband), I got the following error:
RuntimeError: NCCL error in: ... /torch/lib/c10d/ProcessGroupNCCL.cpp:290, unhandled system error
I think NCCL_IB_DISABLE should be set to 0 (or unset), but haven't tried yet.
After checking with AzureML folks, it turned out I have to use Intel MPI as the backend when I use nodes without SR-IOV support.
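To make that concrete, a minimal sketch of the backend choice (`pick_backend` is a hypothetical helper, not an AzureML API; it assumes PyTorch was built with MPI support, e.g. Intel MPI, when the `"mpi"` backend is selected):

```python
def pick_backend(sriov_supported: bool) -> str:
    # With SR-IOV, NCCL can drive the InfiniBand NIC directly from the VM,
    # so the usual "nccl" backend works. Without SR-IOV, fall back to an
    # MPI backend (Intel MPI on AzureML) for inter-node communication.
    return "nccl" if sriov_supported else "mpi"


# The chosen string would then be passed to
# torch.distributed.init_process_group(backend=pick_backend(...)).
print(pick_backend(False))
```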
SR-IOV stands for “single root input/output virtualization” which optimizes sharing of PCI Express devices in a system with virtual machines. In Azure, SR-IOV for InfiniBand enables near bare-metal performance for any MPI library.
Accelerating Distributed Training in Azure Machine Learning service using SR-IOV
If you have access to NCv3 or NDv2, then you can take advantage of the faster GPU interconnect. SR-IOV support should come to NCv2 and NDv1 later in 2020.
Without SR-IOV, for NCCL, we need to set "NCCL_IB_DISABLE": "1" to disable InfiniBand on RDMA-capable VMs (e.g., ND24rs), since NCCL cannot use the InfiniBand NIC there.
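A small sketch of wiring this up in the training script before process-group init (`configure_nccl` is a hypothetical helper; only `os.environ` is real API here):

```python
import os


def configure_nccl(sriov_supported: bool) -> None:
    # Without SR-IOV, NCCL cannot use the virtualized InfiniBand NIC,
    # so disable the IB transport and let NCCL fall back to IP sockets.
    os.environ["NCCL_IB_DISABLE"] = "0" if sriov_supported else "1"
    # Log NCCL's transport selection so the setting can be verified.
    os.environ["NCCL_DEBUG"] = "INFO"


configure_nccl(sriov_supported=False)
# ...then call torch.distributed.init_process_group(backend="nccl") as usual.
```

These environment variables must be set before NCCL initializes (i.e., before `init_process_group`), otherwise they have no effect.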
Is there a reason for using Standard_NC24s_v3 rather than the RDMA capable Standard_NC24rs_v3?