devszr opened 3 years ago
@devszr The most common case we test is each of the 8 processes using a dedicated GPU. In that case, as long as the GPU context is initialized before UCP, the closest NIC to the GPU should be selected automatically (at least on bare metal, as long as sysfs paths accurately represent the relative distance between NICs and GPUs). If you're trying to use 8 GPUs from the same process, it's still possible to have the closest NIC bound to each of the GPUs, but some additional binding scripts would likely be necessary -- especially if you're using a single UCP context for all GPUs. Which case are you using, and are you seeing suboptimal use of the NICs?
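In case it helps later: a quick way to sanity-check what sysfs reports is to compare the resolved PCI paths of a GPU and an HCA (the device name mlx5_0 and the bus ID below are just examples, substitute your own; note that sysfs uses the short lower-case 0000:bb:dd.f form of the bus ID that nvidia-smi prints):
$ nvidia-smi --query-gpu=index,pci.bus_id --format=csv
$ readlink -f /sys/class/infiniband/mlx5_0/device
$ readlink -f /sys/bus/pci/devices/0000:06:00.0
# If the two resolved paths share a long common prefix (same PCIe bridge/switch),
# the GPU and HCA are close; if they diverge right under the root complex, they are not.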
@Akshay-Venkatesh
Thank you for the quick response.
This is something we are testing at a cloud vendor, and the instance is a VM. I am not sure the topology reported inside the OS correctly represents the physical hardware (something that is being looked into).
As an example, with 8 GPUs and 8 IB HCAs per VM (two VMs are being used, each with 8 GPUs and 8 IB HCAs), if we start an OpenMPI job with 32 processes on each VM, we see 4 processes on each GPU on both VMs. However, looking at the IB stats, it seems like UCX's internal selection mechanism ends up using just one of the IB links, because whatever is returned by commands like lstopo is most likely not what is on the physical machine.
Is it possible to somehow provide UCX with the mapping we want it to use, instead of having it automatically select the IB HCAs for each process? Ideally, I suppose, we would want each of the 4 MPI processes on each GPU to use one dedicated IB HCA on each machine; that would utilize all the IB links and all the GPUs and hopefully achieve max throughput.
@devszr In the past we've used a script as follows on DGX-1 which has 8 GPUs and 4 NICs:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity NUMA Affinity
GPU0 X NV1 NV1 NV2 NV2 SYS SYS SYS PIX PHB SYS SYS 0-19,40-59 0
GPU1 NV1 X NV2 NV1 SYS NV2 SYS SYS PIX PHB SYS SYS 0-19,40-59 0
GPU2 NV1 NV2 X NV2 SYS SYS NV1 SYS PHB PIX SYS SYS 0-19,40-59 0
GPU3 NV2 NV1 NV2 X SYS SYS SYS NV1 PHB PIX SYS SYS 0-19,40-59 0
GPU4 NV2 SYS SYS SYS X NV1 NV1 NV2 SYS SYS PIX PHB 20-39,60-79 1
GPU5 SYS NV2 SYS SYS NV1 X NV2 NV1 SYS SYS PIX PHB 20-39,60-79 1
GPU6 SYS SYS NV1 SYS NV1 NV2 X NV2 SYS SYS PHB PIX 20-39,60-79 1
GPU7 SYS SYS SYS NV1 NV2 NV1 NV2 X SYS SYS PHB PIX 20-39,60-79 1
mlx5_0 PIX PIX PHB PHB SYS SYS SYS SYS X PHB SYS SYS
mlx5_1 PHB PHB PIX PIX SYS SYS SYS SYS PHB X SYS SYS
mlx5_2 SYS SYS SYS SYS PIX PIX PHB PHB SYS SYS X PHB
mlx5_3 SYS SYS SYS SYS PHB PHB PIX PIX SYS SYS PHB X
$ cat get_local_ompi_rank_hca
#!/bin/bash
# Map each local MPI rank to the HCA closest to its GPU (DGX-1 topology).
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK

if [[ -z "${CUDA_VISIBLE_DEVICES}" ]]; then
    # No explicit GPU list: assume rank N uses GPU N, and two GPUs share a NIC.
    index=$((LOCAL_RANK / 2))
else
    # Pick this rank's GPU out of CUDA_VISIBLE_DEVICES, then map it to a NIC.
    j=0
    for i in $(echo "${CUDA_VISIBLE_DEVICES}" | sed "s/,/ /g"); do
        cvd[$j]=$i
        j=$((j + 1))
    done
    adj_i=$((LOCAL_RANK % j))
    adj_cvd=${cvd[$adj_i]}
    index=$((adj_cvd / 2))
fi

if (( index == 0 )); then
    export UCX_NET_DEVICES=mlx5_0:1
elif (( index == 1 )); then
    export UCX_NET_DEVICES=mlx5_1:1
elif (( index == 2 )); then
    export UCX_NET_DEVICES=mlx5_2:1
elif (( index == 3 )); then
    export UCX_NET_DEVICES=mlx5_3:1
fi

echo "local rank $LOCAL_RANK: using hca $UCX_NET_DEVICES"
exec "$@"
$ chmod +x get_local_ompi_rank_hca
$ mpirun -np $np $options get_local_ompi_rank_hca $exec
The above script makes the assumption that each process uses a GPU as dictated by OMPI_COMM_WORLD_LOCAL_RANK (that is, the rank within the node). If the process-to-GPU mapping is different, you'll need to adjust that as well.
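If the ranks are not already pinned to GPUs elsewhere, one minimal way to handle both mappings in the same wrapper (this is an assumption on my part, not part of the script above) is:
# Hypothetical addition near the top of get_local_ompi_rank_hca:
# pin each local rank to one GPU, then reuse the two-GPUs-per-NIC rule below.
export CUDA_VISIBLE_DEVICES=$((LOCAL_RANK % 8))   # assumes 8 GPUs per node
index=$((CUDA_VISIBLE_DEVICES / 2))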
The selection logic here is straightforward: processes using GPU 0 or 1 pick mlx5_0, GPU 2 or 3 pick mlx5_1, GPU 4 or 5 pick mlx5_2, and GPU 6 or 7 pick mlx5_3 (following the PIX entries in the topology above). So the general rule of nic_index = gpu_index / 2 works here. You may need to adjust this for your setup.
> Is it possible to somehow provide UCX with the mapping we want it to use, instead of having it automatically select the IB HCAs for each process? Ideally, I suppose, we would want each of the 4 MPI processes on each GPU to use one dedicated IB HCA on each machine; that would utilize all the IB links and all the GPUs and hopefully achieve max throughput.
The script sets the UCX_NET_DEVICES env var to restrict the set of network devices that UCX can use. This way, each MPI process can be forced to pick the NIC closest to the GPU it uses, and GPU-centric communication can be optimized. That said, this is not guaranteed to optimize overall communication (especially if host communication relies on using all available NICs via multi-rail), as it also restricts the set of NICs the MPI ranks can use for any communication. Optimizing all paths is a future goal (>= 1.12 release) for UCX at the moment.
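To double-check which devices are visible and which ones actually get picked, something like this works (a sketch; ucx_info availability and the exact log wording vary across UCX versions):
$ ucx_info -d | grep Device:                                   # all devices UCX detects
$ mpirun -np $np -x UCX_LOG_LEVEL=info $options get_local_ompi_rank_hca $exec
# info-level UCX logs mention the transports/devices chosen for each endpoint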
@Akshay-Venkatesh Thank you so much for this explanation. I will try this out, experiment a bit and let you know how it goes.
From my understanding, for all GPU-GPU RDMA operations with newer versions of OpenMPI, UCX is the way to go, since it is the path that will be developed and recommended for the foreseeable future.
Looking forward to all the optimizations coming in the future.
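For anyone following along, a quick way to confirm the stack was actually built with UCX and CUDA support (a sketch; exact output strings differ across versions):
$ ompi_info | grep -i ucx      # expect the ucx pml (and osc) components to be listed
$ ucx_info -v                  # UCX version and configure flags; look for --with-cuda
$ ucx_info -d | grep -i cuda   # expect cuda_copy / cuda_ipc transports if CUDA support is in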
@Akshay-Venkatesh This worked. We are now able to see all 8 IB interfaces being used, and we are also able to use all the GPUs! This was great info to get started, and we can do some further analysis to fine-tune UCX and try out different settings. Thanks.
@Akshay-Venkatesh @yosefe It would be useful to add this to the FAQ.
@Akshay-Venkatesh
We have been doing a few tests with this setup and it looks very encouraging. However, we seem to hit some kind of limit when we try to place 8 processes on each GPU and each IB HCA.
When we run 32 processes on each node, placing 4 processes on each of the 8 GPUs and HCAs, everything works fine. But when we increase this to 64 processes per node, i.e. 8 processes on each of the 8 GPUs and HCAs, the application just hangs and doesn't proceed. We also see that the nvidia-smi command becomes extremely slow and hangs, and our application, although allocating a lot of memory on each GPU, shows 0% GPU utilization.
Is there some setting which could be deadlocking when going from 32 to 64 processes per node?
Thanks.
@devszr I'm guessing you're already using MPS to oversubscribe the GPUs on the system.
The linked document specifies that on the Volta architecture, up to 48 clients are supported per GPU, so 8 processes per GPU should be fine. You may be hitting memory limits, though. Have you tried to see where the performance cliff starts by running with 5, 6, or 7 processes per GPU? AFAIK the NIC should easily handle 8 processes. Also, have you tried running a smaller workload per process while keeping the degree of subscription at 8?
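Something like the following sweep could narrow down where things break (a sketch only; 2 nodes with 8 GPUs each is assumed, and $options/$exec are the same placeholders as above):
#!/bin/bash
# Try 4..8 processes per GPU and note where the hang or performance cliff starts.
for ppg in 4 5 6 7 8; do
    np=$((ppg * 8 * 2))      # ppg ranks per GPU x 8 GPUs x 2 nodes
    echo "=== $ppg processes per GPU ($np ranks total) ==="
    mpirun -np $np $options get_local_ompi_rank_hca $exec
done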
@Akshay-Venkatesh Yes, MPS is running on the nodes, but these are A100s, not V100s. I see now that this document states that CUDA MPS is supported on top of MIG, and the only limitation is that the maximum number of clients (48) is lowered proportionally to the Compute Instance size.
Could that be limiting something? However, we did try the same application using the OpenMPI (3.1.5) that comes with the NVIDIA HPC SDK, and that worked with 8 processes/GPU. It is not compiled with UCX, though.
We'll try out the other tests and see where the limit is hit.
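One thing worth checking on the MIG front (hedged; the query field name is from recent drivers and may differ on yours):
$ nvidia-smi --query-gpu=index,mig.mode.current --format=csv   # is MIG actually enabled?
$ ps -ef | grep [n]vidia-cuda-mps                              # is the MPS control/server daemon running?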
This is not necessarily a bug report, but I am new to the UCX+CUDA world and wanted to get your thoughts on whether it should be possible to use UCX+CUDA for GPU direct operations on two machines which have more than 4 IB HCAs?
Let's say we have two machines, each with 8 active IB ports and multiple GPU devices (for example 8). Should it be possible to run an MPI CUDA application with OpenMPI 4.x compiled with UCX+CUDA (assuming all the nv_peer* and other GPU direct modules are compiled and loaded) and have it utilize all 8 IB links to achieve max throughput, with all GPU-to-GPU operations going over these IB links?
I read in another post in the issue list that, for multi-rail, a maximum of 3 devices can be specified; I'm not sure if that is related to my question.
Thanks for any tips or insight into expected behaviors.
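For reference, the multi-rail knobs I've seen mentioned look roughly like this (UCX_NET_DEVICES and UCX_MAX_RNDV_RAILS are real UCX variables, but the values here are only an assumption for an 8-HCA node):
$ export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1   # devices UCX may use, comma-separated
$ export UCX_MAX_RNDV_RAILS=2                                  # rails used per rendezvous transfer
$ mpirun -np $np $options $exec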