openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Question on using UCX+CUDA on machines with more than four IB HCAs and multiple GPUs #7273

Open devszr opened 3 years ago

devszr commented 3 years ago

This is not necessarily a bug report, but I am new to the UCX+CUDA world and wanted to get your thoughts on whether it should be possible to use UCX+CUDA for GPUDirect operations on two machines that each have more than 4 IB HCAs.

Let's say we have two machines, each with 8 active IB ports and multiple GPU devices (for example 8). Should it be possible to run an MPI CUDA application with OpenMPI-4.x compiled with UCX+CUDA (assuming all the nv_peer* and other GPUDirect modules are compiled and loaded) and have it utilize all 8 IB links to achieve max throughput, with all GPU-to-GPU operations going over these IB links?

I read in another post in the issue list that a maximum of 3 devices can be specified for multi-rail; I'm not sure whether that is related to my question.

Thanks for any tips or insight into expected behaviors.

Akshay-Venkatesh commented 3 years ago

@devszr The most common case we test with is each of the 8 processes using a dedicated GPU. For this case, as long as the GPU context is initialized before UCP, the closest NIC to the GPU should be selected automatically (at least on bare metal, as long as sysfs paths accurately represent the relative distance between NICs and GPUs). If you're trying to use 8 GPUs from the same process, then it's still possible to have the closest NICs be bound to each of the GPUs, but some additional binding scripts would likely be necessary -- especially if you're using a single UCP context for all GPUs. Which case are you using, and are you seeing suboptimal use of the NICs?
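
If you want to sanity-check what the automatic selection has to work with, one quick way (just an approximation of the same sysfs information, not the exact logic UCX uses) is to compare the PCI paths the OS reports for the GPUs and the HCAs, e.g.:

$ nvidia-smi topo -m                                    # GPU/NIC affinity matrix
$ nvidia-smi --query-gpu=index,pci.bus_id --format=csv  # PCI location of each GPU
$ for dev in /sys/class/infiniband/*; do echo "$(basename $dev) -> $(readlink -f $dev/device)"; done

GPUs and HCAs that sit under the same PCIe switch or root complex share a prefix in those sysfs paths; on a VM where the virtual topology doesn't match the physical one, this is where the mismatch would show up.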

devszr commented 3 years ago

@Akshay-Venkatesh

Thank you for the quick response.

This is something we are testing at a cloud vendor, and the instance is a VM. I am not sure the topology reported within the OS correctly represents the physical hardware (something that is being looked into).

As an example, with 8 GPUs and 8 IB HCAs per VM (two VMs are being used, each with 8 GPUs and 8 IB HCAs), if we start an OpenMPI job with 32 processes on each VM, we see 4 processes sharing each GPU on both VMs. However, looking at the IB stats, it seems like UCX's internal selection mechanism uses just one of the IB links, because whatever is returned by commands like lstopo is most likely not what is on the physical machine.

Is it possible to somehow provide UCX with the mapping we want it to use, instead of having it automatically select the IB HCAs for each process? Ideally, I suppose, we would want each of the 4 MPI processes on a given GPU to use one dedicated IB HCA on each machine; that would utilize all the IB links and all the GPUs and hopefully achieve max throughput.

Akshay-Venkatesh commented 3 years ago

@devszr In the past we've used a script as follows on DGX-1 which has 8 GPUs and 4 NICs:


$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     NV2     SYS     SYS     SYS     PIX     PHB     SYS     SYS     0-19,40-59      0
GPU1    NV1      X      NV2     NV1     SYS     NV2     SYS     SYS     PIX     PHB     SYS     SYS     0-19,40-59      0
GPU2    NV1     NV2      X      NV2     SYS     SYS     NV1     SYS     PHB     PIX     SYS     SYS     0-19,40-59      0
GPU3    NV2     NV1     NV2      X      SYS     SYS     SYS     NV1     PHB     PIX     SYS     SYS     0-19,40-59      0
GPU4    NV2     SYS     SYS     SYS      X      NV1     NV1     NV2     SYS     SYS     PIX     PHB     20-39,60-79     1
GPU5    SYS     NV2     SYS     SYS     NV1      X      NV2     NV1     SYS     SYS     PIX     PHB     20-39,60-79     1
GPU6    SYS     SYS     NV1     SYS     NV1     NV2      X      NV2     SYS     SYS     PHB     PIX     20-39,60-79     1
GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2      X      SYS     SYS     PHB     PIX     20-39,60-79     1
mlx5_0  PIX     PIX     PHB     PHB     SYS     SYS     SYS     SYS      X      PHB     SYS     SYS
mlx5_1  PHB     PHB     PIX     PIX     SYS     SYS     SYS     SYS     PHB      X      SYS     SYS
mlx5_2  SYS     SYS     SYS     SYS     PIX     PIX     PHB     PHB     SYS     SYS      X      PHB
mlx5_3  SYS     SYS     SYS     SYS     PHB     PHB     PIX     PIX     SYS     SYS     PHB      X 

$ cat get_local_ompi_rank_hca
#!/bin/bash

export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
if [[ -z "${CUDA_VISIBLE_DEVICES}" ]]; then
    # No CUDA_VISIBLE_DEVICES: assume GPU index == local rank; two GPUs share each NIC
    index=$((LOCAL_RANK / 2))
else
    # CUDA_VISIBLE_DEVICES is set: split it on commas, pick this rank's entry,
    # and derive the NIC index from that physical GPU index
    variable=${CUDA_VISIBLE_DEVICES}
    j=0
    for i in $(echo $variable | sed "s/,/ /g")
    do
        cvd[$j]=$i
        j=$((j + 1))
    done
    adj_i=$(( LOCAL_RANK % j ))
    adj_cvd=${cvd[$adj_i]}
    index=$(( adj_cvd / 2 ))
fi

if (( $index == 0 )); then
    export UCX_NET_DEVICES=mlx5_0:1
elif (( $index == 1 )); then
    export UCX_NET_DEVICES=mlx5_1:1
elif (( $index == 2 )); then
    export UCX_NET_DEVICES=mlx5_2:1
elif (( $index == 3 )); then
    export UCX_NET_DEVICES=mlx5_3:1
fi

echo "local rank $LOCAL_RANK: using hca $UCX_NET_DEVICES"
exec "$@"

$ chmod +x get_local_ompi_rank_hca
$ mpirun -np $np $options get_local_ompi_rank_hca $exec

The above script makes the assumption that each process uses a GPU as dictated by OMPI_COMM_WORLD_LOCAL_RANK (that is, the rank within the node). If the process-to-GPU mapping is different, you'll need to adjust that as well.

The selection logic here is straightforward: according to the topology matrix above, processes using GPU0/GPU1 are closest (PIX) to mlx5_0, GPU2/GPU3 to mlx5_1, GPU4/GPU5 to mlx5_2, and GPU6/GPU7 to mlx5_3.

So the general rule of nic_index = (gpu_index / 2) works. You may need to adjust this for your setup.

> Is it possible to somehow provide UCX with the mapping we want it to use, instead of having it automatically select the IB HCAs for each process? Ideally, I suppose, we would want each of the 4 MPI processes on a given GPU to use one dedicated IB HCA on each machine; that would utilize all the IB links and all the GPUs and hopefully achieve max throughput.

The script then goes on to set the UCX_NET_DEVICES environment variable to restrict the set of network devices that UCX can use. This way, each MPI process can be forced to pick the NIC that's closest to the GPU it uses, and GPU-centric communication can be optimized. That said, this is not guaranteed to optimize overall communication (especially if host communication relies on using all available NICs with multi-rail), since it also restricts the set of NICs that the MPI ranks can use for any communication. Optimizing all paths is a future goal (>= 1.12 release) for UCX at the moment.
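
For your 8-GPU/8-HCA VMs, the natural starting point is a 1:1 mapping. Here is a minimal sketch of how the wrapper above could be adapted, assuming the HCAs are named mlx5_0 through mlx5_7 and that GPU i is closest to mlx5_i (verify this against nvidia-smi topo -m or the vendor's documented topology before relying on it):

$ cat get_local_ompi_rank_hca
#!/bin/bash
# Sketch for an 8-GPU / 8-HCA node; adjust the device names and the
# rank->GPU mapping to your actual topology and launch configuration.
LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK

# With 32 ranks per node and 8 GPUs, LOCAL_RANK % 8 spreads the ranks across
# all GPUs (4 ranks per GPU); if your application picks its GPU differently,
# mirror that logic here instead.
gpu_index=$(( LOCAL_RANK % 8 ))

# Assumed 1:1 GPU<->HCA pairing: GPU i uses mlx5_i
export UCX_NET_DEVICES=mlx5_${gpu_index}:1

echo "local rank $LOCAL_RANK: gpu $gpu_index, hca $UCX_NET_DEVICES"
exec "$@"

UCX_NET_DEVICES also accepts a comma-separated list (e.g. mlx5_0:1,mlx5_2:1) if you later want to allow a rank more than one NIC for multi-rail.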

devszr commented 3 years ago

@Akshay-Venkatesh Thank you so much for this explanation. I will try this out, experiment a bit and let you know how it goes.

From my understanding, for GPU-to-GPU RDMA operations with newer versions of OpenMPI, UCX is the way to go, since that is what will be developed and recommended for the foreseeable future.

Looking forward to all the optimizations coming in the future.

devszr commented 3 years ago

@Akshay-Venkatesh This worked. We are now able to see all 8 IB interfaces being used, and we are also able to use all the GPUs! This was great info to get started; we can now do some further analysis for fine-tuning UCX and try out different settings. Thanks.

shamisp commented 3 years ago

@Akshay-Venkatesh @yosefe It would be useful to dump this in the FAQ.

devszr commented 3 years ago

@Akshay-Venkatesh

We have been doing a few tests with this setup and it looks very encouraging. However, we seem to hit some kind of limit when we try to place 8 processes on each GPU and each IB HCA.

When we run with 32 processes on each node and place 4 processes on each of the 8 GPUs and HCAs, everything works fine. But when we increase this to 64 processes per node, i.e. 8 processes on each of the 8 GPUs and HCAs, the application just hangs and doesn't proceed. We also see that the nvidia-smi command becomes extremely slow and hangs, and our application, although allocating a lot of memory on each GPU, shows 0% GPU utilization.

Is there some setting that could be causing a deadlock when using 64 rather than 32 processes per node?

Thanks.

Akshay-Venkatesh commented 3 years ago

@devszr I'm guessing you're already using MPS to oversubscribe the GPUs on the system.

The MPS documentation specifies that on the Volta architecture up to 48 clients are supported on each GPU, so 8 processes per GPU should be OK. You may be hitting memory limits, though. Have you tried to see where the performance cliff starts by trying 5, 6, or 7 processes per GPU? AFAIK, the NIC should easily be able to handle 8 processes. Also, have you tried running a smaller workload for each process while keeping the degree of subscription at 8?
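
If it helps, here's a rough way to run that sweep while keeping an eye on GPU memory (a sketch; it reuses the $options/$exec placeholders from the mpirun line above and assumes 8 GPUs per node and 2 nodes):

for ppg in 4 5 6 7 8; do
    np=$(( ppg * 8 * 2 ))   # ranks per GPU * 8 GPUs * 2 nodes
    echo "=== $ppg ranks per GPU ($np ranks total) ==="
    # log per-GPU memory once per second in the background while the job runs
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1 > mem_${ppg}.csv &
    mon=$!
    mpirun -np $np $options get_local_ompi_rank_hca $exec
    kill $mon
done

If the memory log shows the GPUs close to full at 8 ranks per GPU but not at 7, that would point to a memory limit rather than an MPS client limit.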

devszr commented 3 years ago

@Akshay-Venkatesh Yes, MPS is running on the nodes, but these are A100s, not V100s. I see now that this document states: "CUDA MPS is supported on top of MIG. The only limitation is that the maximum number of clients (48) is lowered proportionally to the Compute Instance size."

Could that be limiting something? However, we did try the same application with the OpenMPI (3.1.5) that comes with the NVIDIA HPC SDK, and that worked with 8 processes/GPU. It is not compiled with UCX, though.

We'll try out the other tests and see where the limit is hit.