openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Problem using GPU direct RDMA-reg. #5960

Open srikrrish opened 4 years ago

srikrrish commented 4 years ago

Hi,

I am having problems using GPUDirect RDMA with this simple MPI + CUDA example. These are the modules I am using:

gcc/8.4.0, cuda/11.1.0, openmpi/4.0.5 and boost/1.73.0

 #include <boost/mpi/environment.hpp>
 #include <boost/mpi/communicator.hpp>
 #include <algorithm>    // std::fill
 #include <cstdio>       // printf
 #include <iostream>
 #include <numeric>      // std::accumulate
 #include <vector>
 #include <cuda.h>
 #include <cuda_runtime.h>

 int main(int argc, char* argv[])
 {
     boost::mpi::environment env(argc, argv);
     boost::mpi::communicator world;
     const int N = 100;
     const int size = N * sizeof(float);
     std::vector<float> host(N);
     int rank = 0;
     int nProcs = 0;

     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

     //cudaSetDevice(rank);
     float* device = nullptr;
     cudaMalloc(&device, size);

     int id = -1;
     auto err = cudaGetDevice(&id);
     if (err != cudaSuccess)
         printf("cudaGetDevice error: %d, %s\n", (int)err, cudaGetErrorString(err));
     std::cout << "Rank " << rank << " has device " << id << "\n";

     if (rank == 0) {
         //cudaSetDevice(rank);
         std::fill(host.begin(), host.end(), 1.0f);
         cudaMemcpy(device, host.data(), size, cudaMemcpyHostToDevice);
         // pass the device pointer directly to MPI: this is what exercises
         // CUDA-aware MPI / GPUDirect RDMA
         for (int i = 1; i < nProcs; ++i)
             world.send(i, 100, device, N);
     }
     else {
         //cudaSetDevice(rank);
         // receive directly into device memory, then copy back for the check
         world.recv(0, 100, device, N);
         cudaMemcpy(host.data(), device, size, cudaMemcpyDeviceToHost);
         float sum = std::accumulate(host.begin(), host.end(), 0.0f);
         std::cout << "Exact: " << N << ", Computed: " << sum
                   << ", Rank: " << rank << std::endl;
     }
     cudaFree(device);
     MPI_Finalize();
     return 0;
 }
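For reference, this is roughly how the example is built and launched; the source file name, the Boost/CUDA library names and paths, and the mpirun flags below are placeholders that depend on the local installation:

# sketch only: adjust library names/paths to the local toolchain
mpicxx gpudirect_test.cpp -o gpudirect_test \
    -lboost_mpi -lboost_serialization \
    -L"${CUDA_HOME}/lib64" -lcudart
# run one rank per node so the transfer actually crosses the network
mpirun -np 2 --map-by node ./gpudirect_test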

This runs fine on a single node, but in the multi-node case I get the following error:

[1606396161.163639] [merlin-g-001:1809 :0]          ib_md.c:325  UCX  ERROR ibv_reg_mr(address=0x2b2d12a00000, length=400, access=0xf) failed: Bad address
[1606396161.163660] [merlin-g-001:1809 :0]         ucp_mm.c:137  UCX  ERROR failed to register address 0x2b2d12a00000 mem_type bit 0x2 length 400 on md[7]=mlx5_0: Input/output error (md reg_mem_types 0x17)
[1606396161.163665] [merlin-g-001:1809 :0]    ucp_request.c:269  UCX  ERROR failed to register user buffer datatype 0x20 address 0x2b2d12a00000 len 400: Input/output error

I came across issue https://github.com/openucx/ucx/issues/4707, and as per one of the suggestions there, setting the environment variable UCX_IB_GPU_DIRECT_RDMA=no makes the program run fine.
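In case it helps others, this is how I apply the workaround (a sketch; the binary name and launcher flags are placeholders for our setup):

UCX_IB_GPU_DIRECT_RDMA=no mpirun -np 2 --map-by node ./gpudirect_test
# or export the variable through Open MPI explicitly:
mpirun -np 2 --map-by node -x UCX_IB_GPU_DIRECT_RDMA=no ./gpudirect_test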

I am using NVIDIA GTX 1080 GPUs. Is there any suggestion on how I could get GPUDirect RDMA to work properly for this case? Your help is much appreciated.

bureddy commented 4 years ago

@srikrrish can you confirm whether you have installed the GPUDirect RDMA plugin? (https://www.mellanox.com/products/GPUDirect-RDMA)

mcaubet commented 4 years ago

Hi @bureddy , this is what is installed in the GPU systems:

# ofed_info -s
MLNX_OFED_LINUX-5.0-2.1.8.0:

# lsmod | grep 'nv_peer_mem\|gdrdrv'
gdrdrv                 18460  0 
nv_peer_mem            13163  0 
nvidia              27605367  835 nv_peer_mem,gdrdrv,nvidia_modeset,nvidia_uvm
ib_core               379728  11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib

# modinfo nv_peer_mem
filename:       /lib/modules/3.10.0-1062.18.1.el7.x86_64/extra/nv_peer_mem.ko
version:        1.1-0
license:        Dual BSD/GPL
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
retpoline:      Y
rhelversion:    7.7
srcversion:     2B6F943DAF5A7B8C989DE56
depends:        ib_core,nvidia
vermagic:       3.10.0-1062.18.1.el7.x86_64 SMP mod_unload modversions 

# modinfo gdrdrv
filename:       /lib/modules/3.10.0-1062.18.1.el7.x86_64/kernel/drivers/misc/gdrdrv.ko
version:        2.1
description:    GDRCopy kernel-mode driver
license:        MIT
author:         drossetti@nvidia.com
retpoline:      Y
rhelversion:    7.7
srcversion:     7D2CAF28B1ADA156229B27C
depends:        nv-p2p-dummy
vermagic:       3.10.0-1062.18.1.el7.x86_64 SMP mod_unload modversions 
parm:           dbg_enabled:enable debug tracing (int)
parm:           info_enabled:enable info tracing (int)

# lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.7 (Maipo)
Release:    7.7
Codename:   Maipo

# rpm -qa | grep kmod-nvidia-latest-dkms
kmod-nvidia-latest-dkms-455.23.05-1.el7.x86_64

# rpm -qa | grep ucx
ucx-cuda-1.8.0-1.50218.x86_64
ucx-1.8.0-1.50218.x86_64
ucx-ib-1.8.0-1.50218.x86_64

bureddy commented 4 years ago

@srikrrish is it possible to give it a try with ucx-1.9.x?
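(For anyone who lands here later: before and after switching versions it can help to confirm which UCX the runtime actually picks up and whether it was built with CUDA support, e.g. with ucx_info; the exact output varies by release.)

ucx_info -v
ucx_info -d | grep -i -e cuda -e gdr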

mcaubet commented 4 years ago

@srikrrish @bureddy I will try to configure one of those GPU systems with UCX 1.9.x. I'll get back to you ASAP.

Cheers, Marc

srikrrish commented 4 years ago

@bureddy Thanks a lot for your help. @mcaubet is the one who installs and maintains the software stacks on our clusters, which is why I asked him to comment directly in these threads.

mcaubet commented 3 years ago

@bureddy after upgrading 2 GPU systems to UCX 1.9 (I also upgraded OFED from 5.0 to 5.1, which includes UCX 1.9), it works. It was strange that on 2 systems with Quadro cards this problem was never seen, while on the GTX 1080 systems it showed up even with a simple ucx_perftest run (with CUDA).
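(For reference, the kind of CUDA ucx_perftest check mentioned above looks roughly like this; the options are a sketch, -m cuda needs UCX built with CUDA support, and exact flags can differ between releases.)

# on the first node (acts as the server):
ucx_perftest -t tag_bw -m cuda
# on the second node, pointing at the first:
ucx_perftest <first-node-hostname> -t tag_bw -m cuda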

In any case, I will upgrade all nodes to UCX 1.9.