Open · srikrrish opened this issue 4 years ago
@srikrrish can you confirm if you installed GPUDirectRDMA plugin? (https://www.mellanox.com/products/GPUDirect-RDMA)
Hi @bureddy , this is what is installed in the GPU systems:
```
# ofed_info -s
MLNX_OFED_LINUX-5.0-2.1.8.0:

# lsmod | grep 'nv_peer_mem\|gdrdrv'
gdrdrv 18460 0
nv_peer_mem 13163 0
nvidia 27605367 835 nv_peer_mem,gdrdrv,nvidia_modeset,nvidia_uvm
ib_core 379728 11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib

# modinfo nv_peer_mem
filename: /lib/modules/3.10.0-1062.18.1.el7.x86_64/extra/nv_peer_mem.ko
version: 1.1-0
license: Dual BSD/GPL
description: NVIDIA GPU memory plug-in
author: Yishai Hadas
retpoline: Y
rhelversion: 7.7
srcversion: 2B6F943DAF5A7B8C989DE56
depends: ib_core,nvidia
vermagic: 3.10.0-1062.18.1.el7.x86_64 SMP mod_unload modversions

# modinfo gdrdrv
filename: /lib/modules/3.10.0-1062.18.1.el7.x86_64/kernel/drivers/misc/gdrdrv.ko
version: 2.1
description: GDRCopy kernel-mode driver
license: MIT
author: drossetti@nvidia.com
retpoline: Y
rhelversion: 7.7
srcversion: 7D2CAF28B1ADA156229B27C
depends: nv-p2p-dummy
vermagic: 3.10.0-1062.18.1.el7.x86_64 SMP mod_unload modversions
parm: dbg_enabled:enable debug tracing (int)
parm: info_enabled:enable info tracing (int)

# lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.7 (Maipo)
Release: 7.7
Codename: Maipo

# rpm -qa | grep kmod-nvidia-latest-dkms
kmod-nvidia-latest-dkms-455.23.05-1.el7.x86_64

# rpm -qa | grep ucx
ucx-cuda-1.8.0-1.50218.x86_64
ucx-1.8.0-1.50218.x86_64
ucx-ib-1.8.0-1.50218.x86_64
```
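In addition to the `rpm -qa` query above, the `ucx_info` tool that ships with UCX can report the runtime release and whether CUDA transports were built in. A minimal illustrative sketch (the fallback message is mine; output will differ per node):

```shell
# Illustrative check: confirm the installed UCX release and CUDA support.
# ucx_info ships with UCX; the else branch covers nodes where it is not on PATH.
if command -v ucx_info >/dev/null 2>&1; then
  ucx_info -v | head -n 1              # library release line
  ucx_info -d | grep -ci cuda || true  # count of CUDA-related entries (0 if none)
else
  echo "ucx_info not found; falling back to: rpm -qa | grep ucx"
fi
status="done"
echo "UCX check: $status"
```

Running this on each node before and after an upgrade makes it easy to spot a mismatch between the packaged version and the library actually loaded at runtime.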
@srikrrish is it possible to give it a try with ucx-1.9.x?
@srikrrish @bureddy I will configure one of these GPU nodes with UCX 1.9.x. I'll get back to you as soon as possible.
Cheers, Marc
@bureddy Thanks a lot for your help. @mcaubet is the one who installs and maintains the software stacks on our clusters, which is why I asked him to comment directly in this thread.
@bureddy After upgrading 2 GPU systems to UCX 1.9 (I also upgraded OFED from 5.0 to 5.1, which bundles UCX 1.9), it works. Oddly, the problem never appeared on the 2 systems with Quadro cards, while on the GTX 1080 systems it showed up even with a simple ucx_perftest run (with CUDA).
In any case, I will upgrade all nodes to UCX 1.9.
Hi,
I am having problems using GPUDirect RDMA in a simple MPI + CUDA example. These are the modules I am using:
gcc/8.4.0, cuda/11.1.0, openmpi/4.0.5 and boost/1.73.0
It runs fine on a single node, but in the multi-node case I get the following error.
I came across issue https://github.com/openucx/ucx/issues/4707, and as per one of the suggestions there, setting the environment variable
UCX_IB_GPU_DIRECT_RDMA=no
makes the program run fine. I am using NVIDIA GTX 1080 GPUs. Is there a suggestion on how I could make GPUDirect RDMA work properly here? Your help is much appreciated.
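For reference, the workaround can be applied per job without touching system configuration. A minimal sketch, assuming an Open MPI launcher (`./a.out` stands in for the actual MPI+CUDA binary):

```shell
# Workaround noted in openucx/ucx#4707: disable GPUDirect RDMA in UCX.
export UCX_IB_GPU_DIRECT_RDMA=no
echo "UCX_IB_GPU_DIRECT_RDMA=$UCX_IB_GPU_DIRECT_RDMA"
# With Open MPI the variable has to reach every rank; -x exports it:
#   mpirun -np 2 -x UCX_IB_GPU_DIRECT_RDMA=no ./a.out
```

Note this only masks the problem by falling back to staging through host memory; the upgrade to UCX 1.9 discussed above is the actual fix.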