linux-rdma / perftest

Infiniband Verbs Performance Tests

Couldn't allocate MR while testing GDR with CUDA #269

Open derekwin opened 1 week ago

derekwin commented 1 week ago

system info

Ubuntu 22.04
kernel: 6.5.0-28-generic

NVIDIA driver and CUDA version:

Driver Version: 555.42.02
CUDA Version: 12.5

I installed the RDMA OFED driver before installing the CUDA driver and CUDA toolkit.

peermem module status:

nvidia_peermem         16384  0
nvidia_uvm           4943872  0
nvidia_drm            122880  0
nvidia_modeset       1368064  1 nvidia_drm
nvidia              54566912  3 nvidia_uvm,nvidia_peermem,nvidia_modeset
video                  73728  1 nvidia_modeset
ib_core               557056  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm_kms_helper        274432  4 ast,nvidia_drm
drm                   765952  6 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm
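
The listing above can also be checked from a script. This is a minimal sketch that parses `lsmod`-style text and reports whether `nvidia_peermem` is loaded and appears among the users of `nvidia` and `ib_core`. The names `parse_lsmod`, `peermem_status`, and `SAMPLE` are hypothetical helpers, not part of perftest. A loaded module only shows that the prerequisite is present; it does not show that memory registration will succeed.

```python
def parse_lsmod(text):
    """Parse lsmod-style lines into {module: {"size", "refs", "used_by"}}."""
    modules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            size, refs = int(parts[1]), int(parts[2])
        except ValueError:
            # Not a module row, e.g. the "Module Size Used by" header.
            continue
        used_by = [m for m in parts[3].split(",") if m] if len(parts) > 3 else []
        modules[parts[0]] = {"size": size, "refs": refs, "used_by": used_by}
    return modules


def peermem_status(text):
    """Summarize whether nvidia_peermem is loaded and tied to nvidia and ib_core."""
    modules = parse_lsmod(text)

    def users(name):
        return modules.get(name, {}).get("used_by", [])

    return {
        "loaded": "nvidia_peermem" in modules,
        "uses_nvidia": "nvidia_peermem" in users("nvidia"),
        "uses_ib_core": "nvidia_peermem" in users("ib_core"),
    }


# Three rows copied from the listing in this report.
SAMPLE = """\
nvidia_peermem         16384  0
nvidia              54566912  3 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core               557056  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
"""

status = peermem_status(SAMPLE)
print(status)
```

On a live system the text could come from `subprocess.run(["lsmod"], capture_output=True, text=True).stdout` instead of `SAMPLE`.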

Error occurred:
./ib_send_bw --use_cuda=0

Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 1B:00
CUDA device 1: PCIe address is 3E:00
CUDA device 2: PCIe address is 89:00
CUDA device 3: PCIe address is B2:00

Picking device No. 0
[pid = 3164333, dev = 0] device name = [NVIDIA GeForce RTX 4090]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007c43eac00000 pointer=0x7c43eac00000
Couldn't allocate MR
failed to create mr
Failed to create MR
 Couldn't create IB resources
destroying current CUDA Ctx

./ib_send_bw --use_cuda=0 192.168.2.244

Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 1B:00
CUDA device 1: PCIe address is 3E:00
CUDA device 2: PCIe address is 89:00
CUDA device 3: PCIe address is B2:00

Picking device No. 0
[pid = 3164350, dev = 0] device name = [NVIDIA GeForce RTX 4090]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007847fac00000 pointer=0x7847fac00000
Couldn't allocate MR
failed to create mr
Failed to create MR
 Couldn't create IB resources
destroying current CUDA Ctx
derekwin commented 6 days ago

Sorry, I didn't notice this suggestion:

  1. If GPUDirect is not working, (e.g. you see "Couldn't allocate MR" error message), consider disabling Scatter to CQE feature. Set the environmental variable MLX5_SCATTER_TO_CQE=0. E.g.: MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d ib_dev --use_cuda= -a
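
A minimal sketch of applying that suggestion from a script rather than by hand: it copies the environment, sets `MLX5_SCATTER_TO_CQE=0`, and builds the perftest command line. The helper names `scatter_to_cqe_disabled_env` and `build_perftest_command` are hypothetical. The `--use_cuda=<index>` form and the server address are taken from the commands earlier in this report, and the GPU index is left as a parameter because the quoted suggestion does not show one.

```python
import os


def scatter_to_cqe_disabled_env(base_env=None):
    """Return a copy of the environment with Scatter to CQE disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["MLX5_SCATTER_TO_CQE"] = "0"
    return env


def build_perftest_command(binary, gpu_index, server=None):
    """Build the argument list; pass server=None for the listening side."""
    cmd = [binary, f"--use_cuda={gpu_index}"]
    if server is not None:
        cmd.append(server)
    return cmd


# Client-side command as used in this report (nothing is executed here).
client_cmd = build_perftest_command("./ib_send_bw", 0, "192.168.2.244")
client_env = scatter_to_cqe_disabled_env({"PATH": "/usr/bin"})
print(client_cmd, client_env["MLX5_SCATTER_TO_CQE"])
```

To actually launch the test, the two values could be passed to `subprocess.run(client_cmd, env=scatter_to_cqe_disabled_env())`; passing the variable through `env` keeps it scoped to that one process.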
derekwin commented 6 days ago

After setting MLX5_SCATTER_TO_CQE=0, the problem still exists.