linux-rdma / perftest

Infiniband Verbs Performance Tests

IBV_WC_LOC_PROT_ERR (4) and IBV_WC_REM_OP_ERR (11) when using GPU #200

Closed: lizraymond closed this issue 1 year ago

lizraymond commented 1 year ago

Hi all,

I am trying to run rdma perftest (specifically ib_write_bw & ib_read_bw) between two identical nodes but continue encountering errors. The client dies, but I have to manually cancel the server.

Server configuration:

I am using GPU0 & mlx5_0, which are on the same PCIe switch and therefore the same numa node.
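
For reference, the affinity can be confirmed with the usual tooling (a sketch, assuming nvidia-smi is installed and using GPU0's 19:00 PCIe address from the CUDA device listing below):

nvidia-smi topo -m                                   # GPU/NIC PCIe relationship matrix; GPU0 vs mlx5_0 should show PIX or PXB
cat /sys/class/infiniband/mlx5_0/device/numa_node    # NUMA node the HCA sits on
cat /sys/bus/pci/devices/0000:19:00.0/numa_node      # NUMA node of GPU0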

I found the earlier issues where someone fixed their problem by adjusting MMIO settings, but I am on a bare-metal platform running as the root user, so the process should be able to access anything it needs. I also set the ulimit max memory size to unlimited. I also found the issue where the IOMMU was at fault, but I have disabled the IOMMU on the kernel command line in grub and turned off all SR-IOV and virtualization functions in the system BIOS, and none of it helped.
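
For anyone retracing those steps, the checks look roughly like this (a sketch, assuming a GRUB-configured Intel platform; substitute amd_iommu on AMD):

cat /proc/cmdline                    # confirm intel_iommu=off (or iommu=off) actually made it onto the kernel command line
dmesg | grep -i -e DMAR -e IOMMU     # verify the kernel did not enable DMA remapping at boot
ulimit -l                            # locked-memory limit in the shell running perftest; RDMA registration is bounded by RLIMIT_MEMLOCK, so this should read "unlimited"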

rping & ibv_rc_pingpong work fine, and ping can hit the IPoIB addresses with no issue. MTU is set to the max of 4096 by the SM, which supports IPoIB.

Mellanox Info:

ibstat
CA 'mlx5_0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.35.1012
        Hardware version: 0
        Node GUID: 0x1070fd0300f9bb50
        System image GUID: 0x1070fd0300f9bb50
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 400
                Base lid: 31
                LMC: 0
                SM lid: 1
                Capability mask: 0xa651e848
                Port GUID: 0x1070fd0300f9bb50
                Link layer: InfiniBand

Error 4 Info:

MLX5_SCATTER_TO_CQE=0 numactl --cpunodebind=0 --membind=0 ./ib_write_bw -F -a -d mlx5_0 --report_gbits -i 1 -R -l 4 -q 4
MLX5_SCATTER_TO_CQE=0 numactl --cpunodebind=0 --membind=0 ./ib_write_bw -F -a -d mlx5_0 --report_gbits -i 1 --use_cuda=0 -R -l 4 -q 4 10.0.0.30
---------------------------------------------------------------------------------------
Post List requested - CQ moderation will be the size of the post list
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 19:00
CUDA device 1: PCIe address is 3B:00
CUDA device 2: PCIe address is 4C:00
CUDA device 3: PCIe address is 5D:00
CUDA device 4: PCIe address is 9B:00
CUDA device 5: PCIe address is BB:00
CUDA device 6: PCIe address is CB:00
CUDA device 7: PCIe address is DB:00

Picking device No. 0
[pid = 40856, dev = 0] device name = [NVIDIA A100-SXM4-80GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 67108864 bytes GPU buffer
allocated GPU buffer address at 000014a0da000000 pointer=0x14a0da000000
---------------------------------------------------------------------------------------
                    RDMA_Write Post List BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 4            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 Post List       : 4
 CQ Moderation   : 4
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0x1f QPN 0x009a PSN 0x6f8716
 local address: LID 0x1f QPN 0x009b PSN 0xe9b248
 local address: LID 0x1f QPN 0x009c PSN 0x7f19f2
 local address: LID 0x1f QPN 0x009d PSN 0xac1759
 remote address: LID 0x08 QPN 0x0099 PSN 0x92da0
 remote address: LID 0x08 QPN 0x009a PSN 0x2d970a
 remote address: LID 0x08 QPN 0x009b PSN 0xc5cc8c
 remote address: LID 0x08 QPN 0x009c PSN 0x347eeb
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 Completion with error at client
 Failed status 4: wr_id 0 syndrom 0x51
scnt=512, ccnt=0
 Failed to complete run_iter_bw function successfully
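
For anyone decoding these completion codes: status 4 is IBV_WC_LOC_PROT_ERR (the HCA took a protection fault on the local buffer) and status 11, seen below, is IBV_WC_REM_OP_ERR. Since the local buffer here lives in GPU memory, one sanity check worth running on both nodes (assuming a recent NVIDIA driver, where the GPUDirect peer-memory module is named nvidia_peermem; older installs used nv_peer_mem) is:

lsmod | grep -e nvidia_peermem -e nv_peer_mem    # the peer-memory module is needed to register a cuMemAlloc'd buffer on the classic GPUDirect path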

Error 11 Info:

MLX5_SCATTER_TO_CQE=0 numactl --cpunodebind=0 --membind=0 ./ib_read_bw -F -a -d mlx5_0 --report_gbits -i 1 -R -l 4 -q 4 --use_cuda 0
MLX5_SCATTER_TO_CQE=0 numactl --cpunodebind=0 --membind=0 ./ib_read_bw -F -a -d mlx5_0 --report_gbits -i 1 --use_cuda=0 -R -l 4 -q 4 10.0.0.30
---------------------------------------------------------------------------------------
Post List requested - CQ moderation will be the size of the post list
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 19:00
CUDA device 1: PCIe address is 3B:00
CUDA device 2: PCIe address is 4C:00
CUDA device 3: PCIe address is 5D:00
CUDA device 4: PCIe address is 9B:00
CUDA device 5: PCIe address is BB:00
CUDA device 6: PCIe address is CB:00
CUDA device 7: PCIe address is DB:00

Picking device No. 0
[pid = 41135, dev = 0] device name = [NVIDIA A100-SXM4-80GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 67108864 bytes GPU buffer
allocated GPU buffer address at 000014ed00000000 pointer=0x14ed00000000
---------------------------------------------------------------------------------------
                    RDMA_Read Post List BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 4            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 Post List       : 4
 CQ Moderation   : 4
 Mtu             : 4096[B]
 Link type       : IB
 Outstand reads  : 16
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0x1f QPN 0x00a4 PSN 0x3fa14d
 local address: LID 0x1f QPN 0x00a5 PSN 0x3cb8ea
 local address: LID 0x1f QPN 0x00a6 PSN 0xb005b6
 local address: LID 0x1f QPN 0x00a7 PSN 0x8c1e96
 remote address: LID 0x08 QPN 0x00a3 PSN 0x95d334
 remote address: LID 0x08 QPN 0x00a4 PSN 0x430f0e
 remote address: LID 0x08 QPN 0x00a5 PSN 0x6d7e40
 remote address: LID 0x08 QPN 0x00a6 PSN 0x32e08f
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 Completion with error at client
 Failed status 11: wr_id 1 syndrom 0x89
scnt=512, ccnt=0

I'm out of ideas to try, so any help would be appreciated.

sshaulnv commented 1 year ago

Hi @lizraymond, I'm trying to reproduce your issue but without success. I'm using the same setup (a VM) with an A100, but with NVIDIA driver 535.00. Did you try a different driver version?

Also, I think that even if you are using bare metal, you still need to check the MMIO base address of the GPU; it could be outside the range available to the I/O devices.
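
For what it's worth, the GPU's BAR placement can be inspected directly (a sketch; 19:00.0 is GPU0's PCIe address from the log above, and root is needed so the /proc/iomem addresses are not masked):

lspci -s 19:00.0 -vv | grep -i region    # BAR base addresses and sizes for GPU0
grep -i nvidia /proc/iomem               # where the driver-claimed MMIO ranges sit in the physical address map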

lizraymond commented 1 year ago

This is the latest driver version that supports this generation of GPU; no other versions are available for the H100. I did try the earlier R520 driver with the A100 and it also failed.

The good news is I just tried with the latest master version of the perftest code and all issues seem to be fixed. Nothing else on this OS has been updated to my knowledge, although Ubuntu might have sneaked in a mandatory automatic security update somewhere.

Something in the 16 commits between the v4.5-0.20 release (https://github.com/linux-rdma/perftest/releases/tag/v4.5-0.20) and current master seems to have fixed it.
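
For reference, those candidate commits can be listed from a perftest checkout, e.g.:

git log --oneline v4.5-0.20..origin/master    # the commits between the v4.5-0.20 tag and current master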

lizraymond commented 1 year ago

Closing per the previous comment; my issue seems to be resolved, and I unfortunately do not have further time to debug.