Closed: lizraymond closed this issue 1 year ago
Hi @lizraymond, I'm trying to reproduce your issue, but without success. I'm using the same setup (a VM) with an A100, but with NVIDIA driver 535.00. Did you try a different driver version?
Also, I think that even on bare metal you still need to check the GPU's MMIO base address; it could fall outside the range available to I/O devices.
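One way to look at the MMIO ranges mentioned above is to read the device's `resource` file in sysfs. The sketch below is only an illustration, not something from the perftest code: it assumes the usual Linux sysfs layout where each line holds a start address, end address, and flags in hex, and the function name `parse_pci_resources` is hypothetical.

```python
IORESOURCE_MEM = 0x200  # memory-mapped region flag, as defined in the Linux kernel's ioport.h


def parse_pci_resources(text):
    """Parse the text of a sysfs PCI `resource` file into a list of regions.

    Each line is expected to be "<start> <end> <flags>" in hex. Lines that
    are all zero describe unused slots and are skipped.
    """
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore anything that does not look like a resource line
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused slot
        regions.append(
            {
                "index": index,                      # line position in the file
                "start": start,                      # base address of the region
                "end": end,                          # last address (inclusive)
                "size": end - start + 1,             # region size in bytes
                "is_mmio": bool(flags & IORESOURCE_MEM),
            }
        )
    return regions
```

On a real host this could be fed with the contents of `/sys/bus/pci/devices/<bdf>/resource`, where `<bdf>` is the GPU's PCI address. Whether a given base address is acceptable depends on the platform, so the sketch only reports the ranges and leaves that judgment to the reader.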
This is the latest driver version that supports this level of GPU. No other versions are available for H100. I did try the earlier R520 driver with an A100 and it also failed.
The good news is I just tried with the latest master version of the perftest code and all issues seem to be fixed. Nothing else on this OS has been updated to my knowledge, although Ubuntu might have sneaked in a mandatory automatic security update somewhere.
Something in the 16 commits since https://github.com/linux-rdma/perftest/releases/tag/v4.5-0.20 seems to have fixed the issues.
Closing per my previous comment: my issue seems to be resolved, and unfortunately I do not have further time to debug.
Hi all,
I am trying to run the RDMA perftest tools (specifically ib_write_bw and ib_read_bw) between two identical nodes but keep encountering errors. The client dies, but I have to manually cancel the server.
Server configuration:
I am using GPU0 & mlx5_0, which are on the same PCIe switch and therefore the same numa node.
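The NUMA placement described above can be cross-checked from sysfs. This is a minimal sketch under assumptions: it relies on the standard `numa_node` attribute that Linux exposes for PCI devices, the helper names are hypothetical, and the GPU's PCI address has to be supplied by hand (the `0000:07:00.0` in the usage note is a made-up example). It only compares NUMA nodes; it does not confirm that both devices sit behind the same PCIe switch.

```python
from pathlib import Path


def read_numa_node(device_dir):
    """Return the NUMA node stored in <device_dir>/numa_node (-1 means unknown)."""
    return int(Path(device_dir, "numa_node").read_text().strip())


def same_numa_node(hca_name, gpu_bdf, sysfs_root="/sys"):
    """Compare the NUMA node of an RDMA device and a GPU.

    hca_name is the RDMA device name (e.g. "mlx5_0"); gpu_bdf is the GPU's
    PCI address. Returns (match, hca_node, gpu_node), where match is True
    only if both nodes are known and equal.
    """
    hca_node = read_numa_node(Path(sysfs_root, "class/infiniband", hca_name, "device"))
    gpu_node = read_numa_node(Path(sysfs_root, "bus/pci/devices", gpu_bdf))
    return (hca_node == gpu_node and hca_node >= 0), hca_node, gpu_node
```

For example, `same_numa_node("mlx5_0", "0000:07:00.0")` would read both attributes from the live `/sys` tree; the `sysfs_root` parameter exists only so the function can be pointed at a copy of the tree.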
I found the issues where someone fixed their problem with MMIO, but I am on a bare-metal platform and running as root, so the process should be able to access anything it likes. I also set the ulimit max memory size to unlimited. I also found the issue where the IOMMU was at fault, but I have disabled the IOMMU on the GRUB kernel command line and turned off all SR-IOV and virtualization functions in the system BIOS, and none of it helped.
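To confirm that the IOMMU setting from GRUB actually reached the running kernel, the boot command line can be inspected. The sketch below is an assumption-laden illustration: it looks for the `iommu`, `intel_iommu`, and `amd_iommu` kernel parameters, which may or may not be the ones used on this system, and the function name `iommu_params` is hypothetical.

```python
def iommu_params(cmdline):
    """Extract IOMMU-related key=value parameters from a kernel command line.

    cmdline is the text of /proc/cmdline. Returns a dict such as
    {"intel_iommu": "off"}; an empty dict means none of the known
    parameters were passed, so the kernel default applies.
    """
    wanted = ("iommu", "intel_iommu", "amd_iommu")
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            found[key] = value  # a later duplicate overrides an earlier one
    return found
```

On a live system this would be called as `iommu_params(open("/proc/cmdline").read())`. It only reports what was requested at boot; whether the IOMMU is in fact inactive still has to be verified separately, for example in the kernel log.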
rping & ibv_rc_pingpong work fine, and ping can hit the IPoIB addresses with no issue. MTU is set to the max of 4096 by the SM, which supports IPoIB.
Mellanox Info:
Error 4 Info:
Error 11 Info:
I'm out of ideas to try, so any help would be appreciated.