linux-rdma / perftest

Infiniband Verbs Performance Tests

one direction bandwidth testing fail with GPUdirect #289

Open ilovesouthpark opened 2 months ago

ilovesouthpark commented 2 months ago

Hello,

I am testing two P100 GPUs in two nodes with two cx555 NICs. The test succeeds in one direction but fails in the other.

Success:

    ./ib_write_bw --use_cuda=0 -a 10.10.10.11
    ./ib_write_bw -d mlx5_0 --use_cuda=0 -a

Fail:

    ./ib_write_bw --use_cuda=0 -a
    ethernet_read_keys: Couldn't read remote address
    Unable to read to socket/rdma_cm
    Failed to exchange data between server and clients

    ./ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.10.10.10
    Completion with error at client
    Failed status 4: wr_id 0 syndrom 0x51
    scnt=128, ccnt=0
    Failed to complete run_iter_bw function successfully
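The `Failed status 4` in the client output is the numeric work-completion status reported through libibverbs. A minimal sketch for decoding it, assuming the declaration order of `enum ibv_wc_status` in `<infiniband/verbs.h>` (only the first entries are reproduced; `wc_status_name` and `parse_completion_error` are illustrative helpers, not part of perftest):

```python
import re

# First entries of `enum ibv_wc_status`, in declaration order.
# Truncated on purpose: later codes are reported as unknown.
WC_STATUS_NAMES = [
    "IBV_WC_SUCCESS",            # 0
    "IBV_WC_LOC_LEN_ERR",        # 1
    "IBV_WC_LOC_QP_OP_ERR",      # 2
    "IBV_WC_LOC_EEC_OP_ERR",     # 3
    "IBV_WC_LOC_PROT_ERR",       # 4
    "IBV_WC_WR_FLUSH_ERR",       # 5
    "IBV_WC_MW_BIND_ERR",        # 6
    "IBV_WC_BAD_RESP_ERR",       # 7
    "IBV_WC_LOC_ACCESS_ERR",     # 8
    "IBV_WC_REM_INV_REQ_ERR",    # 9
    "IBV_WC_REM_ACCESS_ERR",     # 10
    "IBV_WC_REM_OP_ERR",         # 11
    "IBV_WC_RETRY_EXC_ERR",      # 12
    "IBV_WC_RNR_RETRY_EXC_ERR",  # 13
]


def wc_status_name(code):
    """Return the enum name for a numeric completion status."""
    if 0 <= code < len(WC_STATUS_NAMES):
        return WC_STATUS_NAMES[code]
    return f"unknown ({code})"


# Matches the error line quoted in this issue ("syndrom" is the tool's spelling).
_ERR_RE = re.compile(r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)")


def parse_completion_error(text):
    """Extract (status, wr_id, vendor syndrome) from the error line, or None."""
    match = _ERR_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), int(match.group(3), 16)


status, wr_id, syndrome = parse_completion_error("Failed status 4: wr_id 0 syndrom 0x51")
print(wc_status_name(status), wr_id, hex(syndrome))
# prints: IBV_WC_LOC_PROT_ERR 0 0x51
```

`IBV_WC_LOC_PROT_ERR` is a local protection error: the buffers of the locally posted work request do not reference a memory region that is valid for the requested operation. Because it is reported at the client, it may be worth checking first whether the NIC on the client node of the failing direction (the node connecting to 10.10.10.10) can access that node's GPU memory. The vendor syndrome `0x51` is not decoded here.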

Bandwidth tests between the two cx555 NICs themselves work well.

Driver and kernel:
Both cx555 NICs have the same driver and firmware.
Both P100s have the same driver but different VBIOS versions.
I am not using the NVIDIA open source kernel driver, since the P100 is not supported by it, but I do not think the kernel driver is the problem; otherwise one direction would not work either.

IOMMU:

10.10.10.11

    sudo dmesg | grep -i dmar
    [ 0.173076] DMAR: IOMMU disabled
    sudo dmesg | grep -i iommu
    [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
    [ 0.173010] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
    [ 0.173076] DMAR: IOMMU disabled
    [ 2.245922] iommu: Default domain type: Translated
    [ 2.245922] iommu: DMA domain TLB invalidation policy: lazy mode

10.10.10.10

    sudo dmesg | grep -i dmar
    No output
    sudo dmesg | grep -i iommu
    [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
    [ 0.030879] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
    [ 1.861879] iommu: Default domain type: Translated
    [ 1.861879] iommu: DMA domain TLB invalidation policy: lazy mode

I have turned the IOMMU off on the kernel command line of both nodes, but the outputs are different.
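One way to compare the two nodes is to pull only the IOMMU-related parameters out of each kernel command line. A minimal sketch, using the two command lines quoted above as input; `iommu_params` is an illustrative helper, and the set of parameter names it looks for is an assumption about what is relevant here:

```python
# Kernel parameters treated as IOMMU-related for this comparison (assumed set).
IOMMU_KEYS = ("intel_iommu", "amd_iommu", "iommu")


def iommu_params(cmdline):
    """Return the IOMMU-related key=value pairs found on a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in IOMMU_KEYS:
            found[key] = value
    return found


# Command lines copied from the dmesg output in this issue.
NODE_11 = ("BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic "
           "root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 "
           "ro intel_iommu=off quiet splash vt.handoff=7")
NODE_10 = ("BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic "
           "root=UUID=6e849e25-4931-4c06-8684-bb553962f200 "
           "ro amd_iommu=off quiet splash vt.handoff=7")

print("10.10.10.11:", iommu_params(NODE_11))
print("10.10.10.10:", iommu_params(NODE_10))
# prints:
# 10.10.10.11: {'intel_iommu': 'off'}
# 10.10.10.10: {'amd_iommu': 'off'}
```

The two nodes pass different vendor parameters. `DMAR:` lines come from the Intel IOMMU driver, while the AMD driver logs with an `AMD-Vi:` prefix, so an empty `grep -i dmar` on 10.10.10.10 would be expected if that node is an AMD platform and does not by itself point to a misconfiguration.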

What could be the cause of this issue, and how can I dig deeper to find the cause and a solution?

Thanks

sshaulnv commented 2 months ago

This looks like an issue we encountered. It may be related to the MMIO base setting in the system BIOS of the HV. Please try this solution: https://www.dell.com/support/manuals/en-il/vmware-esxi-6.5.x/esxi6.5.x_rn_pub/virtual-machines-fail-to-power-on-when-system-bios-has-mmio-set-to-56-tb-with-supported-gpu-config?guid=guid-ab3ea7a8-b8ca-481a-b6e2-d83ab989dac5

ilovesouthpark commented 2 months ago

Thanks, I have read this post and looked for the corresponding setting in my BIOS (Z690 mainboard). I found one 4GB MMIO option. By default it is linked to Resize BAR, and I can only disable it if I also disable Resize BAR. I tried that, but it did not help: the failing direction still does not work, while the other direction does. I hope someone else can share a solution or give some insights. Thanks anyway.
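To see whether a BIOS change like this actually alters what the GPU exposes, one option is to compare the PCI region sizes on both nodes before and after the change. A minimal sketch, assuming the Linux sysfs layout in which `/sys/bus/pci/devices/<address>/resource` holds one `start end flags` hex triple per line; `bar_sizes` is an illustrative helper, and the sample values below are made up, not taken from either node:

```python
def bar_sizes(resource_text):
    """Map region index -> size in bytes for each populated line of a
    sysfs PCI `resource` file (one 'start end flags' hex triple per line)."""
    sizes = {}
    for index, row in enumerate(resource_text.splitlines()):
        fields = row.split()
        if len(fields) != 3:
            continue  # skip malformed or empty lines
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused region
        sizes[index] = end - start + 1
    return sizes


# Hypothetical contents of a `resource` file, for illustration only.
# On a real node the text would be read from
# /sys/bus/pci/devices/<address>/resource for the GPU's PCI address.
SAMPLE = """\
0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200
0x0000003800000000 0x0000003bffffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

for index, size in bar_sizes(SAMPLE).items():
    print(f"region {index}: {size / 2**20:.0f} MiB")
# prints:
# region 0: 16 MiB
# region 1: 16384 MiB
```

If the region sizes or address ranges differ between the working and the failing node, that difference may be a useful starting point; identical output would suggest looking elsewhere.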