Closed hassanbabaie closed 1 week ago
Can you please check whether the GPU driver is installed as the open kernel module? Run modinfo nvidia and check the 'license' field (it should be 'Dual MIT/GPL').
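The check described above can be sketched in Python as follows. This is only an illustration, not part of any NVIDIA tooling: the helper names parse_modinfo, is_open_kernel_module and read_modinfo are hypothetical, and the sketch assumes modinfo prints one "field: value" pair per line.

```python
import subprocess


def parse_modinfo(text):
    """Parse `modinfo` output into a dict of field -> value.

    Repeated fields (for example several `firmware:` lines) keep only
    the last value, which is enough for a license check.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a "field:" prefix
            fields[key.strip()] = value.strip()
    return fields


def is_open_kernel_module(text):
    """Return True if the license field reads 'Dual MIT/GPL'."""
    return parse_modinfo(text).get("license") == "Dual MIT/GPL"


def read_modinfo(module="nvidia"):
    """Run `modinfo <module>` on the host and return its stdout."""
    result = subprocess.run(
        ["modinfo", module], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a host with the driver installed, is_open_kernel_module(read_modinfo()) would report whether the license matches the value expected for the open driver.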
@sshaulnv yes, it is the open driver.
However, we are retesting, and I will post an update here very soon, along with the output of modinfo nvidia.
Hi @sshaulnv, yes, the output on the host is the following:
modinfo nvidia
filename: /lib/modules/5.14.0-284.30.1.el9_2.x86_64/extra/nvidia.ko.xz
firmware: nvidia/535.183.06/gsp_tu10x.bin
firmware: nvidia/535.183.06/gsp_ga10x.bin
import_ns: DMA_BUF
alias: char-major-195-*
version: 535.183.06
supported: external
license: Dual MIT/GPL
rhelversion: 9.2
We disabled SELinux on the host, but no luck.
Is there any way to get a more verbose error?
Everything I have checked says that this should work.
@hassanbabaie were you able to proceed? Which GPU are you using? Can you check whether the GPU driver is loaded? Does this work with nvidia_peermem?
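One way to answer the "is the driver loaded" question is to look at /proc/modules, which lists each loaded kernel module with its name as the first whitespace-separated field. The sketch below is illustrative only; the helper names loaded_modules and read_proc_modules are hypothetical.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()  # ignore blank lines
    }


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of loaded modules (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On the host, "nvidia" in loaded_modules(read_proc_modules()) would indicate that the GPU driver module is loaded, and the same test with "nvidia_peermem" would show whether the peer-memory module is present.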
Yes @alokprasad, we were able to resolve the issue; it was related to a package with a conflicting driver.
Hi, I'm testing RDMA via RoCEv2 connectivity and we're using dma-buf instead of nv-peer-mem. It's failing, but I'm unsure why or what the fix is.
I set up the test on Ubuntu Kubernetes Pods (based on the Nvidia NGC image) and installed:
Then I ran the following command between them:
The output I got was:
Cuda info:
Note that the non-RDMA/CUDA one works fine.
Any thoughts or ideas would be appreciated.