tinygrad / open-gpu-kernel-modules

NVIDIA Linux open GPU with P2P support

Software seems installed OK, but no P2P #21

Open thecaptain2000 opened 5 days ago

thecaptain2000 commented 5 days ago

NVIDIA Open GPU Kernel Modules Version

550

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Ubuntu 22.04.5 LTS

Kernel Release

Linux ai-server 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-2fbe0316-3cc8-4b18-797e-de9975b5f814)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-21adc1c4-fcf0-de35-d8a5-8a864de22da8)

Describe the bug

The open GPU modules install fine: I built and installed the modules from the open-gpu-kernel-modules source (I did not build the modules when I installed the server), and all seems correct. The IOMMU is off and Large BAR is set to auto (there is no way to enable it explicitly, just auto/disabled).

nvidia-smi reports:

NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4

simpleP2P reports:

checking GPU(s) for support of peer to peer memory access...

Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No

Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No

Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.

Peer to Peer access is not available amongst GPUs in the system, waiving test.

I built the modules from the open P2P source only; I did not build any modules when installing the NVIDIA driver, so I presume they are the correct modules.

My motherboard is a TRX40 Designare with a Threadripper 3970, with large BAR support and IOMMU off. Is there anything else I need to enable, disable, install, uninstall, etc.?

To Reproduce

Well, I just followed the installation instructions for the 550 version of the kernel modules.

Bug Incidence

Always

nvidia-bug-report.log.gz

there is no bug, it just does not work

More Info

to have P2P working? :)

mylesgoose commented 5 days ago

Large BAR support "on" and IOMMU off. If there is no way to enable it explicitly, only auto, then auto is enabled. Did you blacklist the nouveau driver?

thecaptain2000 commented 5 days ago

I did not blacklist anything. I installed the NVIDIA driver from the NVIDIA-Linux-x86_64-550.67.run file, not from any deb package. To my knowledge there is no way the nouveau driver could be installed, unless I am missing something. How would I check it? If I run:

sudo lshw -c video | grep 'configuration'

I get:

configuration: driver=nvidia latency=0
configuration: driver=nvidia latency=0
configuration: depth=32 resolution=1024,768

lspci | grep VGA

I get:

21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
4a:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)

Screenshot 2024-10-16 at 22 11 35
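One way to answer "How would I check it?" is to look at the kernel's list of loaded modules rather than at installed packages. The following is only a minimal sketch, assuming a Linux system where /proc/modules is readable; the helper names `loaded_modules` and `report` are made up for illustration and are not part of any NVIDIA tool.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each non-empty line starts with the module name followed by whitespace.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def report(proc_modules_path="/proc/modules"):
    """Print whether nouveau and the NVIDIA modules are currently loaded."""
    names = loaded_modules(Path(proc_modules_path).read_text())
    # Module names use underscores in /proc/modules, even if the file is nvidia-uvm.ko.
    for module in ("nouveau", "nvidia", "nvidia_uvm", "nvidia_modeset", "nvidia_drm"):
        state = "loaded" if module in names else "not loaded"
        print(f"{module}: {state}")
```

Calling `report()` on the affected machine would show at a glance whether nouveau is loaded at all. It does not say which build of the nvidia module is in use.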

thecaptain2000 commented 5 days ago

I also made an attempt to blacklist the nouveau driver. I created a file, /etc/modprobe.d/blacklist-nvidia-nouveau, with two lines in it:

blacklist nouveau
options nouveau modeset=0

then ran:

sudo update-initramfs -u

then rebooted the system. Same result:

[/home/renato/cuda-samples-master/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...

Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for /home/renato/cuda-samples-master/Samples/0_Introduction/simpleP2P/simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

Is there a way to double-check whether the P2P modules are installed correctly?
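A partial check is to ask the loaded driver what it is. The sketch below assumes that /proc/driver/nvidia/version exists while the nvidia module is loaded, and that its "NVRM version:" line contains the words "Open Kernel Module" for the open modules; both points are assumptions about the file format, and the function names are hypothetical. It can tell an open build from the proprietary one, but it cannot tell the P2P-patched open modules from stock open modules of the same version.

```python
import re

# A version-like token such as 550.67 or 550.90.07.
VERSION_RE = re.compile(r"\b(\d{3}\.\d+(?:\.\d+)?)\b")


def describe_nvrm_line(line):
    """Return (flavour, version) parsed from an 'NVRM version:' line.

    flavour is "open" when the line mentions "Open Kernel Module" and
    "proprietary" otherwise; version is None if no version-like token is found.
    """
    flavour = "open" if "Open Kernel Module" in line else "proprietary"
    match = VERSION_RE.search(line)
    return flavour, (match.group(1) if match else None)


def describe_loaded_module(path="/proc/driver/nvidia/version"):
    """Read the driver's version file and describe the loaded nvidia module."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("NVRM version:"):
                return describe_nvrm_line(line)
    return None
```

If `describe_loaded_module()` reports a proprietary build, the modules compiled from this repository are not the ones the kernel loaded.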

mylesgoose commented 5 days ago

I think I see your problem. First, how did you install the NVIDIA driver from a run file? You must have been looking at a screen, so the open-source driver was installed first. When you ran the run file, did you pass --no-kernel-modules, or whatever the flag is? Perhaps you installed the modules from the run file. That is no problem: just replace them with the ones from the geohot deb package. I also noticed that the deb package installs the modules to a different location than the run file or apt does.

So your best solution is to run that run file and uninstall: purge the driver using the run file itself, then run apt purge nvidia-* and so on to remove all NVIDIA drivers. Keep the NVIDIA installer handy so you can run it again. Now run the installer and let it install the modules for the kernel. Reboot and check that everything works. Then find out where those module files are stored and overwrite them with your modified ones from the terminal. You can also run the installer script again to double-check. Then, I think, you have to regenerate the initramfs so the kernel actually uses your modules, and reboot.

If that does not work, purge everything again, remove the nouveau blacklist, uninstall the NVIDIA driver with the run file, reboot, and let the standard nouveau driver take over. Because nouveau is in use, you won't have any problems overwriting the NVIDIA modules, as they won't be loaded. Then make sure there are no lingering drivers from apt and that apt is not auto-installing updates. Run the installer, this time with the no-kernel-modules flag, then install your kernel modules, and you should see nvidia-smi working.

mylesgoose commented 5 days ago

How to Build

To build:

make modules -j$(nproc)

To install, first uninstall any existing NVIDIA kernel modules. Then, as root:

make modules_install -j$(nproc)

Note that the kernel modules built here must be used with GSP firmware and user-space NVIDIA GPU driver components from a corresponding 560.35.03 driver release. This can be achieved by installing the NVIDIA GPU driver from the .run file using the --no-kernel-modules option. E.g.,

sh ./NVIDIA-Linux-[...].run --no-kernel-modules

Supported Target CP

mylesgoose commented 5 days ago

I can see in your nvidia-smi output that the driver is loaded, so you are almost there. Your problem is that you're using the kernel modules from the run file.

thecaptain2000 commented 4 days ago

First of all, thank you for your help!

I started with Ubuntu Server 22.04, I presume with no drivers, as I selected not to install any third-party drivers. These are the commands I then executed, in this exact order:

sudo ./NVIDIA-Linux-x86_64-550.67.run --no-kernel-modules

Then I went to the open-gpu-kernel-modules-550 directory and ran:

make modules -j$(nproc)
sudo make modules_install -j$(nproc)

As I had to build the simpleP2P and nvbandwidth tools, I downloaded cuda_12.4.0_550.54.14_linux.run, again in the ".run" format to be in control of what was being installed. Strangely enough, it asked me if I wanted to install a different and newer driver, to which I said no.

I then compiled nvbandwidth and simpleP2P successfully and went on to blacklist the nouveau driver as I mentioned before.

Is there a way for me to check that the installed modules are indeed the ones from open-gpu-kernel-modules-550? I believe I did my utmost to make sure they are the ONLY modules ever built. I did not install any .deb package which may have overwritten those modules. Before I retry the process, starting from a reinstallation of Ubuntu, I would like to know whether there is any verification or change to my process I can make so that I do not end up in the same place again.

thecaptain2000 commented 4 days ago

Err, I used NVIDIA driver version 550.67, not 560.35.03. There isn't an open-gpu-kernel-modules-xxx branch for driver version 560xxx. Did you use the 560.35.03 driver with the 550 branch? I used 550.67 because the branch description says that is the driver to use:

"Note that the kernel modules built here must be used with GSP firmware and user-space NVIDIA GPU driver components from a corresponding 550.67 driver release. This can be achieved by installing the NVIDIA GPU driver from the .run file using the --no-kernel-modules option."

Everything seems to compile and install fine with 550.67, except for the fact that it does not work :)). Is the use of the 550.67 NVIDIA driver what is causing the problem?
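Since the question comes down to whether the user-space driver and the release the modules expect are the same, the comparison can be written down explicitly. This is a small sketch, assuming the banner line has the form quoted earlier in this thread ("NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4"); the function names are hypothetical, and the expected release has to be taken by hand from the README of whichever branch or tag was built.

```python
import re


def driver_version_from_smi(banner):
    """Extract the 'Driver Version' value from an nvidia-smi banner line."""
    match = re.search(r"Driver Version:\s*([0-9.]+)", banner)
    return match.group(1) if match else None


def matches_release(banner, expected_release):
    """True when the user-space driver version equals the release the modules expect."""
    return driver_version_from_smi(banner) == expected_release
```

For example, `matches_release(banner, "550.90.07")` would be False for the banner above. A matching version alone does not prove that P2P is enabled.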

mylesgoose commented 4 days ago

@thecaptain2000 https://github.com/tinygrad/open-gpu-kernel-modules/releases/download/550.90.07-p2p/nvidia-kernel-source-550-open-0ubuntu1_amd64.deb This one is already precompiled. So purge your system of all drivers, then install this precompiled deb package, then install the matching run file with no kernel modules and reboot. If it works, you can compile from source afterwards if you like.

thecaptain2000 commented 4 days ago

@mylesgoose, I will give it a go. I will let you know of the progress. Thank you again in the meantime

mylesgoose commented 4 days ago

https://www.nvidia.com/download/driverResults.aspx/226768/en-us/ @thecaptain2000

thecaptain2000 commented 4 days ago

drivers

@mylesgoose

So, I installed the 550.90.07 driver from the run file this way:

sudo ./NVIDIA-Linux-x86_64-550.90.07.run --no-kernel-modules

as you mentioned. Even BEFORE installing the driver, I executed:

dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb

If I execute nvidia-smi I get:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I tried rebooting the PC and it did not help.

I also tried executing dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb after the driver installation, but the situation remained the same. I also tried executing sudo apt install dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb. Same result.

My suspicion at this point is that I need to specify a different directory when I execute dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb.

Is that the case?

mylesgoose commented 4 days ago

Well, I think you should run the installer without the --no-kernel-modules option. See where it copies the files to, then replace them with the ones from the deb package and rebuild the initramfs. These are mine:

/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/backlight
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/fbdev
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-drm.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-modeset.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-peermem.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-uvm.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/vgastate.ko

modules.zip (sorry, it would not fit if I just zipped it, so unzip and then untar)

Hopefully you don't have Secure Boot on, because if you replace the modules with unsigned ones it won't work. So what I'm saying is: run the installer run file and actually install the kernel modules, make sure nvidia-smi is working right, then simply replace the installed modules with the modified ones and rebuild the initramfs.

mylesgoose commented 4 days ago

modules.zip: here is the full contents of the video directory, /usr/lib/modules/6.8.0-44-generic/kernel/drivers/video
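Because the run file, the deb package and make modules_install may each put the modules in a different place, it can help to list every copy under the running kernel's module tree before overwriting anything. The sketch below is only an illustration: the function names are hypothetical, the set of module names is taken from the listing above, and the compressed suffixes are an assumption to cover distributions that compress modules.

```python
import os

MODULE_SUFFIXES = (".ko", ".ko.gz", ".ko.xz", ".ko.zst")
NVIDIA_MODULES = {"nvidia", "nvidia-drm", "nvidia-modeset", "nvidia-peermem", "nvidia-uvm"}


def module_name(filename):
    """Strip a kernel-module suffix from filename; return None if it has none."""
    for suffix in MODULE_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return None


def find_nvidia_modules(modules_root):
    """Return sorted paths of NVIDIA module files found anywhere under modules_root."""
    found = []
    for directory, _subdirs, files in os.walk(modules_root):
        for filename in files:
            if module_name(filename) in NVIDIA_MODULES:
                found.append(os.path.join(directory, filename))
    return sorted(found)


def duplicated_modules(paths):
    """Map module name -> list of paths for modules present in more than one place."""
    by_name = {}
    for path in paths:
        by_name.setdefault(module_name(os.path.basename(path)), []).append(path)
    return {name: where for name, where in by_name.items() if len(where) > 1}
```

A typical call would be `find_nvidia_modules(os.path.join("/lib/modules", os.uname().release))`. If `duplicated_modules` returns anything, two installers have left competing copies; this sketch does not decide which one the kernel loads, although `modinfo -n nvidia` should print the file that modprobe would pick.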

thecaptain2000 commented 4 days ago

@mylesgoose I am getting somewhere. While I was waiting for your response, I performed a clean Linux install, installed 550.67, and compiled and installed the open modules. I had a hunch that the modules were actually working when I originally installed them, but that somewhere, somehow, they were getting overwritten.

So after the clean install of Linux + modules, I installed my Python + PyTorch environment and ran torch.zeros(70000,70000).cuda().to("cuda:1"). It took 3.9 seconds, where before it was taking just under 8 seconds.

The problem is that at that point I could not run simpleP2P and nvbandwidth, as I did not have them anywhere else, so I installed the CUDA toolkit (again from a .run file), asking it not to install anything but the CUDA toolkit.

I re-ran torch.zeros(70000,70000).cuda().to("cuda:1") and boom, it was taking 8 seconds again, which means the CUDA toolkit overwrote all or part of the NVIDIA modules.

Now I have just compiled and run simpleP2P, and it tells me there is no P2P. So what I will do now is also build nvbandwidth and save both binaries; hopefully they do not need any library to run, and I will be able to recreate the initial situation where, I suspect, the whole "toy" was running as expected with P2P enabled before the installation of the CUDA toolkit.
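Timing a copy is an indirect signal. With PyTorch already installed, peer access can also be queried directly, without the CUDA toolkit or simpleP2P. A minimal sketch, assuming a PyTorch build with CUDA support; `torch.cuda.device_count` and `torch.cuda.can_device_access_peer` are existing PyTorch calls, while `peer_access_matrix` and `report_with_torch` are names invented here:

```python
def peer_access_matrix(device_count, can_access):
    """Return {(src, dst): bool} for every ordered pair of distinct devices."""
    return {
        (src, dst): bool(can_access(src, dst))
        for src in range(device_count)
        for dst in range(device_count)
        if src != dst
    }


def report_with_torch():
    """Print peer access for every GPU pair as seen by PyTorch."""
    import torch  # imported here so the helper above works without PyTorch

    matrix = peer_access_matrix(
        torch.cuda.device_count(), torch.cuda.can_device_access_peer
    )
    for (src, dst), ok in sorted(matrix.items()):
        print(f"GPU{src} -> GPU{dst}: {'Yes' if ok else 'No'}")
```

Running `report_with_torch()` before and after installing the CUDA toolkit would show whether the installer changed the peer-access result, independently of the copy timing.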

mylesgoose commented 4 days ago

alltoall_perf.zip

Why do you want to use that old version, 550.67?

mylesgoose commented 4 days ago

When you install that deb package, it does not put the modules in the correct location. If you installed CUDA and it replaced your driver, why not just reinstall the driver, or replace the modules that it replaced?

thecaptain2000 commented 4 days ago

> why do you want to use that old version 550.67

Well, given that it worked the first time once I installed the 550.90.07 driver and compiled the module, I would say "Because I am an idiot :))"

Screenshot 2024-10-17 at 16 52 28
Screenshot 2024-10-17 at 16 52 53

thecaptain2000 commented 4 days ago

Thank you for helping me through this.

keithyau commented 3 days ago

This had me stuck for 2 days. How do you compile cuda-samples (simpleP2P)? It fails with a LargeKernelParameter error. Thank you very much!

mylesgoose commented 3 days ago

Are you trying to compile all of the samples or just simpleP2P? I didn't want to recompile all of them, so I just copied the simpleP2P folder to my desktop, opened a terminal inside that folder, and typed make clean and then sudo make INCLUDES="-I../../../Common -I/home/myles/cuda-samples/Common". You are either going to be in a directory below the Common folder, which holds the CUDA helper headers, or you just point to it with INCLUDES.

mylesgoose commented 2 days ago

@thecaptain2000 Hey, can you try this newer version? https://github.com/mylesgoose/open-gpu-kernel-modules/archive/refs/tags/550.90.07-p2p.zip Make sure you install the run file corresponding to that newer release.

keithyau commented 2 days ago

Oh shit!! It works when I copy it outside! I kept running "make" inside the 0_Introduction folder.

Thank you so much!

image

keithyau commented 2 days ago

It ended with a failure even though P2P is enabled. Any idea?

image

mylesgoose commented 2 days ago

@keithyau which NVIDIA driver did you install?