Open thecaptain2000 opened 1 month ago
Large BAR support "ON" and IOMMU off — there is no way to enable Large BAR explicitly, just "Auto", and Auto is enabled. Did you blacklist the nouveau drivers?
I did not blacklist anything. I installed the NVIDIA driver from the NVIDIA-Linux-x86_64-550.67.run file, not from any deb package. To my knowledge there is no way the nouveau driver could have been installed, unless I am missing something. How would I check it? If I do:
sudo lshw -c video | grep 'configuration'
I get:
configuration: driver=nvidia latency=0
configuration: driver=nvidia latency=0
configuration: depth=32 resolution=1024,768
lspci | grep VGA
I get:
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
4a:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
I also made an attempt to blacklist the nouveau driver. I created a file, /etc/modprobe.d/blacklist-nvidia-nouveau, with two lines in it:
blacklist nouveau
options nouveau modeset=0
then ran:
sudo update-initramfs -u
then rebooted the system. Same result:
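To confirm the blacklist actually took effect after the reboot, a quick read-only check (a minimal sketch, assuming a standard Linux /proc layout):

```shell
# The blacklist only matters if nouveau was actually being loaded;
# check the currently loaded modules directly.
if grep -q '^nouveau' /proc/modules; then
  echo "nouveau is loaded"
else
  echo "nouveau is not loaded"
fi
```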
[/home/renato/cuda-samples-master/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for /home/renato/cuda-samples-master/Samples/0_Introduction/simpleP2P/simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
Is there a way to double-check whether the P2P modules are installed correctly?
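One hedged way to double-check: modinfo reports which nvidia.ko modprobe would load (and its version), while /proc/driver/nvidia/version shows what is actually loaded right now. If the two versions disagree, the wrong module is on disk or in the initramfs.

```shell
# Path and version of the nvidia module that modprobe would pick (if any)
modinfo -n nvidia 2>/dev/null || echo "nvidia module not found"
modinfo -F version nvidia 2>/dev/null || true
# Version of the module that is actually loaded right now
cat /proc/driver/nvidia/version 2>/dev/null || echo "nvidia driver not loaded"
```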
I think I see your problem. First, how did you install the NVIDIA driver from a run file? You must have been looking at the screen, so the open-source driver was installed first. When you ran the installer, did you pass --no-kernel-modules (or whatever it is)? Perhaps you installed the modules from the run file. Which is no problem; just replace them with the ones from the geohot deb package. I also noticed that when you install the modules from the deb package, they go into a different location than the modules installed by the run file or by apt. So your best solution is: run that run file again and uninstall, purging the drivers using the run file itself. Then apt purge nvidia-* etc. to remove all NVIDIA drivers, but keep the run file handy, ready to run the installer again. Now run the installer and let it install the kernel modules. Reboot and check everything works. Then find out where those files are stored, and overwrite them from the terminal, copying your modified modules over them. You can also re-run that installer script to double-check. Then you may have to regenerate the initramfs so the kernel actually loads your modules, I think, and reboot. If that does not work, simply purge everything again, also un-blacklist nouveau, uninstall the NVIDIA driver with the run file, reboot, and let the standard nouveau driver work. Because nouveau is the driver in use, you won't have any problems overwriting the NVIDIA modules, as they won't be loaded. Then ensure there are no lingering drivers in apt and that apt is not auto-installing updates. Run the installer again, this time with the no-kernel-modules flag, then install your kernel modules, and you will see nvidia-smi working.
How to Build

To build:

make modules -j$(nproc)

To install, first uninstall any existing NVIDIA kernel modules. Then, as root:

make modules_install -j$(nproc)

Note that the kernel modules built here must be used with GSP firmware and user-space NVIDIA GPU driver components from a corresponding 560.35.03 driver release. This can be achieved by installing the NVIDIA GPU driver from the .run file using the --no-kernel-modules option. E.g.,

sh ./NVIDIA-Linux-[...].run --no-kernel-modules

Supported Target CP
I can see in your nvidia-smi output that the driver is loaded, so you're almost there. Your problem is that you're using the kernel modules from the run file.
First of all, thank you for your help!
I started with Ubuntu Server 22.04, I presume with no drivers, as I selected not to install any third-party drivers. Then these are the commands I executed, in this exact order:
sudo ./NVIDIA-Linux-x86_64-550.67.run --no-kernel-modules

then I went to the open-gpu-kernel-modules-550 directory and ran:
make modules -j$(nproc)
sudo make modules_install -j$(nproc)
As I had to build the simpleP2P and nvbandwidth tools, I downloaded cuda_12.4.0_550.54.14_linux.run, again in the .run format to be in control of what was being installed. Strangely enough, it asked me if I wanted to install a different, newer driver, to which I said no.
I then compiled nvbandwidth and simpleP2P correctly, and then I went on to blacklist the nouveau driver as I mentioned before.
Is there a way for me to check that the installed modules are indeed the ones from open-gpu-kernel-modules-550? I believe I did my utmost to make sure they are the ONLY modules ever built; I did not install any .deb package which may have overwritten them. Before I retry the process, starting from the reinstallation of Ubuntu, I would like to know whether there is any verification, or change to my process, I can make so as not to end up in the same place again.
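A sketch of one such verification: list every nvidia.ko on disk (more than one copy means depmod decides which one wins), and check the license field of the module modprobe would load. To my understanding the open kernel modules report license "Dual MIT/GPL" while the proprietary ones report "NVIDIA", so this distinguishes them.

```shell
# Every nvidia kernel module on disk for the running kernel
find "/lib/modules/$(uname -r)" -name 'nvidia*.ko*' 2>/dev/null
# License of the module modprobe would load: "Dual MIT/GPL" suggests the open
# kernel modules, "NVIDIA" the proprietary ones.
modinfo -F license nvidia 2>/dev/null || echo "nvidia module not indexed"
```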
Err, I used NVIDIA driver version 550.67, not 560.35.03. There isn't an open-gpu-kernel-modules-xxx branch for driver version 560.xxx. Did you use the 560.35.03 driver with the 550 branch? I used 550.67 because the branch description says that is the driver to use:
"Note that the kernel modules built here must be used with GSP firmware and user-space NVIDIA GPU driver components from a corresponding 550.67 driver release. This can be achieved by installing the NVIDIA GPU driver from the .run file using the --no-kernel-modules option."
Everything seems to compile and install fine with 550.67, except for the fact that it does not work :)). Is the use of the 550.67 NVIDIA driver what is causing the problem?
@thecaptain2000 https://github.com/tinygrad/open-gpu-kernel-modules/releases/download/550.90.07-p2p/nvidia-kernel-source-550-open-0ubuntu1_amd64.deb — this is already pre-compiled, right? So purge your system of all drivers, then install this pre-compiled deb package, and then install the matching run file with no kernel modules and reboot. If it works, then you can compile from source if you like.
@mylesgoose, I will give it a go and let you know how it progresses. Thank you again in the meantime.
https://www.nvidia.com/download/driverResults.aspx/226768/en-us/ @thecaptain2000
drivers
@mylesgoose
So, I installed the 550.90.07 driver from the run file this way:
sudo ./NVIDIA-Linux-x86_64-550.90.07.run --no-kernel-modules
as you mentioned. Even BEFORE installing the driver, I executed:
dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb
If I execute nvidia-smi I get:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I tried rebooting the PC and it did not help.
I tried executing dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb after the driver installation as well, but the situation remained the same. I also tried executing sudo apt install dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb, with the same result.
My doubt at this point is whether I need to specify a different directory when I execute dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb.
is that the case?
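One way to answer that (a sketch, assuming dpkg is available): list exactly which paths the package installed. A nvidia-kernel-source-* package typically ships sources under /usr/src rather than prebuilt .ko files, which would explain why installing it alone does not get nvidia-smi working.

```shell
# Show where the installed package actually put its files
if command -v dpkg >/dev/null 2>&1; then
  dpkg -L nvidia-kernel-source-550-open 2>/dev/null || echo "package not installed"
else
  echo "dpkg not available on this system"
fi
```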
Well, I think you should run the installer without the no-kernel-modules flag, see where it copies the files to, and then replace them with the ones from the deb package and rebuild the initramfs. These are mine:
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/backlight
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/fbdev
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-drm.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-modeset.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-peermem.ko
modules.zip — sorry, it would not fit if I just zipped it, so unzip it and then untar.
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-uvm.ko
/usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/vgastate.ko
Hopefully you don't have Secure Boot on, as if you replace the modules with unsigned ones it won't work. So what I'm saying is: run the installer run file and actually install the kernel modules, ensure nvidia-smi is working right, and then simply replace the modules with those modified ones and rebuild the initramfs.
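The replace-and-rebuild step could be sketched like this (assumptions: you are in an open-gpu-kernel-modules checkout, where `make modules` leaves the .ko files in kernel-open/, Secure Boot is off, and you run it as root on the target box):

```shell
#!/bin/sh
# Sketch: overwrite the installed NVIDIA modules with the P2P-patched builds,
# then refresh module dependencies and the initramfs.
KDIR="/usr/lib/modules/$(uname -r)/kernel/drivers/video"
BUILD="kernel-open"  # assumed build-output directory of open-gpu-kernel-modules
if [ -d "$KDIR" ] && ls "$BUILD"/nvidia*.ko >/dev/null 2>&1; then
  cp "$BUILD"/nvidia*.ko "$KDIR"/
  depmod -a
  update-initramfs -u
else
  echo "skipping: build output or module directory not found"
fi
```

The guard keeps the script from copying anything when the build tree or module directory is missing.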
modules.zip — here are the full contents of /usr/lib/modules/6.8.0-44-generic/kernel/drivers/video
@mylesgoose I am getting somewhere. While I was waiting for your response, I performed a clean Linux install, installed 550.67, and compiled and installed the open modules. I had a hunch that the modules were actually working when I originally installed them, but that somewhere/somehow they were getting overridden.
So after the clean install of Linux + modules, I installed my Python + PyTorch environment and ran torch.zeros(70000,70000).cuda().to("cuda:1"). It took 3.9 seconds, where before it was taking something shy of 8 seconds.
The problem is, at that point I could not run simpleP2P and nvbandwidth, as I did not have them anywhere else, so I installed the CUDA toolkit (again from a .run file), asking not to install anything but the toolkit itself.
I re-ran torch.zeros(70000,70000).cuda().to("cuda:1") and boom, it was taking 8 seconds again, which means the CUDA toolkit overwrote all or part of the NVIDIA modules.
Now I just compiled and ran simpleP2P, and it tells me there is no P2P. So what I will do now is build nvbandwidth as well and save both binaries; hopefully they do not need any library to run, and I will be able to recreate the initial situation where, I suspect, the whole "toy" was running as expected with P2P enabled, before the installation of the CUDA toolkit.
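For a quick P2P sanity check without rebuilding the samples, nvidia-smi can report the link topology and P2P capability directly (hedged: this needs a working driver, and the `topo` subcommands may vary by driver version):

```shell
# P2P read-capability matrix between GPUs (r = read); requires a loaded driver
nvidia-smi topo -p2p r 2>/dev/null || echo "nvidia-smi unavailable or driver not loaded"
# Full link topology matrix for good measure
nvidia-smi topo -m 2>/dev/null || true
```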
alltoall_perf.zip — why do you want to use that old version 550.67?
When you install that deb package it does not put the modules in the correct location. If you installed CUDA and it replaced your driver, why not just reinstall the driver again, or replace the modules that it overwrote?
Well, given that once I installed the 550.90.07 driver and compiled the module it worked the first time, I would say "Because I am an idiot :))".
Thank you for helping me through this.
I was stuck on this for 2 days. How do you compile cuda-samples (simpleP2P)? It gives a LargeKernelParameter error. Thank you very much!
Are you trying to compile all of the samples or just simpleP2P? I didn't want to recompile all of them, so I just copied the simpleP2P folder to my desktop, opened a terminal inside that folder, and typed make clean and then sudo make INCLUDES="-I../../../Common -I/home/myles/cuda-samples/Common", because you're either going to be in the directory below that Common folder (with the CUDA helper headers etc.) or you just link to it.
@thecaptain2000 hey, can you try this newer version: https://github.com/mylesgoose/open-gpu-kernel-modules/tree/560.35.03-p2p — make sure you install the run file corresponding to that newer release.
Oh.. it works when I copy it outside! I had kept running "make" inside the 0_Introduction folder.
Thank you so much!
The run still ended with a failure even though P2P is enabled. Any idea?
@keithyau which NVIDIA driver did you install?
560, and then patched the tinygrad P2P update into it.
Well, I can see your problem there: you have not patched it correctly and missed a file, as shown in your screenshot.
https://github.com/mylesgoose/open-gpu-kernel-modules/tree/560.35.03-p2p @keithyau try this one, because the one you're using obviously does not have this file done: https://github.com/mylesgoose/open-gpu-kernel-modules/commit/1ca8b013afcaa3bad164c9ed2694064ea399a3c9
Thank you !
NVIDIA Open GPU Kernel Modules Version
550
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Ubuntu 22.04.5 LTS
Kernel Release
Linux ai-server 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-2fbe0316-3cc8-4b18-797e-de9975b5f814) GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-21adc1c4-fcf0-de35-d8a5-8a864de22da8)
Describe the bug
The open GPU driver installs fine. I built the modules in open-gpu-kernel-modules (I did not build the modules when I installed the server) and all seems correct. IOMMU is off; Large BAR is set to Auto (there is no way to enable it, just Auto/Disable).
nvidia-smi reports:
NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4
simpleP2P reports:
Checking GPU(s) for support of peer to peer memory access...
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
I created the modules using the open P2P software only; I did not make the modules when installing the NVIDIA driver, so I can presume they are the correct modules.
My motherboard is a TRX40 Designare with a Threadripper 3970; large BAR support is on and IOMMU is off. Is there anything else I need to enable/disable/install/uninstall, etc.?
To Reproduce
Well, I just followed the installation instructions for the kernel modules, version 550.
Bug Incidence
Always
nvidia-bug-report.log.gz
There is no bug; it just does not work.
More Info
to have P2P working? :)