pop-os / nvidia-graphics-drivers

Pop!_OS NVIDIA Graphics Drivers

Oryp7 Nvidia GPU issues "RmInitAdapter failed!" #113

Closed bflanagin closed 5 months ago

bflanagin commented 2 years ago

Based on issues reported by support and on internal testing, the discrete video card is failing on the oryp7 and the system is reverting to the integrated video card. When this happens, system76-driver still reports that the system is in nvidia mode.

Dmesg reports:

NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x56:643)
NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

The issue occurs randomly after reboot or power cycle and can be remedied the same way.

The issue can be replicated on Pop 20.10 and 21.04, as well as on Ubuntu 20.10 and Windows 10.
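For anyone trying to confirm they are hitting the same thing, here is a minimal sketch of two checks, assuming a POSIX shell. The function names are made up for illustration and are not part of any package. Both read text on stdin, so they can be fed from `dmesg` and `lspci -k` or from a saved log.

```shell
# Hypothetical helpers (illustrative names, not from system76-driver or NVIDIA).

# Count "RmInitAdapter failed" lines in a kernel log read from stdin.
# grep -c exits non-zero when the count is 0, so the status is forced to success.
# Example: sudo dmesg | rminit_failure_count
rminit_failure_count() {
    grep -c 'RmInitAdapter failed' || true
}

# Print the kernel driver bound to a device, given `lspci -k` output on stdin.
# Prints nothing if no driver is bound.
# Example: lspci -k -s 01:00.0 | bound_driver
bound_driver() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p'
}
```

A non-zero count together with a bound driver other than `nvidia` (or none at all) would match the symptom described above, where the reported mode and the actual state disagree.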

arnaudsj commented 2 years ago

I second that. It has been a problem since Nvidia driver 465 for me. On my end, I also get loud fan noise when the nvidia discrete GPU is not detected. I can confirm that it only happens in NVIDIA mode (not hybrid mode). A typical reboot does not always fix it for me; however, I have found that delaying the boot by a few seconds (by pressing Esc to get the boot menu) solves the problem. So it appears to be a timing issue (the Nvidia card not initializing fast enough?).

leviport commented 2 years ago

Well I can't seem to make it happen with https://github.com/pop-os/nvidia-graphics-drivers/pull/114 on 20.04. I'll keep trying, but this is looking promising.

mitchelljohnmartel1 commented 2 years ago

oryp7 with a 3070 dmesg.txt

bflanagin commented 2 years ago

Here are the logs created by nvidia-bug-report.

The line that includes "Failed to allocate NvKmsKapiDevice" may be relevant to the issue as it only appears in the not_working logs.

oryp7-3070-nvidia-bug-report.not_working.log.gz oryp7-3070-nvidia-bug-report.log.gz

cstrahan-blueshift commented 2 years ago

Is there a workaround for this?

$ sudo dmesg | grep nvidia
[sudo] password for cstrahan: 
[   13.658667] nvidia: module license 'NVIDIA' taints kernel.
[   13.705633] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[   13.708498] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[   13.778779] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.86  Tue Oct 26 21:46:51 UTC 2021
[   13.799315] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   15.264593] nvidia 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[   15.265282] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[   15.265579] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[   15.272756] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   15.275985] nvidia-uvm: Loaded the UVM driver, major device number 506.
[   16.865132] audit: type=1400 audit(1644425595.427:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1030 comm="apparmor_parser"
[   16.865139] audit: type=1400 audit(1644425595.427:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1030 comm="apparmor_parser"

$ sudo dmesg | grep NVRM
[   13.754980] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.86  Tue Oct 26 21:55:45 UTC 2021
[   15.265145] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   15.265221] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   30.103488] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   30.103565] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   30.112983] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   30.113054] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   38.746389] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   38.746444] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   38.746717] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   38.746761] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

cstrahan-blueshift commented 2 years ago

I saw this comment on the Nvidia forums: https://forums.developer.nvidia.com/t/bug-470-42-01-1-dgpu-can-not-be-initialized/183627/3

Same here with 470.57.02. Happened after a PopOS update. Intel graphics only and non functioning HDMI. Same computer (4k OLED version). First I thought the NVIDIA chip was broken but put back the original Windows 10 SSD and everything was running fine.

Uninstalled PopOS NVIDIA drivers:

sudo apt remove nvidia*

Installed version 465.31, downloaded from nvidia.com:

chmod +x NVIDIA-Linux-x86_64-465.31.run
sudo ./NVIDIA-Linux-x86_64-465.31.run

And after a reboot PopOS is running perfectly with NVIDIA GTX 1650 Max-Q graphics again.

Going to try downgrading to 465.31, downloaded from here: https://download.nvidia.com/XFree86/Linux-x86_64/465.31/

cstrahan-blueshift commented 2 years ago

Instead of downgrading to 465.31, I've decided to try upgrading to 495.46:

sudo apt remove 'nvidia-*'
chmod +x NVIDIA-Linux-x86_64-*
sudo ./NVIDIA-Linux-x86_64-495.46.run

Restarted, and now I see this:

[   20.347610] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[   20.381479] NVRM: This can occur when a driver such as: 
               NVRM: nouveau, rivafb, nvidiafb or rivatv 
               NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[   20.381482] NVRM: Try unloading the conflicting kernel module (and/or
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.
[   20.381485] NVRM: No NVIDIA devices probed.

and

$ lsmod | grep nouveau
nouveau              2269184  1
mxm_wmi                16384  1 nouveau
wmi                    32768  2 mxm_wmi,nouveau
drm_ttm_helper         16384  1 nouveau
i2c_algo_bit           16384  2 i915,nouveau
ttm                    86016  3 drm_ttm_helper,i915,nouveau
drm_kms_helper        307200  2 i915,nouveau
drm                   606208  11 drm_kms_helper,drm_ttm_helper,i915,ttm,nouveau
video                  53248  2 i915,nouveau

Going to try blacklisting nouveau and see how things go.
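As a sketch of how the `lsmod` check above could be scripted (POSIX shell assumed; `module_listed` is a hypothetical helper name), the function below matches the module name against the first column only. A plain `grep nouveau` also matches rows such as `mxm_wmi`, where nouveau only appears in the "used by" column.

```shell
# Return success if the named module appears in lsmod-format text on stdin.
# Only the first column (the module name) is compared, and the match is exact.
# Example: lsmod | module_listed nouveau && echo "nouveau is loaded"
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}
```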

cstrahan-blueshift commented 2 years ago

Success! Blacklisted nouveau like so:

sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo update-initramfs -u
sudo kernelstub

After a restart, when I run NVIDIA X Server Settings the window is no longer empty.

$ sudo lsmod | grep nvidia
nvidia_drm             65536  5
nvidia_modeset       1150976  5 nvidia_drm
nvidia              36917248  219 nvidia_modeset
drm_kms_helper        307200  2 nvidia_drm,i915
drm                   606208  10 drm_kms_helper,nvidia,nvidia_drm,i915,ttm

$ sudo dmesg | grep 'NVRM\|nvidia'
[   13.636879] nvidia: module license 'NVIDIA' taints kernel.
[   13.720349] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[   13.726867] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[   13.780378] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  495.46  Wed Oct 27 16:31:33 UTC 2021
[   13.819784] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  495.46  Wed Oct 27 16:22:48 UTC 2021
[   13.830860] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   16.311752] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[   17.130636] audit: type=1400 audit(1644430025.691:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1040 comm="apparmor_parser"
[   17.130642] audit: type=1400 audit(1644430025.691:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1040 comm="apparmor_parser"

With the nvidia driver now actually working, my external monitor is working once again.

NOTE: when installing the driver, I had to explicitly opt into having DKMS set up when prompted ("No" is highlighted by default when the prompt comes up).
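To double-check afterwards that the DKMS registration actually happened, something like the following sketch could be used. It assumes the `dkms status` output format, which differs between dkms releases: the pattern covers the two layouts I know of and may need adjusting. `nvidia_dkms_installed` is a hypothetical helper name.

```shell
# Return success if `dkms status` output on stdin shows an nvidia module as installed.
# Assumed formats:
#   older dkms: "nvidia, 495.46, <kernel>, x86_64: installed"
#   newer dkms: "nvidia/495.46, <kernel>, x86_64: installed"
# Example: dkms status | nvidia_dkms_installed || echo "nvidia is not built via DKMS"
nvidia_dkms_installed() {
    grep -Eq '^nvidia[,/].*: installed'
}
```

If the module is not registered with DKMS, it will not be rebuilt automatically when the kernel is updated.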

jacobgkau commented 2 years ago

@cstrahan-blueshift Thank you for sharing your results!

Testing with https://github.com/pop-os/nvidia-graphics-drivers/pull/134 (NVIDIA driver 510.54), I still saw the issue occur when rebooting in NVIDIA mode on oryp7. However, with https://github.com/pop-os/linux/pull/122 (Linux kernel 5.15.23), I am not currently seeing the issue occur on either driver version (although it's hard to rule anything out since it's intermittent.)
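Because the failure is intermittent, a tally over many boots is more convincing than a single clean reboot. The sketch below assumes the kernel log of each boot has been saved to its own file (for example with `journalctl -k -b -1 > boots/boot-1.log`, which requires a persistent journal). The directory layout and the function name are assumptions for illustration.

```shell
# Count how many saved kernel logs in a directory contain the failure.
# Expects one *.log file per boot. Prints "<failed> <total>".
# Note: POSIX sh has no local variables, so dir/failed/total/f are globals.
tally_failed_boots() {
    dir=$1
    failed=0
    total=0
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue   # the glob matched nothing
        total=$((total + 1))
        if grep -q 'RmInitAdapter failed' "$f"; then
            failed=$((failed + 1))
        fi
    done
    echo "$failed $total"
}
```

Comparing the tally before and after a kernel or driver change gives a rough idea of whether the change helped, although a small number of boots still cannot rule the bug out.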

cstrahan-blueshift commented 2 years ago

Just rebooted earlier, hoping that might resolve Zoom issues that have been plaguing me for the past couple of weeks or so. The attached monitor just showed a blinking _. Realized I must have been on integrated graphics somehow, so I disconnected DisplayPort, went into the settings and switched to nvidia graphics; rebooted. Same thing. Settings show I'm still on integrated graphics.

Going to try to install NVIDIA-Linux-x86_64-510.60.02.run, and see if I have any luck.

Feeling a bit embarrassed at work, as I was the one that requested my oryp7, but my productivity has been hampered by graphics driver problems :disappointed:.

cstrahan-blueshift commented 2 years ago

That appears to have worked. Though now I'm wondering:

Don't know how I'd figure that out.

cstrahan-blueshift commented 2 years ago

Actually, scratch what I last wrote. The kernel module loaded successfully, but I couldn't use my external monitor and the display settings didn't show the monitor. I think some xserver components must have got mangled when I tried to get rid of the old nvidia packages to install NVIDIA-Linux-x86_64-510.60.02.run.

Decided to try out the packaged nvidia drivers again, following what was described: https://github.com/pop-os/nvidia-graphics-drivers/pull/144#issuecomment-1088804475

Now everything is confirmed to be working again.

This is what I have installed presently:

$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-common-510/impish,impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 all [installed,automatic]
libnvidia-compute-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-compute-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
libnvidia-decode-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-decode-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
libnvidia-egl-wayland1/impish,now 1:1.1.7-2build1 amd64 [installed,automatic]
libnvidia-encode-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-encode-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
libnvidia-extra-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-fbc1-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-fbc1-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
libnvidia-gl-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
libnvidia-gl-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 i386 [installed,automatic]
nvidia-compute-utils-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
nvidia-dkms-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
nvidia-driver-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed]
nvidia-kernel-common-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
nvidia-kernel-source-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
nvidia-settings/impish-updates,now 470.57.01-0ubuntu3.1~0.21.10.1 amd64 [installed,automatic]
nvidia-utils-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
xserver-xorg-video-nvidia-510/impish,now 510.60.02-1pop0~1649099333~21.10~aedf526 amd64 [installed,automatic]
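Since apt warns that its CLI output is not stable, a script would be better off reading from `dpkg-query`. The sketch below (POSIX shell and awk assumed; `nvidia_version_summary` is a hypothetical name) reduces a package list to upstream versions, which makes a mixed install easy to spot. In the list above, for example, nvidia-settings is at 470.57.01 while the driver packages are at 510.60.02.

```shell
# Summarize upstream versions from "package version" lines on stdin, e.g. from:
#   dpkg-query -W -f='${Package} ${Version}\n' '*nvidia*' | nvidia_version_summary
# Strips any epoch ("1:") and everything after the first "-" (the packaging suffix).
# Prints "<upstream-version> <package-count>", sorted.
nvidia_version_summary() {
    awk '{
        v = $2
        sub(/^[0-9]+:/, "", v)
        sub(/-.*/, "", v)
        count[v]++
    }
    END { for (v in count) print v, count[v] }' | sort
}
```

More than one driver-series version in the summary is not necessarily a problem, but it is a quick hint about leftovers from an earlier install.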

crubel commented 2 years ago

I meant to add these from my dmesg output:

nvidia 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)

NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:746)

You can see that the NV card's PCIe config space was not accessible. My Oryp7 is back with the System76 support folks so that they can figure it all out…

Keep an eye out for updates; when I hear anything, I’ll post it in case it’s relevant to your problems. If it is, then it’s most likely a software/firmware issue of some kind, unless we have identically broken hardware, which is unlikely…
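The dmesg signatures quoted in this thread are not all the same failure, so it may help to sort a log into one of them before comparing notes. This is a minimal sketch in POSIX shell. The function name and the labels it prints are made up for illustration, and the D3cold check comes first because in the logs above that line is printed just before RmInitAdapter fails.

```shell
# Classify a kernel log (read from stdin) by the signatures quoted in this thread.
# The labels are hypothetical and only meaningful within this sketch.
# Example: sudo dmesg | classify_nvidia_log
classify_nvidia_log() {
    log=$(cat)
    case $log in
        *"can't change power state from D3cold to D0"*)
            echo pci-power ;;        # PCI config space was inaccessible
        *"NVIDIA probe routine was not called"*)
            echo probe-skipped ;;    # another driver (e.g. nouveau) owned the device
        *"RmInitAdapter failed"*)
            echo rm-init ;;          # adapter init failed without the D3cold line
        *)
            echo no-known-failure ;;
    esac
}
```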

crubel commented 1 year ago

Hello,

My Oryp7 had similar issues, but mine turned out to look more like a hardware problem, because the message logs showed it intermittently failing to configure the NV card properly on the PCIe bus. I first noticed my issue because nvidia-settings would not show the NV card as present on every reboot.

leviport commented 1 year ago

There is a firmware update in the works that should make this bug go away. I don't have an ETA at this time, but I'm hoping it will be ready soon.