pop-os / pop

A project for managing all Pop!_OS sources
https://system76.com/pop

nvidia error "GPU has fallen off the bus" #3363

Open esplinr opened 3 months ago

esplinr commented 3 months ago

Distribution (run cat /etc/os-release):

NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 22.04 LTS"
VERSION_ID="22.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=jammy
UBUNTU_CODENAME=jammy
LOGO=distributor-logo-pop-os

Related Application and/or Package Version (run apt policy $PACKAGE NAME):

From NVIDIA Settings: NVIDIA Driver Version: 555.58.02

From apt search system76 | grep installed: system76-driver-nvidia/jammy,jammy,now 20.04.94~1723838773~22.04~8237cd8 all [installed]

From flatpak list: nvidia-555-58-02 org.freedesktop.Platform.GL32.nvidia-555-58-02 1.4 user

uname -a
Linux richard 6.9.3-76060903-generic #202405300957~1721174657~22.04~abb7c06 SMP PREEMPT_DYNAMIC Wed J x86_64 x86_64 x86_64 GNU/Linux

Issue/Bug Description: About half of the time I return to my computer after a break, the computer refuses to wake and the fans are going at full blast.

The only two times I checked the logs from before the reboot, they ended with these lines:

Aug 21 21:03:38.740382 richard kernel: workqueue: nv_drm_handle_hotplug_event [nvidia_drm] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
Aug 21 21:04:12.444523 richard kernel: snd_hda_intel 0000:01:00.1: Unable to change power state from D0 to D3hot, device inaccessible
Aug 21 21:04:12.672363 richard kernel: NVRM: GPU at PCI:0000:01:00: GPU-58eb6437-6614-ceb3-7b75-a8316586b521
Aug 21 21:04:12.672560 richard kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Aug 21 21:04:12.672615 richard kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Aug 21 21:04:13.243482 richard kernel: NVRM: Error in service of callback 
Aug 21 21:04:34.378353 richard kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
Aug 21 21:04:34.378391 richard kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f
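
For anyone who wants to check their own logs for the same signature, here is a minimal Python sketch that pulls the Xid lines out of kernel log text. The function name `find_xid_events` and the `SAMPLE` text are illustrative only (the sample is copied from the lines above), and the regular expression assumes the `NVRM: Xid (PCI:...): <number>,` layout seen in this thread.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', ...
# The layout is assumed from the log lines quoted in this issue.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def find_xid_events(log_text):
    """Return (log_prefix, pci_address, xid_number) for each Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            # Everything before " kernel:" is the timestamp and hostname.
            prefix = line.split(" kernel:")[0].strip()
            events.append((prefix, match.group("pci"), int(match.group("xid"))))
    return events


SAMPLE = """\
Aug 21 21:04:12.672363 richard kernel: NVRM: GPU at PCI:0000:01:00: GPU-58eb6437-6614-ceb3-7b75-a8316586b521
Aug 21 21:04:12.672560 richard kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
"""

if __name__ == "__main__":
    for prefix, pci, xid in find_xid_events(SAMPLE):
        print(f"{prefix}: Xid {xid} on PCI {pci}")
```

To run it against a real log, the output of something like `journalctl -k -b -1` (kernel messages from the previous boot) could be passed in instead of `SAMPLE`.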

Steps to reproduce (if you know): Leave the computer for more than 10 minutes, and it happens about 50% of the time.

I thought it was related to #3313 because it correlates with a suspend, but I've had it happen twice when the screen blanks but before the automatic suspend should have happened.

I've also had a couple of times where I jiggled the mouse and it appeared to recover correctly from suspend, but I didn't proceed to log back in and the machine hung with the fan at full blast.

Expected behavior: The computer should wake up from a blank screen or suspend.

Other Notes: My research suggests that previous NVIDIA drivers had a bug that showed similar behavior when the GPU entered a low-power state. My problem does seem correlated with the machine being idle and reducing power consumption.

cstrahan commented 2 days ago

https://download.nvidia.com/XFree86/Linux-x86_64/560.35.03/README/knownissues.html

Driver fails to initialize with some versions of RHEL 8

Some versions of Red Hat Enterprise Linux 8 kernels have a bug that causes driver initialization to fail with an error such as:

NVRM: Xid (PCI:0000:09:00): 79, pid=2172, GPU has fallen off the bus.
NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x26:0x65:1239)
NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0

See the Red Hat knowledge base article https://access.redhat.com/solutions/5825061 to find the specific affected and fixed kernel versions.

The article is paywalled, so I can't see the details.

Though my error looks a bit different (but maybe the same cause?):

Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM: GPU at PCI:0000:01:00: GPU-456ad6c8-6097-2bca-fa9d-6b8f7faff040
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM: GPU0 GSP RPC buffer contains function 78 (DUMP_PROTOBUF_COMPONENT) and data 0x0000000000000000 0x0000000000000000.
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:      0    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278736d251d8 0x0000000000000000          y
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -1    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278736843251 0x00062787368433c0    367us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -2    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278736361250 0x00062787363613d0    384us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -3    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278735e7f23b 0x0006278735e7f4a4    617us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -4    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x000627873599d258 0x000627873599d36e    278us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -5    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x00062787354bb22e 0x00062787354bb7bc   1422us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -6    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278734fd91cc 0x0006278734fd9734   1384us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -7    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278734af7252 0x0006278734af73c3    369us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:      0    4099 POST_EVENT            0x0000000000000000 0x0000000000000000 0x0006278736c7d831 0x0006278736c7d832      1us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -1    4099 POST_EVENT            0x0000000000000000 0x0000000000000000 0x0006278736c7d828 0x0006278736c7d830      8us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -2    4099 POST_EVENT            0x0000000000000000 0x0000000000000000 0x0006278736c7b99c 0x0006278736c7b99e      2us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -3    4099 POST_EVENT            0x0000000000000000 0x0000000000000000 0x0006278736c7b996 0x0006278736c7b99c      6us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -4    4099 POST_EVENT            0x0000000000000000 0x0000000000000000 0x0006278736c79d1a 0x0006278736c79d1b      1us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -5    4099 POST_EVENT            0x0000000000000000 0x0000000000000000 0x0006278736c79d14 0x0006278736c79d19      5us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -6    4099 POST_EVENT            0x0000000000000000 0x0000000000000000 0x0006278736c7825d 0x0006278736c7825e      1us  
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -7    4099 POST_EVENT            0x0000000000000000 0x0000000000000000 0x0006278736c78258 0x0006278736c7825c      4us  
Nov 22 15:39:14 cstrahan-pop-os kernel: CPU: 15 PID: 1126 Comm: nv_queue Tainted: P           OE      6.9.3-76060903-generic #202405300957~1732141768~22.04~f2697e1
Nov 22 15:39:14 cstrahan-pop-os kernel: Hardware name: System76 Gazelle/Gazelle, BIOS 2021-09-30_14b8a6e 09/28/2021
Nov 22 15:39:14 cstrahan-pop-os kernel: Call Trace:
Nov 22 15:39:14 cstrahan-pop-os kernel:  <TASK>
Nov 22 15:39:14 cstrahan-pop-os kernel:  dump_stack_lvl+0x76/0xa0
Nov 22 15:39:14 cstrahan-pop-os kernel:  dump_stack+0x10/0x20
Nov 22 15:39:14 cstrahan-pop-os kernel:  os_dump_stack+0xe/0x20 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  _nv012948rm+0x2c5/0x590 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel: WARNING: kernel stack frame pointer at 00000000a54725e7 in nv_queue:1126 has bad value 000000008b7fef06
Nov 22 15:39:14 cstrahan-pop-os kernel: unwind stack type:0 next_sp:0000000000000000 mask:0x2 graph_idx:0
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c2ebc260: ffffb117c1b3bb78 (0xffffb117c1b3bb78)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000001a8173dc: ffffffff9fe5af39 (show_trace_log_lvl+0x269/0x420)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000fb4b1676: ffffffffa1993e31 (linux_banner+0x3f2db1/0x40b560)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000087d919e6: ffff995056202f40 (0xffff995056202f40)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000637eabcf: ffffffffa19c6821 (SIGMA2+0x19de1/0x142400)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000036659842: ffffb117c1b3bbd0 (0xffffb117c1b3bbd0)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000003f539546: 000000000000000a (0xa)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000003cedac3a: 0000000000000002 (0x2)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000a9ac5003: 0000000000000001 (0x1)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000871ba877: ffffb117c1b38000 (0xffffb117c1b38000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000008f5368ad: ffffb117c1b3c000 (0xffffb117c1b3c000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000005c4cdb7a: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000002b4a2ebe: ffffb117c1b38000 (0xffffb117c1b38000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000e402852a: ffffb117c1b3c000 (0xffffb117c1b3c000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000001568cbd5: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000000cd054b7: 0000000000000002 (0x2)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000061da1395: ffff995056202f40 (0xffff995056202f40)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000006201816: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000003a91b4f1: 0000000000000001 (0x1)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000056db09c6: ffffb117c1b3bbc8 (0xffffb117c1b3bbc8)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000000061df54: ffffb117c1b3ba78 (0xffffb117c1b3ba78)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000053dcbcb2: ffffffffc2533ca5 (_nv012948rm+0x2c5/0x590 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000005e943300: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000a30d4c8d: 65179c2b5fd17000 (0x65179c2b5fd17000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000008e8e8b77: 0000000000000246 (0x246)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000002a08ed9f: ffffffffa19c6821 (SIGMA2+0x19de1/0x142400)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000090690bb7: ffff995051684008 (0xffff995051684008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000aad58055: 000000000000004c (0x4c)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000041c14843: 000000000000000a (0xa)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000003b3db9f: ffffb117c1b3bb88 (0xffffb117c1b3bb88)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000e3678429: ffffffff9fe5b210 (show_stack+0x20/0x70)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000062f48df1: ffffb117c1b3bba8 (0xffffb117c1b3bba8)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c5bf6249: ffffffffa0f76ce6 (dump_stack_lvl+0x76/0xa0)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000081cebf60: ffff99506bf00008 (0xffff99506bf00008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000eaa62432: ffff99506bec0008 (0xffff99506bec0008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000007568d299: ffffb117c1b3bbb8 (0xffffb117c1b3bbb8)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000761a9fac: ffffffffa0f76d30 (dump_stack+0x10/0x20)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000006cb5a651: ffffb117c1b3bbc8 (0xffffb117c1b3bbc8)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000007cef089a: ffffffffc1ee1ace (os_dump_stack+0xe/0x20 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000a54725e7: ffff995173a5ac50 (0xffff995173a5ac50)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000af82ca42: ffffffffc2533ca5 (_nv012948rm+0x2c5/0x590 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000005d26a12a: 000000002080a7d7 (0x2080a7d7)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000e9a1c67c: 0000000000000002 (0x2)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000dc434d85: ffff995051684008 (0xffff995051684008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c96b27a9: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000006016cac0: ffff99506bec0008 (0xffff99506bec0008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000001869bce1: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000b7cd814f: 000000000000004c (0x4c)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000e289d14f: ffffffffc2856667 (_nv012865rm+0x77/0x330 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000018f5f6ed: ffff99506bf21050 (0xffff99506bf21050)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000a0a4d384: ffff995173a5adb0 (0xffff995173a5adb0)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000491689a9: ffff99506bec0008 (0xffff99506bec0008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000041d735d7: 000000002080a7d7 (0x2080a7d7)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000569b9b3a: ffff995051684008 (0xffff995051684008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000004f4fce06: ffffffffc28727df (_nv048628rm+0x49f/0x7f0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000036d552bf: ffff995173a5ae98 (0xffff995173a5ae98)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000004a13bd1c: ffff99506bec0008 (0xffff99506bec0008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000008e283d72: ffffffffc2c96120 (_nv000464rm+0xaf0/0xfffffffffff119d0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000059733b3d: ffff9950ac4a8020 (0xffff9950ac4a8020)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000002020c319: ffff995173a5ae98 (0xffff995173a5ae98)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000cc6223d1: ffffffffc1f8ddd1 (_nv048204rm+0xf1/0x1f0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c77a70e4: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000b54cb6b9: ffff995173a5adb0 (0xffff995173a5adb0)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000a82e3e3f: ffffffffc2c96120 (_nv000464rm+0xaf0/0xfffffffffff119d0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000646a726f: ffffffffc28ad760 (_nv047909rm+0xd0/0x1b0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000004a183923: ffff995173a5ad80 (0xffff995173a5ad80)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000002d3b0571: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000003d663569: ffffffffc2c96120 (_nv000464rm+0xaf0/0xfffffffffff119d0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000189f274d: ffff995173a5ae98 (0xffff995173a5ae98)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000be97e31e: ffff995173a5ae60 (0xffff995173a5ae60)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000be24c688: ffffffffc28aaa2f (_nv049933rm+0x3ff/0x500 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000007348d9de: 0000000000000002 (0x2)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000b8cfdada: 0000000000000002 (0x2)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000003fe4a053: ffff995173a5affe (0xffff995173a5affe)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000b74ed744: 00000000c1d0000b (0xc1d0000b)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000405bd38f: 000000002080a7d7 (0x2080a7d7)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000aceb672f: ffffffffc1f80c7e (_nv014741rm+0x42e/0x690 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000011543ba8: ffff99505620e000 (0xffff99505620e000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000044ae1145: ffffffffc2c95f40 (_nv000464rm+0x910/0xfffffffffff119d0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c405154e: ffff99505620e000 (0xffff99505620e000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000ac29eca0: ffff995056202f40 (0xffff995056202f40)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000009b2acbab: ffff99505620e740 (0xffff99505620e740)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000008c542605: ffffffffc1f815d9 (_nv048046rm+0x29/0x30 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000037b1348b: ffffffffc2c95f40 (_nv000464rm+0x910/0xfffffffffff119d0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000005b1f9049: ffffffffc2c95ff8 (_nv000464rm+0x9c8/0xfffffffffff119d0 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000001efd9716: ffff995173a5aff0 (0xffff995173a5aff0)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000cc6a0747: ffffffffc2aa7a50 (_nv000673rm+0x60/0xa1 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000004e495556: ffffb117c1b3bd58 (0xffffb117c1b3bd58)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000009e7d991: ffff9950647aa808 (0xffff9950647aa808)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000007269fae2: ffff99506bec0008 (0xffff99506bec0008)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000008e8afb24: ffffffffc2aa7f4e (_nv052829rm+0x3e/0x159 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000085875629: ffff995041cbff68 (0xffff995041cbff68)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000038b25998: ffff9950ace749c8 (0xffff9950ace749c8)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000091d0412a: ffffffffc284e520 (_nv052686rm+0x110/0x110 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000b22b4510: ffffffffc284e54c (_nv015199rm+0x2c/0x50 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000008b6412fa: ffff995041cbff68 (0xffff995041cbff68)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000094affc8a: ffff995041cbff68 (0xffff995041cbff68)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000002534b593: ffffb117c1b3bea8 (0xffffb117c1b3bea8)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000005062195d: ffffffffc285054a (_nv052884rm+0x1a/0x40 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000006004b015: ffff995173a58000 (0xffff995173a58000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000045eb1809: ffffffffc2a99690 (_nv015200rm+0x20/0x50 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000000f60f7dd: ffff995052fd4d48 (0xffff995052fd4d48)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c17e93cd: ffffffffc2a9be03 (rm_execute_work_item+0x113/0x170 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000622ac888: 01ffffff00000001 (0x1ffffff00000001)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000de6afff4: 0000080000000001 (0x80000000001)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000ad044b7d: 0000000000000466 (0x466)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000063f754bb: 000000010002b9d4 (0x10002b9d4)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000005ed8283d: 000f423582740000 (0xf423582740000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c1bec53e: 000f423670df2800 (0xf423670df2800)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000035574f7a: 000f423670df2800 (0xf423670df2800)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c740b6f2: 000f4235f9a99400 (0xf4235f9a99400)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000abad0f36: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000081fe48d4: 000001200000000f (0x1200000000f)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000eb08096b: 0000000000000466 (0x466)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000007d5a3596: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000007005d036: fffffff000000000 (0xfffffff000000000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000016042997: ffffffffc2cc9450 (_nv046641rm+0x90/0xffffffffffeddc40 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000002ab99976: 0000000000000010 (0x10)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000848bf655: ffff995173a58000 (0xffff995173a58000)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000006fdbf42d: ffff995052fd4d48 (0xffff995052fd4d48)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000004552d4f8: ffff99505620e748 (0xffff99505620e748)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000b76791c8: ffffffffc1edfc4c (os_execute_work_item+0x6c/0x90 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000b69e380e: ffff9950f65aa0c0 (0xffff9950f65aa0c0)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000083a84d83: ffff99505620e730 (0xffff99505620e730)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000037329954: ffffb117c1b3bee0 (0xffffb117c1b3bee0)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000001cf6c75d: ffffffffc1ee437e (_main_loop+0x7e/0x140 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000002439e494: ffff995042d69c00 (0xffff995042d69c00)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000023d98e39: ffff995056202f40 (0xffff995056202f40)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000011c34886: ffffb117c1093620 (0xffffb117c1093620)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000005005dbac: ffff995064559a00 (0xffff995064559a00)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000000b8c268b: ffffffffc1ee4300 (__pfx__main_loop+0x10/0x10 [nvidia])
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000d44b98ff: ffffb117c1b3bf20 (0xffffb117c1b3bf20)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000b77b95ff: ffffffff9ff3a461 (kthread+0xe1/0x110)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000eef4d6e3: ffff99505620e730 (0xffff99505620e730)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000097e77d9: ffffb117c1b3bf58 (0xffffb117c1b3bf58)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000c53d06fd: ffffffff9ff3a380 (__pfx_kthread+0x10/0x10)
Nov 22 15:39:14 cstrahan-pop-os kernel: 0000000048d1fb15: ffff995042d69c00 (0xffff995042d69c00)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000d882dee5: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000006cd5e345: ffffb117c1b3bf48 (0xffffb117c1b3bf48)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000008847ee97: ffffffff9fe68194 (ret_from_fork+0x44/0x70)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000000590a677: ffffffff9ff3a380 (__pfx_kthread+0x10/0x10)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000005514a715: ffff995042d69c00 (0xffff995042d69c00)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000006bc56062: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000ba2922c5: ffffb117c1b3bf59 (0xffffb117c1b3bf59)
Nov 22 15:39:14 cstrahan-pop-os kernel: 00000000249d08b7: ffffffff9fe0516a (ret_from_fork_asm+0x1a/0x30)
Nov 22 15:39:14 cstrahan-pop-os kernel: 000000000e241620: 0000000000000000 ...
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv012865rm+0x77/0x330 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv048628rm+0x49f/0x7f0 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv048204rm+0xf1/0x1f0 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv047909rm+0xd0/0x1b0 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv049933rm+0x3ff/0x500 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv014741rm+0x42e/0x690 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv048046rm+0x29/0x30 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv000673rm+0x60/0xa1 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv052829rm+0x3e/0x159 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv052686rm+0x110/0x110 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv015199rm+0x2c/0x50 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv052884rm+0x1a/0x40 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _nv015200rm+0x20/0x50 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? rm_execute_work_item+0x113/0x170 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? os_execute_work_item+0x6c/0x90 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? _main_loop+0x7e/0x140 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? __pfx__main_loop+0x10/0x10 [nvidia]
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? kthread+0xe1/0x110
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? __pfx_kthread+0x10/0x10
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? ret_from_fork+0x44/0x70
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? __pfx_kthread+0x10/0x10
Nov 22 15:39:14 cstrahan-pop-os kernel:  ? ret_from_fork_asm+0x1a/0x30
Nov 22 15:39:14 cstrahan-pop-os kernel:  </TASK>
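
A note on reading the "RPC history (CPU -> GSP)" table above: in the rows shown, the duration column equals `ts_end - ts_start` if the timestamps are read as hexadecimal microseconds, and entry 0 has a `ts_end` of zero, which suggests the last `GSP_RM_CONTROL` call never completed. That is an inference from these rows, not something taken from NVIDIA documentation. A rough Python sketch of the arithmetic, with a hypothetical `rpc_durations` helper and sample rows copied from the log:

```python
import re

# One history row: entry, function id, function name, data0, data1, ts_start, ts_end.
# The column layout is assumed from the table printed in the log above.
ROW_RE = re.compile(
    r"NVRM:\s+(?P<entry>-?\d+)\s+(?P<func_id>\d+)\s+(?P<func>\w+)\s+"
    r"(?P<data0>0x[0-9a-f]+)\s+(?P<data1>0x[0-9a-f]+)\s+"
    r"(?P<ts_start>0x[0-9a-f]+)\s+(?P<ts_end>0x[0-9a-f]+)"
)


def rpc_durations(log_text):
    """Return (entry, function, duration) per row; duration is None if ts_end is 0."""
    rows = []
    for line in log_text.splitlines():
        match = ROW_RE.search(line)
        if not match:
            continue  # header lines and unrelated log lines are skipped
        start = int(match.group("ts_start"), 16)
        end = int(match.group("ts_end"), 16)
        duration = end - start if end else None
        rows.append((int(match.group("entry")), match.group("func"), duration))
    return rows


SAMPLE = """\
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:      0    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278736d251d8 0x0000000000000000          y
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -1    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278736843251 0x00062787368433c0    367us
Nov 22 15:39:14 cstrahan-pop-os kernel: NVRM:     -2    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x0006278736361250 0x00062787363613d0    384us
"""

if __name__ == "__main__":
    for entry, func, duration in rpc_durations(SAMPLE):
        status = "never completed" if duration is None else f"{duration} ticks"
        print(f"entry {entry:>2} {func}: {status}")
```
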
% cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 1
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

cstrahan commented 1 day ago

Created /etc/modprobe.d/cstrahan-nvidia.conf (named after myself so I can easily spot that I put it there):

options nvidia NVreg_PreserveVideoMemoryAllocations=0

After a reboot, I haven't had another crash (yet).
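
To confirm that the option was actually picked up after the reboot, the value can be read back from /proc/driver/nvidia/params (the same file dumped earlier). A minimal Python sketch follows; `parse_nvidia_params` and `SAMPLE` are illustrative names, and it assumes that the modprobe option `NVreg_PreserveVideoMemoryAllocations` shows up in that file without the `NVreg_` prefix, as the two names in this thread suggest.

```python
from pathlib import Path

PARAMS_PATH = Path("/proc/driver/nvidia/params")


def parse_nvidia_params(text):
    """Parse 'Key: value' lines into a dict, converting integer values to int."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            params[key] = value[1:-1]  # quoted string value, e.g. RmMsg: ""
        else:
            try:
                params[key] = int(value)
            except ValueError:
                params[key] = value
    return params


# Fallback sample taken from the dump earlier in this thread (before the change).
SAMPLE = """\
PreserveVideoMemoryAllocations: 1
DynamicPowerManagement: 3
RegistryDwords: ""
"""

if __name__ == "__main__":
    text = PARAMS_PATH.read_text() if PARAMS_PATH.exists() else SAMPLE
    params = parse_nvidia_params(text)
    value = params.get("PreserveVideoMemoryAllocations")
    if value == 0:
        print("PreserveVideoMemoryAllocations is 0: the override is active")
    else:
        print(f"PreserveVideoMemoryAllocations is {value!r}: the override is not active")
```
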

Vetpetmon commented 10 hours ago

So as I was confirming this was a driver issue by switching to a Windows OS, I had a random system reboot on the very computer reporting problems with the Linux NVIDIA drivers.

minidump: 112424-9281-01.dmp

Analysis: WinDbg_Output.txt

Eerily similar to my issue on Linux, both being timeout errors. I looked into Error 116 and found that it is related to power states; both the PSU and the GPU can be to blame. "but when I see a Video_TDR, I start with a DRIVER.." Source

The cause for this is, once again, going from high usage (running a moderately to extremely GPU-demanding game) to ultra-low usage (only having an IDE or web browser open, or putting the game at its lowest settings to achieve <2.5% GPU usage).

This seems to affect the GTX 16 series, RTX 20 series, and newer, independent of OS. In short: Turing-architecture GPUs and later are suddenly having these issues. Nothing like this has been reported for pre-Turing GPUs.