pop-os / pop

A project for managing all Pop!_OS sources
https://system76.com/pop
nvidia error "GPU has fallen off the bus" #3363

Open esplinr opened 3 months ago

esplinr commented 3 months ago

Distribution (run cat /etc/os-release):

NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 22.04 LTS"
VERSION_ID="22.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=jammy
UBUNTU_CODENAME=jammy
LOGO=distributor-logo-pop-os

Related Application and/or Package Version (run apt policy $PACKAGE NAME):

From NVIDIA Settings: NVIDIA Driver Version: 555.58.02
From apt search system76 | grep installed: system76-driver-nvidia/jammy,jammy,now 20.04.94~1723838773~22.04~8237cd8 all [installed]
From flatpak list: nvidia-555-58-02 org.freedesktop.Platform.GL32.nvidia-555-58-02 1.4 user

uname -a
Linux richard 6.9.3-76060903-generic #202405300957~1721174657~22.04~abb7c06 SMP PREEMPT_DYNAMIC Wed J x86_64 x86_64 x86_64 GNU/Linux

Issue/Bug Description: About half of the time I return to my computer after a break, the computer refuses to wake and the fans are going at full blast.

The only two times I checked the logs from before the reboot, they ended with these lines:

Aug 21 21:03:38.740382 richard kernel: workqueue: nv_drm_handle_hotplug_event [nvidia_drm] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
Aug 21 21:04:12.444523 richard kernel: snd_hda_intel 0000:01:00.1: Unable to change power state from D0 to D3hot, device inaccessible
Aug 21 21:04:12.672363 richard kernel: NVRM: GPU at PCI:0000:01:00: GPU-58eb6437-6614-ceb3-7b75-a8316586b521
Aug 21 21:04:12.672560 richard kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Aug 21 21:04:12.672615 richard kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Aug 21 21:04:13.243482 richard kernel: NVRM: Error in service of callback 
Aug 21 21:04:34.378353 richard kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
Aug 21 21:04:34.378391 richard kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f

Steps to reproduce (if you know): Leave the computer for more than 10 minutes, and it happens about 50% of the time.

I thought it was related to #3313 because it correlates with a suspend, but I've had it happen twice when the screen blanks but before the automatic suspend should have happened.

I've also had a couple of times where I jiggled the mouse and it appeared to recover correctly from suspend, but I didn't proceed to log back in and the machine hung with the fan at full blast.

Expected behavior: The computer should wake up from a blank screen or suspend.

Other Notes: My research suggests that previous NVIDIA drivers had a bug that showed similar behavior when the GPU entered a low-power state. My problem does seem correlated with the machine being idle and reducing power consumption.
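
For anyone else checking whether they hit the same failure, here is a minimal sketch for pulling the NVIDIA Xid events out of a kernel log. The function name xid_events is made up for illustration, and the usage line assumes a systemd journal that is kept across reboots.

```shell
# Read kernel log lines on stdin and print one "PCI address xid=code" line
# per NVIDIA Xid event. xid_events is a hypothetical helper name.
xid_events() {
  grep -E 'NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): [0-9]+' |
    sed -E 's/.*NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): ([0-9]+),.*/\1 xid=\2/'
}

# Usage (assumes journald keeps logs from the previous boot):
#   journalctl -k -b -1 --no-pager | xid_events
```

Fed the log lines quoted above, this reports Xid 79 for PCI:0000:01:00, the "GPU has fallen off the bus" event.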

alspitz commented 3 months ago

Had this happen on Ubuntu 22.04 (GPU has fallen off the bus) immediately after upgrading from 550 to 555 and rebooting (mistake!).

mdbartos commented 3 months ago

I have been having a similar issue since applying an update from Pop Shop on 8/22. Before this update everything was running fine.

Behavior

The computer will randomly freeze, usually about 10-15 minutes after booting, and the fans will start running at full blast even under no workload. System does not respond to mouse input and I have to reboot either by holding down the power button or using Alt+SysRq+b.

Output of journalctl during the crash is as follows:

Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: AER: Multiple Correctable error message received from 0000:01:00.0
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:   device [8086:a70d] error status/mask=00008001/00002000
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:    [ 0] RxErr                  (First)
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:    [15] HeaderOF
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: AER: Multiple Uncorrectable (Fatal) error message received from 0000:00:01.0
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:   device [8086:a70d] error status/mask=00040000/00010000
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:    [18] MalfTLP                (First)
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: AER:   TLP Header: 40000020 010000ff fff47880 00000000
Aug 25 03:44:51 balthasar kernel: nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Aug 25 03:44:51 balthasar kernel: snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Aug 25 03:44:51 balthasar kernel: NVRM: GPU at PCI:0000:01:00: GPU-9614e587-880d-7880-9895-2a74c029fbbe
Aug 25 03:44:51 balthasar kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Aug 25 03:44:51 balthasar kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Aug 25 03:44:51 balthasar kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                  NVRM: nvidia-bug-report.sh as root to collect this data before
                                  NVRM: the NVIDIA kernel module is unloaded.
Aug 25 03:44:52 balthasar kernel: pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
Aug 25 03:44:53 balthasar kernel: pcieport 0000:00:01.0: retraining failed
Aug 25 03:44:55 balthasar kernel: pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
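
A note on reading the AER lines above: the "error status/mask" values are hexadecimal bitmasks, and the bracketed numbers printed beneath them are the positions of the set status bits (00008001 has bits 0 and 15 set, which the kernel prints as RxErr and HeaderOF). A minimal sketch that lists the set bits of such a value follows; aer_set_bits is a hypothetical helper name, and it prints bit positions only rather than guessing names for bits that do not appear in this thread.

```shell
# Print the positions of the bits set in a hexadecimal AER status value.
# The function body runs in a subshell so its variables stay local.
aer_set_bits() (
  status=$((0x$1))
  bit=0
  out=""
  while [ "$bit" -lt 32 ]; do
    if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out }$bit"
    fi
    bit=$((bit + 1))
  done
  printf '%s\n' "$out"
)

# Usage:
#   aer_set_bits 00008001   # status from the Correctable error above
#   aer_set_bits 00040000   # status from the Uncorrectable (Fatal) error above
```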

System Info

cat /etc/os-release

NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 22.04 LTS"
VERSION_ID="22.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=jammy
UBUNTU_CODENAME=jammy
LOGO=distributor-logo-pop-os

uname -a

Linux balthasar 6.9.3-76060903-generic #202405300957~1721174657~22.04~abb7c06 SMP PREEMPT_DYNAMIC Wed J x86_64 x86_64 x86_64 GNU/Linux

NVIDIA Driver Version is 555.58.02. GPU is a 16 GB NVIDIA GeForce RTX 4070 Ti Super.

Suspected cause

The contents of the update that seem to have broken the system are as follows (lines related to some packages omitted):

Start-Date: 2024-08-22  17:36:12
Commandline: packagekit role='update-packages'
Requested-By: akagi (1000)
Upgrade: ...
pop-launcher:amd64 (1.2.3~1722960871~22.04~c994240, 1.2.3~1723669139~22.04~6a1b8b9),
...
popsicle:amd64 (1.3.3~1721773298~22.04~3a87912, 1.3.3~1724174665~22.04~a473f89),
system76-io-dkms:amd64 (1.0.3~1707324885~22.04~3dd4c32, 1.0.4~1724333961~22.04~968f68c),
pop-gtk-theme:amd64 (5.5.1~1686085983~22.04~190b5cc, 5.5.1~1723827328~22.04~25ea85d),
libwayland-cursor0:amd64 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
libwayland-cursor0:i386 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
...
system76-power:amd64 (1.2.0~1722536955~22.04~9894c79, 1.2.1~1724333998~22.04~8b9184c),
busybox-static:amd64 (1:1.30.1-7ubuntu3, 1:1.30.1-7ubuntu3.1),
libwayland-server0:amd64 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
libwayland-server0:i386 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
...
libcom-err2:amd64 (1.46.5-2ubuntu1.1, 1.46.5-2ubuntu1.2),
libcom-err2:i386 (1.46.5-2ubuntu1.1, 1.46.5-2ubuntu1.2),
...
pop-gnome-shell-theme:amd64 (5.5.1~1686085983~22.04~190b5cc, 5.5.1~1723827328~22.04~25ea85d),
...
busybox-initramfs:amd64 (1:1.30.1-7ubuntu3, 1:1.30.1-7ubuntu3.1),
...
popsicle-gtk:amd64 (1.3.3~1721773298~22.04~3a87912, 1.3.3~1724174665~22.04~a473f89),
...
system76-driver:amd64 (20.04.93~1722974544~22.04~bb3c2fe, 20.04.95~1724334075~22.04~12b4d15),
...
libwayland-egl1:amd64 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
libwayland-egl1:i386 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
...
libwayland-client0:amd64 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
libwayland-client0:i386 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
system76-driver-nvidia:amd64 (20.04.93~1722974544~22.04~bb3c2fe, 20.04.95~1724334075~22.04~12b4d15),
...
system76-dkms:amd64 (1.0.15~1718228158~22.04~ec10d1d, 1.0.15~1723747371~22.04~341bcde),
...
intel-microcode:amd64 (3.20240514.0ubuntu0.22.04.1, 3.20240813.0ubuntu0.22.04.2),
...
End-Date: 2024-08-22  17:37:11

Of these, I imagine the culprit is system76-driver-nvidia.
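
To compare a suspect package across machines, the old and new versions can be pulled out of an apt history entry like the one above. This is only a sketch: upgraded_versions is a hypothetical helper name, it relies on grep -o (available in GNU and BSD grep), and the log path in the usage line is the usual Ubuntu/Pop!_OS location, assumed rather than confirmed in this thread.

```shell
# Read apt history text on stdin and print "package:arch old -> new"
# for the package named in $1. upgraded_versions is a hypothetical name.
upgraded_versions() {
  grep -oE "(^|[ ,])$1:[a-z0-9]+ \([^)]*\)" |
    sed -E 's/^[ ,]//; s/ \(([^,]+), ([^)]+)\)/ \1 -> \2/'
}

# Usage (assumed log location):
#   upgraded_versions system76-driver-nvidia < /var/log/apt/history.log
```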

leviport commented 3 months ago

A potential workaround is to use the 550-server version until 555 is updated:

sudo apt purge ~nnvidia
sudo apt install nvidia-driver-550-server

then reboot

mdbartos commented 3 months ago

> A potential workaround is to use the 550-server version until 555 is updated:
>
> sudo apt purge ~nnvidia
> sudo apt install nvidia-driver-550-server
>
> then reboot

The first line doesn't work for me, should it be *nvidia? Or should it be ~nvidia?

leviport commented 3 months ago

Nope, ~nnvidia. I just tested it, and it should work for you.

mdbartos commented 3 months ago

OK, figured it out. The command works in bash but not zsh.

I ran the above commands and rebooted. On the first reboot, it dropped to a console displaying an error similar to the one @esplinr posted above:

nvidia 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
nvidia 0000:01:00.0: probe with driver nvidia failed with error -1

The second time I rebooted it successfully loaded Pop!_OS. I will follow up if freezes persist.
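
On the bash/zsh difference mentioned above: the likely cause is that zsh treats an unquoted ~nnvidia as a lookup for a user or named directory called "nnvidia" and aborts when none exists, while bash leaves the word unchanged. Quoting the pattern should make the command behave the same in both shells. A small sketch (the variable name pattern is only for illustration):

```shell
# Single quotes stop the shell from applying tilde expansion, so apt
# receives the package-name pattern ~nnvidia literally.
pattern='~nnvidia'
printf 'pattern passed to apt: %s\n' "$pattern"

# The workaround from this thread with the pattern quoted (not run here):
#   sudo apt purge '~nnvidia'
#   sudo apt install nvidia-driver-550-server
```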

mdbartos commented 3 months ago

It was working for a while, but unfortunately the freezes persist. I started getting them again after trying to watch a YouTube video in Chromium (which is when I first noticed the issue).

Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.480828:ERROR:gl_display.cc(497)] EGL Driver message (Critical) eglInitialize: glXQueryExtensionsString returned NULL
Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.480837:ERROR:gl_display.cc(767)] eglInitialize OpenGLES failed with error EGL_NOT_INITIALIZED
Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.480846:ERROR:gl_display.cc(801)] Initialization of all EGL display types failed.
Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.480855:ERROR:gl_ozone_egl.cc(26)] GLDisplayEGL::Initialize failed.
Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.481662:ERROR:viz_main_impl.cc(166)] Exiting GPU process due to errors during initialization
Aug 26 18:06:29 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:11:33 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:16:05 balthasar NetworkManager[874]: <info>  [1724714165.7504] dhcp6 (wlo1): state changed new lease, address=2600:1702:3830:1c10::28
Aug 26 18:16:39 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:17:01 balthasar CRON[8867]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 26 18:17:01 balthasar CRON[8868]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 26 18:17:01 balthasar CRON[8867]: pam_unix(cron:session): session closed for user root
Aug 26 18:17:17 balthasar NetworkManager[874]: <info>  [1724714237.7465] dhcp6 (enp6s0): state changed new lease, address=2600:1702:3830:1c10::39
Aug 26 18:21:44 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:26:48 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:27:21 balthasar gnome-shell[2667]: Can't update stage views actor <unnamed>[<MetaWindowGroup>:0x58fd886e4370] is on because it needs an allocation.
Aug 26 18:27:21 balthasar gnome-shell[2667]: Can't update stage views actor <unnamed>[<MetaWindowActorX11>:0x58fd8b0b6b40] is on because it needs an allocation.
Aug 26 18:27:21 balthasar gnome-shell[2667]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x58fd8b0bade0] is on because it needs an allocation.
Aug 26 18:28:10 balthasar systemd[2500]: app-gnome-x\x2dterminal\x2demulator-5299.scope: Consumed 5min 27.419s CPU time.
Aug 26 18:28:35 balthasar org.chromium.Chromium.desktop[7876]: [128:140:0826/182835.192379:ERROR:shared_image_manager.cc(327)] SharedImageManager::ProduceMemory: Trying to Produce a Memory representation from a>
Aug 26 18:28:37 balthasar org.chromium.Chromium.desktop[7876]: [128:140:0826/182837.692810:ERROR:shared_image_manager.cc(327)] SharedImageManager::ProduceMemory: Trying to Produce a Memory representation from a>
Aug 26 18:29:00 balthasar kernel: NVRM: GPU at PCI:0000:01:00: GPU-9614e587-880d-7880-9895-2a74c029fbbe
Aug 26 18:29:00 balthasar kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Aug 26 18:29:00 balthasar kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Aug 26 18:29:00 balthasar kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                  NVRM: nvidia-bug-report.sh as root to collect this data before
                                  NVRM: the NVIDIA kernel module is unloaded.

Oddly, the nvidia-driver-555 was installed on 8/8 and the computer worked fine up until 8/22 when the additional updates were installed.

For now, I am just removing all nvidia drivers using sudo apt remove ~nnvidia and seeing if the system can stay up.

mdbartos commented 3 months ago

Same freeze occurs even with all nvidia drivers uninstalled. Not sure how to proceed at this point. Will probably need a support ticket.

Aug 26 19:08:11 balthasar kernel: nouveau 0000:01:00.0: timeout
Aug 26 19:08:11 balthasar kernel: WARNING: CPU: 4 PID: 2541 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x176/0x180 [nouveau]
Aug 26 19:08:11 balthasar kernel: Modules linked in: tls rfcomm snd_seq_dummy snd_hrtimer nvme_fabrics ccm cmac algif_hash algif_skcipher af_alg zstd intel_rapl_msr intel_rapl_common intel_uncore_frequency inte>
Aug 26 19:08:11 balthasar kernel:  ecdh_generic snd_seq_device iTCO_wdt intel_pmc_bxt cfg80211 hid_multitouch mtd system76_thelio_io(OE) ecc bfq snd_timer joydev intel_cstate input_leds mei_hdcp iTCO_vendor_sup>
Aug 26 19:08:11 balthasar kernel:  pinctrl_alderlake aesni_intel crypto_simd cryptd
Aug 26 19:08:11 balthasar kernel: CPU: 4 PID: 2541 Comm: Xorg Tainted: G        W  OE      6.9.3-76060903-generic #202405300957~1721174657~22.04~abb7c06
Aug 26 19:08:11 balthasar kernel: Hardware name: System76 Thelio Mira/Thelio Mira, BIOS FJd Z5 06/12/2024
Aug 26 19:08:11 balthasar kernel: RIP: 0010:tu102_vmm_flush+0x176/0x180 [nouveau]
Aug 26 19:08:11 balthasar kernel: Code: 8b 40 10 48 8b 78 10 48 8b 5f 50 48 85 db 75 03 48 8b 1f e8 bc b5 1e f1 48 89 da 48 c7 c7 62 7a 62 c1 48 89 c6 e8 fa 1c 71 f0 <0f> 0b eb 88 e8 d1 44 83 f1 90 90 90 90 90 >
Aug 26 19:08:11 balthasar kernel: RSP: 0018:ffffb8444c82f560 EFLAGS: 00010246
Aug 26 19:08:11 balthasar kernel: RAX: 0000000000000000 RBX: ffff8bacc4580500 RCX: 0000000000000000
Aug 26 19:08:11 balthasar kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Aug 26 19:08:11 balthasar kernel: RBP: ffffb8444c82f5a8 R08: 0000000000000000 R09: 0000000000000000
Aug 26 19:08:11 balthasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8bacc23e8000
Aug 26 19:08:11 balthasar kernel: R13: 0000000080000001 R14: 0000000000000000 R15: 0000000000000001
Aug 26 19:08:11 balthasar kernel: FS:  00007369e09d4a80(0000) GS:ffff8bcbfee00000(0000) knlGS:0000000000000000
Aug 26 19:08:11 balthasar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 26 19:08:11 balthasar kernel: CR2: 00007369e059d000 CR3: 00000001622d2000 CR4: 0000000000f50ef0
Aug 26 19:08:11 balthasar kernel: PKRU: 55555554
Aug 26 19:08:11 balthasar kernel: Call Trace:
Aug 26 19:08:11 balthasar kernel:  <TASK>
Aug 26 19:08:11 balthasar kernel:  ? show_regs+0x6c/0x80
Aug 26 19:08:11 balthasar kernel:  ? __warn+0x88/0x140
Aug 26 19:08:11 balthasar kernel:  ? tu102_vmm_flush+0x176/0x180 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? report_bug+0x182/0x1b0
Aug 26 19:08:11 balthasar kernel:  ? handle_bug+0x46/0x90
Aug 26 19:08:11 balthasar kernel:  ? exc_invalid_op+0x18/0x80
Aug 26 19:08:11 balthasar kernel:  ? asm_exc_invalid_op+0x1b/0x20
Aug 26 19:08:11 balthasar kernel:  ? tu102_vmm_flush+0x176/0x180 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_vmm_iter.constprop.0+0x3d5/0x7d0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_gp100_vmm_pgt_dma+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_nvkm_vmm_ref_ptes+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_gp100_vmm_pgt_dma+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_vmm_ptes_get_map+0x103/0x140 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_nvkm_vmm_ref_ptes+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_gp100_vmm_pgt_dma+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? nvkm_vmm_map_valid+0xcb/0x210 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_vmm_map_locked+0x228/0x3c0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? nvkm_ioctl_new+0x1cc/0x2e0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_vmm_map+0x9e/0x100 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_mem_map_dma+0x57/0x90 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_uvmm_mthd_map.isra.0+0x23b/0x3d0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_uvmm_mthd+0x9e/0x540 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_object_mthd+0x17/0x40 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_ioctl_mthd+0x5d/0xc0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_ioctl+0x132/0x2a0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_client_ioctl+0xe/0x20 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvif_object_mthd+0xd8/0x220 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvif_vmm_map+0x87/0x150 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nouveau_mem_map+0xab/0x100 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nouveau_vma_new+0x223/0x250 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nouveau_gem_object_open+0x1ce/0x1f0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  drm_gem_handle_create_tail+0xd4/0x1a0
Aug 26 19:08:11 balthasar kernel:  drm_gem_handle_create+0x35/0x50
Aug 26 19:08:11 balthasar kernel:  nouveau_gem_ioctl_new+0xdd/0x170 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_nouveau_gem_ioctl_new+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  drm_ioctl_kernel+0xb9/0x120
Aug 26 19:08:11 balthasar kernel:  drm_ioctl+0x301/0x5a0
Aug 26 19:08:11 balthasar kernel:  ? __pfx_nouveau_gem_ioctl_new+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nouveau_drm_ioctl+0x61/0xc0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  __x64_sys_ioctl+0xa0/0xf0
Aug 26 19:08:11 balthasar kernel:  x64_sys_call+0xa68/0x24b0
Aug 26 19:08:11 balthasar kernel:  do_syscall_64+0x80/0x170
Aug 26 19:08:11 balthasar kernel:  ? count_memcg_events.constprop.0+0x2a/0x50
Aug 26 19:08:11 balthasar kernel:  ? handle_mm_fault+0xaf/0x340
Aug 26 19:08:11 balthasar kernel:  ? do_user_addr_fault+0x18d/0x690
Aug 26 19:08:11 balthasar kernel:  ? irqentry_exit_to_user_mode+0x76/0x270
Aug 26 19:08:11 balthasar kernel:  ? irqentry_exit+0x43/0x50
Aug 26 19:08:11 balthasar kernel:  ? exc_page_fault+0x93/0x1b0
Aug 26 19:08:11 balthasar kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Aug 26 19:08:11 balthasar kernel: RIP: 0033:0x7369e0d1a94f
Aug 26 19:08:11 balthasar kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 >
Aug 26 19:08:11 balthasar kernel: RSP: 002b:00007fff63d87dc0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 26 19:08:11 balthasar kernel: RAX: ffffffffffffffda RBX: 00007fff63d87e70 RCX: 00007369e0d1a94f
Aug 26 19:08:11 balthasar kernel: RDX: 00007fff63d87e70 RSI: 00000000c0306480 RDI: 0000000000000012
Aug 26 19:08:11 balthasar kernel: RBP: 00000000c0306480 R08: 00006017eabeb010 R09: 00006017ec4f6fa0
Aug 26 19:08:11 balthasar kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 00006017eac2f910
Aug 26 19:08:11 balthasar kernel: R13: 0000000000000012 R14: 00007fff63d87e70 R15: 0000000000000900
Aug 26 19:08:11 balthasar kernel:  </TASK>
Aug 26 19:08:11 balthasar kernel: ---[ end trace 0000000000000000 ]---
Aug 26 19:08:11 balthasar kernel: nouveau 0000:01:00.0: timer: stalled at ffffffffffffffff

pjreed commented 2 months ago

I just thought I'd add that I have a System76 Gazelle and have been dealing with this issue for about a year and a half now.

Not too long after I first got this laptop, I started to have the same issue you're describing. I opened a ticket and spent a while diagnosing it with System76 support. Eventually I sent my laptop in and they replaced the mainboard; the whole round trip took a few weeks. After I got it back, the problem continued. I really couldn't afford to be without my work computer for a few more weeks, so I've just gotten used to my laptop randomly locking up when I'm away from it.

I spent a while trying to debug it and found out that this issue happens specifically when the GPU wakes up from being in a low-power mode, and running a low-power process that's constantly touching the GPU (like glxgears) seems to alleviate the issue to a degree. Without it, my laptop often locks up at least once a day, sometimes more often, and on occasion even while I'm using it; if I just leave glxgears running in a corner, it will often be fine for several days at a time, sometimes over a week.

The interesting thing is that sometime recently, I realized my laptop had been running for over three weeks straight without freezing. I suspect that nvidia-driver-550 specifically may have done something to help, because I updated to nvidia-driver-555 a week ago and the problem suddenly resumed; now I'm getting freezes regularly again.

I just tried installing nvidia-driver-550-server, and I'm running on that now. No freezes yet, but it's only been 30 minutes, so it remains to be seen if that will work as well as nvidia-driver-550. I really wish System76 would preserve at least their last few releases on their apt server...

mdbartos commented 2 months ago

I believe I have solved the issue on my machine.

I initially tried to boot a live disk but could not see the boot menu, because no video signal reached the monitor until the login screen appeared. I then tried adding 'nomodeset' to the kernel boot parameters. However, with that set, no video signal reached the monitor at all, and I thought I had bricked my computer.

After consulting the hardware manual for the Thelio Mira, I realized there was another set of dedicated HDMI/DisplayPort ports on the GPU itself. I moved my HDMI cable from the integrated graphics HDMI port to the dedicated graphics HDMI port. After this change, I got a video signal and could see the boot menu.

Moreover, after this change, the freezing issue has not returned, and I'm no longer getting warnings and errors in journalctl. I therefore upgraded to NVIDIA driver 555 again. I also installed some updated system packages from System76, but I don't think they are relevant.

tl;dr: I plugged my monitor into the dedicated graphics HDMI port instead of the integrated graphics HDMI port on my Thelio Mira. This solved the lack of video output at boot time, and the freezing issue has not returned. I am running NVIDIA driver 555 again.

mdbartos commented 2 months ago

Spoke too soon: the computer froze again, this time with no video output. The uptime was much longer this time around, though.

leviport commented 2 months ago

Since you have System76 hardware, I recommend opening a support ticket: https://support.system76.com

esplinr commented 2 months ago

I'm still seeing this happen even after installing the latest updates to the NVIDIA driver from System76.

I can reliably reproduce it by locking my screen without suspending the laptop; when the machine goes into low power mode it locks up and the fans go to full blast. After the reboot, the previous boot's log messages contain the same errors about GPU falling off the bus and nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state.

The problem is manageable by updating Settings -> Power -> Screen Blank to "Never" and shortening the time to Automatic Suspend.

Next step for me is to open a support ticket with System76.

esplinr commented 2 months ago

Support confirmed that rolling back to the 550 driver is the recommended workaround until the 560 driver is released.

I see that the 560 driver is now in the repository, so hopefully that resolves the problem.

mmstick commented 2 months ago

560 has already been released

mdbartos commented 2 months ago

Update: I ended up getting an advance replacement for my machine (Thelio Mira). I believe the issue was hardware-related, as the freezing persisted even when running a live disk. After getting a replacement, I haven't had any freezing issues.

However, I did notice that the CPU and GPU temperatures were intermittently running rather hot on the new machine (~85 C for a few seconds at a time) under the default 'Balanced' power profile when running heavy workloads. The fans also tend to speed up and slow down rather than maintain a steady level. The temperature issues and fan thrashing went away after switching to 'Power Saver'. I wonder if these intermittent high temperatures may have contributed to a hardware failure.

Vetpetmon commented 2 months ago

Having this issue since September 3, shortly after doing a fresh reinstall and updating from NVIDIA driver version 550 to 555. Persists after the update to 560. My hardware is not damaged and is not System76 hardware.

EDIT: Important details

GPU Driver version: 560.35.03
CUDA version: 12.6

Kernel: linux-image-6.9.3-76060903-generic             6.9.3-76060903.202405300957~1721174657~22.04~abb7c06         amd64        Linux kernel image for version 6.9.3 on 64 bit x86 SMP

Motherboard vendor: ASUSTeK COMPUTER INC.
Motherboard product: PRIME A320M-K

Firmware (BIOS) vendor: American Megatrends Inc.
Firmware version: 5216
Firmware date: 08/30/2019
Boot mode: uefi

Got the issue again today. I will try adding pcie_aspm=off pci=nommconf to the kernel options in /boot/efi/loader/entries.
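
After rebooting with new kernel options, it is worth confirming that they actually took effect. A minimal sketch follows; has_kernel_param is a hypothetical helper name, and the kernelstub line in the comments is the usual Pop!_OS way to add boot options, offered here as an assumption rather than something tested in this thread.

```shell
# Succeed if kernel parameter $1 appears as a whole word in the command
# line given as $2, or in the running kernel's command line when $2 is omitted.
has_kernel_param() {
  case " ${2:-$(cat /proc/cmdline)} " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage after a reboot:
#   has_kernel_param pcie_aspm=off && echo "ASPM is disabled on the kernel command line"
# On Pop!_OS the options can likely be added with (assumption, not run here):
#   sudo kernelstub -a "pcie_aspm=off pci=nommconf"
```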

At first the logs look normal. Then, while playing a Steam Proton game (after turning down the graphics settings to reduce GPU temps from ~80 C to ~58 C), things ran smoothly for about 5 minutes before everything went choppy. I could still hear audio, but my mic wasn't going through: according to a friend it wasn't choppy, "it completely died", while my display and input slowed to ~5 FPS. The mouse was affected too.

Logs looked like this:

Sep 19 16:18:02 bubz kernel: [68183.615790] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.615810] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.615818] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Sep 19 16:18:02 bubz kernel: [68183.615826] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.637883] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.637897] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:02 bubz kernel: [68183.637902] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:02 bubz kernel: [68183.637907] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.726186] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.726207] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.726214] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Sep 19 16:18:02 bubz kernel: [68183.726223] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.770080] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.770097] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:02 bubz kernel: [68183.770104] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:02 bubz kernel: [68183.770112] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.792116] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.792126] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:02 bubz kernel: [68183.792129] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:02 bubz kernel: [68183.792133] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.803143] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.803156] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:02 bubz kernel: [68183.803160] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:02 bubz kernel: [68183.803164] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.814216] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.814231] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.814236] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Sep 19 16:18:02 bubz kernel: [68183.814242] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.825183] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.825204] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.825209] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001040/00006000
Sep 19 16:18:02 bubz kernel: [68183.825215] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.825220] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.825226] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.825231] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:02 bubz kernel: [68183.825236] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.825243] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.825247] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:02 bubz kernel: [68183.825252] snd_hda_intel 0000:07:00.1:    [12] Timeout               

.... Gets more extreme....

Sep 19 16:18:04 bubz kernel: [68185.875052] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:04 bubz kernel: [68185.875237] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.875244] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000011c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.875252] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.875259] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.875265] pcieport 0000:00:03.1:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.875270] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.875295] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.875301] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:04 bubz kernel: [68185.875308] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.875450] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.875457] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:04 bubz kernel: [68185.875463] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.886092] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0
Sep 19 16:18:04 bubz kernel: [68185.886277] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.886280] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000010c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.886284] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.886287] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.886290] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.886352] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.886355] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001100/0000a000
Sep 19 16:18:04 bubz kernel: [68185.886359] nvidia 0000:07:00.0:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.886362] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.886365] nvidia 0000:07:00.0: AER:   Error of this Agent is reported first
Sep 19 16:18:04 bubz kernel: [68185.886669] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.886672] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001100/0000a000
Sep 19 16:18:04 bubz kernel: [68185.886675] snd_hda_intel 0000:07:00.1:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.886678] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.897099] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0
Sep 19 16:18:04 bubz kernel: [68185.897184] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.897192] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000011c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.897201] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.897208] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.897215] pcieport 0000:00:03.1:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.897222] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.897243] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.897251] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001100/0000a000
Sep 19 16:18:04 bubz kernel: [68185.897259] nvidia 0000:07:00.0:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.897266] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.897273] nvidia 0000:07:00.0: AER:   Error of this Agent is reported first
Sep 19 16:18:04 bubz kernel: [68185.897375] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.897382] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001100/0000a000
Sep 19 16:18:04 bubz kernel: [68185.897406] snd_hda_intel 0000:07:00.1:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.897413] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.908150] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:04 bubz kernel: [68185.908740] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.908748] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000010c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.908756] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.908762] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.908768] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.908777] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.908784] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:04 bubz kernel: [68185.908790] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.908799] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.908804] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:04 bubz kernel: [68185.908811] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.919143] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0
Sep 19 16:18:04 bubz kernel: [68185.920895] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.920900] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000010c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.920904] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.920907] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.920909] pcieport 0000:00:03.1:    [12] Timeout          

Error bits 6, 7, 8, and 12 (BadTLP, BadDLLP, Rollover, Timeout) show up more and more consistently. GDM has a "lol" moment about 8 seconds in:

Sep 19 16:18:14 bubz /usr/libexec/gdm-x-session[2026]: (EE) event3  - SINOWEALTH Game Mouse: client bug: event processing lagging behind by 376ms, your system is too slow
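
The bracketed numbers in the AER lines above (6, 7, 8, 12) are bit positions in the Correctable Error Status word printed as `error status/mask=...`, rather than separate error codes. Below is a minimal Python sketch of that decoding. The short names are the ones the kernel prints in these logs; the helper name `decode_correctable` is made up for illustration.

```python
# Bit positions in the PCIe AER Correctable Error Status register,
# labelled with the short names the kernel prints in the log lines above.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status_hex, mask_hex="0"):
    """Return names of set, unmasked status bits, lowest bit first (hypothetical helper)."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    visible = status & ~mask  # masked bits are not listed in the log
    return [name for bit, name in sorted(CORRECTABLE_BITS.items()) if (visible >> bit) & 1]


# status/mask pair copied from one of the pcieport lines above
print(decode_correctable("000011c0", "00006000"))
```

Passing the mask matters for lines such as `00003040/00006000`, where bit 13 is set but masked, so only BadTLP and Timeout are listed.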

The system sped back up after about 5000 or so error lines, but mic audio still did not go through. The speed-up was short-lived; logging calmed down, but then...

Sep 19 16:18:18 bubz kernel: [68199.145805] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.145890] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:18 bubz kernel: [68199.145894] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Sep 19 16:18:18 bubz kernel: [68199.145898] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:18 bubz kernel: [68199.146058] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:18 bubz kernel: [68199.146061] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:18 bubz kernel: [68199.146065] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:18 bubz kernel: [68199.146071] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:18 bubz kernel: [68199.146074] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:18 bubz kernel: [68199.146077] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:18 bubz kernel: [68199.146096] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146106] pcieport 0000:00:03.1: AER: found no error details for 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146110] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146122] pcieport 0000:00:03.1: AER: found no error details for 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146125] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146177] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:18 bubz kernel: [68199.146180] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:18 bubz kernel: [68199.146184] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:18 bubz kernel: [68199.146189] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0

EDIT 3: Oh, I dug further. I found ONE error with bit 14 (CmpltTO) set, and that one was uncorrectable.

Sep 19 16:18:24 bubz kernel: [68205.828117] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:24 bubz kernel: [68205.828120] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:24 bubz kernel: [68205.828123] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:24 bubz kernel: [68205.828223] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:24 bubz kernel: [68205.828225] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:24 bubz kernel: [68205.828228] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:24 bubz kernel: [68205.828302] pcieport 0000:00:03.1: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:07:00.0
Sep 19 16:18:24 bubz kernel: [68205.828423] nvidia 0000:07:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
Sep 19 16:18:24 bubz kernel: [68205.828427] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00004000/00000000
Sep 19 16:18:24 bubz kernel: [68205.828430] nvidia 0000:07:00.0:    [14] CmpltTO                (First)
Sep 19 16:18:24 bubz kernel: [68205.860406] nvidia 0000:07:00.0: AER: can't recover (no error_detected callback)
Sep 19 16:18:24 bubz kernel: [68205.860412] snd_hda_intel 0000:07:00.1: AER: can't recover (no error_detected callback)
Sep 19 16:18:24 bubz kernel: [68205.860414] xhci_hcd 0000:07:00.2: AER: can't recover (no error_detected callback)
Sep 19 16:18:24 bubz kernel: [68205.860416] pci 0000:07:00.3: AER: can't recover (no error_detected callback)
Sep 19 16:18:24 bubz kernel: [68205.860445] pcieport 0000:00:03.1: AER: device recovery failed
Sep 19 16:18:24 bubz kernel: [68205.860448] pcieport 0000:00:03.1: AER: Multiple Correctable error message receive
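
The `[14] CmpltTO` entry above comes from a different register than the correctable errors: it is bit 14 of the Uncorrectable Error Status word (`status=00004000`), which the PCIe specification names Completion Timeout. A small sketch for decoding that word follows. It covers only a subset of the defined bits, and `decode_uncorrectable` is a hypothetical helper name.

```python
# Subset of the PCIe AER Uncorrectable Error Status bits:
# bit position -> (short name printed by the kernel, name in the PCIe spec).
UNCORRECTABLE_BITS = {
    4: ("DLP", "Data Link Protocol Error"),
    5: ("SDES", "Surprise Down Error"),
    12: ("TLP", "Poisoned TLP"),
    13: ("FCP", "Flow Control Protocol Error"),
    14: ("CmpltTO", "Completion Timeout"),
    15: ("CmpltAbrt", "Completer Abort"),
    16: ("UnxCmplt", "Unexpected Completion"),
    17: ("RxOF", "Receiver Overflow"),
    18: ("MalfTLP", "Malformed TLP"),
    19: ("ECRC", "ECRC Error"),
    20: ("UnsupReq", "Unsupported Request"),
}


def decode_uncorrectable(status_hex):
    """Return (short, long) names for each known bit set in the status word."""
    status = int(status_hex, 16)
    return [names for bit, names in sorted(UNCORRECTABLE_BITS.items()) if (status >> bit) & 1]


# status value copied from the nvidia 0000:07:00.0 line above
print(decode_uncorrectable("00004000"))
```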

It picks back up, now failing to fetch details for some errors. About 16 seconds after it started, the final nail goes into the coffin, and my user session is completely dead:

Sep 19 16:18:28 bubz kernel: [68209.812231] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:28 bubz kernel: [68209.855497] NVRM: GPU at PCI:0000:07:00: GPU-73994a87-f3a9-e97c-5add-f6a9813a6033
Sep 19 16:18:28 bubz kernel: [68209.855502] NVRM: Xid (PCI:0000:07:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 19 16:18:28 bubz kernel: [68209.855518] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus.
Sep 19 16:18:28 bubz kernel: [68209.855530] NVRM: A GPU crash dump has been created. If possible, please run
Sep 19 16:18:28 bubz kernel: [68209.855530] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 19 16:18:28 bubz kernel: [68209.855530] NVRM: the NVIDIA kernel module is unloaded.
Sep 19 16:18:28 bubz kernel: [68209.855638] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:28 bubz kernel: [68209.855644] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00003040/00006000
Sep 19 16:18:28 bubz kernel: [68209.855648] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:28 bubz kernel: [68209.855652] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:28 bubz kernel: [68209.855659] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:28 bubz kernel: [68209.855679] pcieport 0000:00:03.1: AER: found no error details for 0000:00:00.0
Sep 19 16:18:28 bubz kernel: [68209.959602] xhci_hcd 0000:07:00.2: Unable to change power state from D3hot to D0, device inaccessible
Sep 19 16:18:28 bubz kernel: [68210.031632] xhci_hcd 0000:07:00.2: Unable to change power state from D3cold to D0, device inaccessible
Sep 19 16:18:28 bubz kernel: [68210.031649] xhci_hcd 0000:07:00.2: Controller not ready at resume -19
Sep 19 16:18:28 bubz kernel: [68210.031652] xhci_hcd 0000:07:00.2: PCI post-resume error -19!
Sep 19 16:18:28 bubz kernel: [68210.031656] xhci_hcd 0000:07:00.2: HC died; cleaning up
Sep 19 16:18:42 bubz gnome-shell[2161]: Window manager warning: Failed to start restart helper: Failed to execute child process “/usr/libexec/mutter-restart-helper” (No such file or directory)
Sep 19 16:18:42 bubz kernel: [68223.434374] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 19 16:18:42 bubz kernel: [68223.434389] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 19 16:18:47 bubz /usr/libexec/gdm-x-session[2026]: (WW) NVIDIA: Wait for channel idle timed out.

No recovery, besides forcing the power button.

Earlier, before updating on Sep 18, 2024, I had set boot parameters and the problem did not come back. I never had this happen in my 2023 installation of Pop!_OS.

I've tried re-seating the GPU in the PCIe slot, thinking it was just physically loose. Having confirmed that the GPU and PCIe ports work fine on Windows and that the GPU is securely seated in the slot, I suspect a kernel issue, specifically one triggered when going from a high-power state to a low-power state.
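
Since the suspicion is about link power-state transitions, it may help to record which ASPM policy the kernel is using when an incident occurs. On kernels built with ASPM support, `/sys/module/pcie_aspm/parameters/policy` lists the available policies with the active one in square brackets. Here is a hedged sketch; the function names are hypothetical, and the sample string only illustrates the file's format and was not read from this machine.

```python
import re
from pathlib import Path

ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_aspm_policy(path=ASPM_POLICY_PATH):
    """Read the policy file if it exists; return None on systems without it."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None


# Illustrative sample of the file's format (assumption, not output from this thread)
print(active_aspm_policy("[default] performance powersave powersupersave"))
```

Calling `read_aspm_policy()` before and after adding a boot parameter would show whether the policy actually changed.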

Implicated PCI:

These are my results from sudo lspci -v:

Bridge:

00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0, IRQ 27, IOMMU group 4
    Bus: primary=00, secondary=07, subordinate=07, sec-latency=0
    I/O behind bridge: 0000e000-0000efff [size=4K]
    Memory behind bridge: f5000000-f60fffff [size=17M]
    Prefetchable memory behind bridge: 00000000e0000000-00000000f20fffff [size=289M]
    Capabilities: [50] Power Management version 3
    Capabilities: [58] Express Root Port (Slot+), MSI 00
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [c0] Subsystem: ASUSTeK Computer Inc. Family 17h (Models 00h-0fh) PCIe GPP Bridge
    Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [270] Secondary PCI Express
    Capabilities: [2a0] Access Control Services
    Capabilities: [370] L1 PM Substates
    Kernel driver in use: pcieport

GPU:

07:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: Micro-Star International Co., Ltd. [MSI] TU116 [GeForce GTX 1660]
    Flags: bus master, fast devsel, latency 0, IRQ 68, IOMMU group 13
    Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Memory at f0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at e000 [size=128]
    Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Legacy Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

EDIT 2: 07:00.2 (USB 3.1 Host Controller) also errored out after the GPU fell off of the bus. No idea what this component does, but it only shows up AFTER the userspace is terminated non-gracefully. It's the part of the GPU that reports the GPU is completely inaccessible after the power state change.

07:00.2 USB controller: NVIDIA Corporation TU116 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
    Subsystem: Micro-Star International Co., Ltd. [MSI] TU116 USB 3.1 Host Controller
    Flags: fast devsel, IRQ 51, IOMMU group 13
    Memory at f2000000 (64-bit, prefetchable) [size=256K]
    Memory at f2040000 (64-bit, prefetchable) [size=64K]
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [b4] Power Management version 3
    Capabilities: [100] Advanced Error Reporting
    Kernel driver in use: xhci_hcd
    Kernel modules: xhci_pci
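
To see at a glance which functions sit behind the bridge and which kernel driver owns each one, the `lspci -v` text can be reduced to a slot-to-driver map. This is a rough sketch that assumes the layout shown above (a slot address at the start of each device line and an indented `Kernel driver in use:` line). `drivers_by_slot` is a made-up name, and the sample is trimmed from the listings in this post.

```python
import re

SLOT_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
DRIVER_RE = re.compile(r"^\s+Kernel driver in use: (\S+)")


def drivers_by_slot(lspci_text):
    """Map each PCI slot address to its bound kernel driver (None if unbound)."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        slot = SLOT_RE.match(line)
        if slot:
            current = slot.group(1)
            result[current] = None
            continue
        driver = DRIVER_RE.match(line)
        if driver and current is not None:
            result[current] = driver.group(1)
    return result


# Trimmed copy of the three listings above
SAMPLE = """\
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
    Capabilities: [150] Advanced Error Reporting
    Kernel driver in use: pcieport

07:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660] (rev a1) (prog-if 00 [VGA controller])
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

07:00.2 USB controller: NVIDIA Corporation TU116 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
    Kernel driver in use: xhci_hcd
    Kernel modules: xhci_pci
"""

print(drivers_by_slot(SAMPLE))
```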

If anyone else has issues like I do, please open /var/log/syslog immediately after rebooting.
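
A syslog from a crash like this can run to thousands of lines, so a small filter makes it easier to post only the relevant part. The sketch below keeps lines that contain markers seen in the excerpts in this thread. The pattern list is an assumption to extend as needed, and `relevant_lines` is a hypothetical helper.

```python
import re

# Substrings seen in the log excerpts in this thread; extend as needed.
RELEVANT = re.compile(
    r"NVRM: Xid|fallen off the bus|AER:|PCIe Bus Error|pciehp:|device inaccessible"
)


def relevant_lines(lines):
    """Keep only lines that mention the GPU/PCIe failure markers."""
    return [line.rstrip("\n") for line in lines if RELEVANT.search(line)]


# Usage on a real machine (reading the file may need elevated permissions):
#   with open("/var/log/syslog", errors="replace") as log:
#       for line in relevant_lines(log):
#           print(line)
SAMPLE = [
    "Sep 19 16:18:28 bubz kernel: [68209.855502] NVRM: Xid (PCI:0000:07:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.\n",
    "Sep 27 10:10:19 bubz steam[5558]: Uploaded AppInterfaceStats to Steam\n",
    "Sep 19 16:18:24 bubz kernel: [68205.860445] pcieport 0000:00:03.1: AER: device recovery failed\n",
]
print(relevant_lines(SAMPLE))
```

The `errors="replace"` argument is there because a log cut off by a hard power-off can contain bytes that do not decode cleanly.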

Vetpetmon commented 2 months ago

Please push out a fix in a newer kernel version! (Or instructions to get the August 2023 (or Q1/Q2 2024) kernel back would be appreciated. I unfortunately lost those while doing a fresh reinstall of Pop!_OS in August 2024, and that kernel version worked amazingly well with my hardware!) EDIT: linux-generic and all other related kernel components just showed up as upgradable. I cannot wait to see if this is fixed tomorrow morning!

I just checked my kernel version and found a 1:1 match with the initial issue poster's kernel version. I've had random Xid 79 errors with and without the PCIe bridge errors, so this can be replicated.

My GPU is an MSI GTX 1660. The issue appears to stem from the pcieport kernel driver, as that's where the error logging starts, and the reported errors are consistent with failures at the PCIe bridge. The GPU can be at 55 C and still fall off the bus, so I don't suspect thermals. Often, though, closing a demanding program (or even turning the graphics settings down from medium to low) runs the risk of the bridge getting a bad TLP, even at a GPU temp of 65 C, and system/session stability devolves from there.

Additionally, I suspect this might be somewhat related to this kernel version, but will need further testing to see if GNOME crashes from suspend put the system in a state unstable enough to cause bad TLPs, even after restarting GDM: https://github.com/pop-os/pop/issues/3254#issuecomment-2358686624

Vetpetmon commented 2 months ago

New log. With pci=nommconf added to boot parameters. No more wall of PCIe errors, but can confirm that this is not related to thermals or GPU utilization.

This happened while typing in a web browser (to be specific, shortly after switching windows and clicking into a text box; the freeze happened after typing -). Utilization was as low as it could get. I could not route into the PC or type anything in a terminal to get the bug report. It happened shortly (7 minutes) after waking from suspend for the 3rd time. The GPU is seated properly.

Sep 22 12:16:41 bubz kernel: [94962.028307] xhci_hcd 0000:07:00.2: Unable to change power state from D3hot to D0, device inaccessible
Sep 22 12:16:41 bubz kernel: [94962.088694] xhci_hcd 0000:07:00.2: Unable to change power state from D3cold to D0, device inaccessible
Sep 22 12:16:41 bubz kernel: [94962.088706] xhci_hcd 0000:07:00.2: Controller not ready at resume -19
Sep 22 12:16:41 bubz kernel: [94962.088709] xhci_hcd 0000:07:00.2: PCI post-resume error -19!
Sep 22 12:16:41 bubz kernel: [94962.088713] xhci_hcd 0000:07:00.2: HC died; cleaning up
Sep 22 12:16:52 bubz kernel: [94972.697458] NVRM: GPU at PCI:0000:07:00: GPU-73994a87-f3a9-e97c-5add-f6a9813a6033
Sep 22 12:16:52 bubz kernel: [94972.697464] NVRM: Xid (PCI:0000:07:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 22 12:16:52 bubz kernel: [94972.697476] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus.
Sep 22 12:16:52 bubz kernel: [94972.697484] NVRM: A GPU crash dump has been created. If possible, please run
Sep 22 12:16:52 bubz kernel: [94972.697484] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 22 12:16:52 bubz kernel: [94972.697484] NVRM: the NVIDIA kernel module is unloaded.
Sep 22 12:16:52 bubz firefox.desktop[2610]: [GFX1-]: Detect DeviceReset DeviceResetReason::RESET DeviceResetDetectPlace::WR_POST_UPDATE in Parent process
Sep 22 12:16:52 bubz firefox.desktop[2610]: [GFX1-]: Failed to make render context current during destroying.
Sep 22 12:16:52 bubz kernel: [94972.721916] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 22 12:16:52 bubz kernel: [94972.721935] nvidia-modeset: ERROR: GPU:0: Failed to quer
Sep 22 12:17:40 bubz systemd-modules-load[423]: Inserted module 'lp'

How does this keep happening?

EDIT: I am now trying pcie_aspm=off and have also updated my UEFI firmware from 220 to 371. We'll see if this fixes it.
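
When testing boot parameters such as pci=nommconf or pcie_aspm=off, it is worth confirming that the running kernel actually received them. `/proc/cmdline` holds the command line the kernel was booted with. Below is a minimal parsing sketch; `kernel_params` is a hypothetical helper, and the sample command line is illustrative only, not copied from this machine.

```python
def kernel_params(cmdline):
    """Split a kernel command line into {name: [values]}; bare flags map to [None]."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params.setdefault(key, []).append(value if sep else None)
    return params


# Usage on a real machine:
#   with open("/proc/cmdline") as f:
#       params = kernel_params(f.read())
# Illustrative command line (assumption):
params = kernel_params("quiet splash pci=nommconf pcie_aspm=off")
print(params.get("pci"), params.get("pcie_aspm"))
```

Values are kept as a list because the same parameter name may appear more than once on the command line.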

DekosAnjo commented 2 months ago

This is happening to me too, but only when I'm on Nvidia-only mode. The log is the same, but instead of a black screen it freezes.

Vetpetmon commented 2 months ago

This is happening to me too, but only when I'm on Nvidia-only mode. The log is the same, but instead of a black screen it freezes.

So I'm not the only one here going absolutely INSANE trying to fix this by disabling aspm and mmconf (and eventually trying pci=nomsi) in the boot parameters! Can you please provide the kernel version via uname -a?

Additionally, please post your syslog from /var/log/syslog! I believe this is an error with the pcieport kernel driver!

Vetpetmon commented 2 months ago

I'm really convinced this is a drivers issue, specifically the pcieport drivers provided by the current 6.9.3-jammy kernel, and have heard nothing about a fix. I'll see about refreshing my install from the USB, which hasn't updated to newer packages, or just trying the COSMIC Epoch Alpha 2, whichever happens first. If nothing works, I guess I will have to switch to mainline Ubuntu or try Arch again.

DekosAnjo commented 2 months ago

uname -a:

Linux xxx 6.9.3-76060903-generic #202405300957~1726766035~22.04~4092a0e SMP PREEMPT_DYNAMIC Thu S x86_64 x86_64 x86_64 GNU/Linux

syslog:

Sep 23 18:37:03 xxx kernel: [   34.193735] pcieport 0000:00:01.1: pciehp: Slot(0): Link Down
Sep 23 18:37:03 xxx kernel: [   34.193743] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
Sep 23 18:37:03 xxx kernel: [   34.193750] NVRM: GPU at PCI:0000:01:00: GPU-f8600eb8-f1df-acfb-fad9-d24be732714f
Sep 23 18:37:03 xxx kernel: [   34.193759] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 23 18:37:03 xxx kernel: [   34.193770] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Sep 23 18:37:03 xxx /usr/libexec/gdm-x-session[3434]: (WW) NVIDIA: Failed to bind sideband socket to
Sep 23 18:37:03 xxx /usr/libexec/gdm-x-session[3434]: (WW) NVIDIA:     '/var/run/nvidia-xdriver-62d320f4' Permission denied
Sep 23 18:37:03 xxx /usr/libexec/gdm-x-session[3434]: (II) NVIDIA: Reserving 24576.00 MB of virtual memory for indirect memory
Sep 23 18:37:03 xxx /usr/libexec/gdm-x-session[3434]: (II) NVIDIA:     access.
Sep 23 18:37:03 xxx /usr/libexec/gdm-x-session[3434]: (WW) NVIDIA(GPU-0): Failed to enter interactive mode.
Sep 23 18:37:03 xxx /usr/libexec/gdm-x-session[3434]: (EE) NVIDIA(GPU-0): Push buffer DMA allocation failed
Sep 23 18:37:03 xxx /usr/libexec/gdm-x-session[3434]: (EE) NVIDIA(0): Failed to allocate push buffer
Sep 23 18:37:03 xxx kernel: [   34.195228] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67d:0:0:0x0000000f
Sep 23 18:37:03 xxx kernel: [   34.195263] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:0:0:0x0000000f
Sep 23 18:37:03 xxx kernel: [   34.195283] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:1:0:0x0000000f
Sep 23 18:37:03 xxx kernel: [   34.195308] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:2:0:0x0000000f
Sep 23 18:37:03 xxx kernel: [   34.195327] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:3:0:0x0000000f
Sep 23 18:37:03 xxx kernel: [   34.195356] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f
Sep 23 18:37:03 xxx kernel: [   34.195375] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:5:0:0x0000000f
Sep 23 18:37:03 xxx kernel: [   34.195404] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
Sep 23 18:37:03 xxx kernel: [   34.195423] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:7:0:0x0000000f
Sep 23 18:37:05 xxx /usr/bin/nvidia-powerd[1566]: Failed to get topology status f
Sep 23 18:37:07 xxx /usr/bin/nvidia-powerd[1566]: Failed to get topology status f
Sep 23 18:37:08 xxx kernel: [   39.195778] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4040
Sep 23 18:37:09 xxx /usr/bin/nvidia-powerd[1566]: Failed to get topology status f
Sep 23 18:37:09 xxx kernel: [   39.955771] NVRM: Error in service of callback
Sep 23 18:37:11 xxx /usr/bin/nvidia-powerd[1566]: Failed to get topology status f
Sep 23 18:37:11 xxx pop-system-updater[3457]:  INFO pop_system_updater::service::session: Restarting service in 5s because DISPLAY could not be found
Sep 23 18:37:13 xxx /usr/bin/nvidia-powerd[1566]: Failed to get topology status f
Sep 23 18:37:13 xxx kernel: [   44.195790] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4040
Sep 23 18:37:15 xxx /usr/bin/nvidia-powerd[1566]: Failed to get topology status f
Sep 23 18:37:17 xxx /usr/bin/nvidia-powerd[1566]: Failed to get topology status f
Sep 23 18:37:18 xxx kernel: [   49.195802] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4040
^@^@^@^@^......

Vetpetmon commented 2 months ago

My uname -a results as of today:

Linux bubz 6.9.3-76060903-generic #202405300957~1726766035~22.04~4092a0e SMP PREEMPT_DYNAMIC Thu S x86_64 x86_64 x86_64 GNU/Linux

@DekosAnjo Somewhat different motherboard, I'm presuming, so the error reports look different, but the same thing happens (GPU falls off the bus, display freezes on the frame, no mouse input works, and the system does not respond to any keyboard input, meaning a REISUB cannot be performed).

pcieport 0000:00:01.1: pciehp: Slot(0): Link Down is similar or equivalent to my:

Sep 19 16:18:04 bubz kernel: [68185.919143] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0
Sep 19 16:18:04 bubz kernel: [68185.920895] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.920900] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000010c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.920904] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.920907] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.920909] pcieport 0000:00:03.1:    [12] Timeout 

Seems like the drivers provided by our kernel are bad.

Further testing has shown that a downgrade back to older kernel versions works well. I have NOT tried COSMIC Epoch 1 or Ubuntu 24.04 (6.11 kernel versions). Ubuntu 22.04's kernels (albeit more dated than Pop 22.04's) work on my system just fine.

Two theories

If my intuition and my degree are actually worth something, and my extensive work trying to recreate this damned pest of a kernel bug has yielded something useful, my first theory is that the kernel pcieport drivers are experiencing a memory leak. These are the factors that line up with this hypothesis:

  1. The longer the uptime, the more likely this will happen
  2. The more the GPU is utilized, the higher the risk, though only by a small amount.
  3. Each suspend (sleep) roughly doubles the likelihood of the error appearing. After the 2nd wake-up, I got the error in under 10 minutes.
  4. Doing a cold boot (after being powered off for at least 30 minutes) drastically reduces the risk of the error appearing.
  5. Doing a reboot right after the GPU falls off the bus from the pcieport driver error can almost guarantee (75% likelihood) the error happens again within 15 minutes or less.
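
Factor 1 can be checked against logs directly, because the bracketed kernel timestamp on each Xid line is the uptime in seconds at that moment. Here is a minimal sketch that pulls that value out of syslog lines. `xid_uptimes` is a hypothetical helper, and the sample lines are the three Xid 79 lines quoted earlier in this thread (two machines).

```python
import re

# Matches e.g. "kernel: [68209.855502] NVRM: Xid (PCI:0000:07:00): 79,"
XID_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] NVRM: Xid \(PCI:[0-9a-f:]+\): (\d+),")


def xid_uptimes(lines, xid=79):
    """Return the uptime in seconds at which each matching Xid was logged."""
    uptimes = []
    for line in lines:
        match = XID_RE.search(line)
        if match and int(match.group(2)) == xid:
            uptimes.append(float(match.group(1)))
    return uptimes


SAMPLE = [
    "Sep 19 16:18:28 bubz kernel: [68209.855502] NVRM: Xid (PCI:0000:07:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "Sep 22 12:16:52 bubz kernel: [94972.697464] NVRM: Xid (PCI:0000:07:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "Sep 23 18:37:03 xxx kernel: [   34.193759] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
]

for seconds in xid_uptimes(SAMPLE):
    print(f"{seconds:.2f} s uptime ({seconds / 3600:.2f} h)")
```

Collecting these values over many incidents, along with the number of suspend cycles since boot, would give something concrete to compare against factors 1 and 3.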

My second theory comes thanks to your logs (I cannot state this enough; @DekosAnjo, you are awesome!). I was able to do even more searching (searching pciehp: Slot(0): Link Down on DuckDuckGo) and pulled up better results. From those results, it appears that something is very wrong with the hotplug functionality in the kernel and its drivers. Monitoring the guts of my PC has been very easy thanks to the glass siding, and I found absolutely no abnormal movement in the 60 FPS iPhone video when an incident occurs. Something tells me a hotplug signal is being sent out in error, and then the kernel thinks the GPU was unplugged when it was not.
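
One way to test the hotplug theory against a log is to check whether a pciehp Link Down message is logged just before each Xid 79. The sketch below pairs the two by kernel timestamp. The one-second window and the name `link_down_before_xid` are assumptions, and the sample lines are copied from the syslog posted above.

```python
import re

TS = r"kernel: \[\s*(\d+\.\d+)\] "
LINK_DOWN_RE = re.compile(TS + r"pcieport (\S+): pciehp: Slot\((\d+)\): Link Down")
XID79_RE = re.compile(TS + r"NVRM: Xid \(PCI:[0-9a-f:]+\): 79,")


def link_down_before_xid(lines, window=1.0):
    """Return (port, delay_seconds) for each Link Down seen at most `window` s before an Xid 79."""
    downs = []  # (timestamp, port) of every Link Down seen so far
    pairs = []
    for line in lines:
        match = LINK_DOWN_RE.search(line)
        if match:
            downs.append((float(match.group(1)), match.group(2)))
            continue
        match = XID79_RE.search(line)
        if match:
            t_xid = float(match.group(1))
            for t_down, port in downs:
                if 0 <= t_xid - t_down <= window:
                    pairs.append((port, round(t_xid - t_down, 6)))
    return pairs


SAMPLE = [
    "Sep 23 18:37:03 xxx kernel: [   34.193735] pcieport 0000:00:01.1: pciehp: Slot(0): Link Down",
    "Sep 23 18:37:03 xxx kernel: [   34.193743] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present",
    "Sep 23 18:37:03 xxx kernel: [   34.193759] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
]
print(link_down_before_xid(SAMPLE))
```

A log where Xid 79 shows up with no preceding Link Down, as in the bubz excerpts, returns an empty list, which would argue against a hotplug event being the trigger on that machine.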

I've ruled out all possible hardware errors and I've ruled out NVIDIA's drivers; it is for certain the pcieport drivers now.

DekosAnjo commented 2 months ago

It's an M6500R notebook from ASUS.

Vetpetmon commented 1 month ago

Five minutes after closing a game:

Sep 27 10:10:19 bubz steam[5558]: Uploaded AppInterfaceStats to Steam
Sep 27 10:10:19 bubz steam[5558]: Removing process 10147 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9829 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9826 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9824 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9810 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9588 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9534 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9530 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9513 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9499 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9493 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9480 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9470 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9467 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9465 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9462 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9461 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9460 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9459 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9358 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9357 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9356 for gameID 2074920
Sep 27 10:10:19 bubz steam[5558]: Removing process 9355 for gameID 2074920
Sep 27 10:10:21 bubz steam[5558]: (process:9652): GLib-GObject-CRITICAL **: 10:10:21.136: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Sep 27 10:10:31 bubz steam[5558]: reaping pid: 9652 -- gameoverlayui
Sep 27 10:10:46 bubz steam[5558]: src/tier1/fileio.cpp (2385) : cbToRead <= 1024 * 1024 * 1024
Sep 27 10:10:46 bubz steam[5558]: src/tier1/fileio.cpp (2385) : cbToRead <= 1024 * 1024 * 1024
Sep 27 10:10:46 bubz steam[5558]: assert_20240927101046_132.dmp[11123]: Uploading dump (out-of-process)
Sep 27 10:10:46 bubz steam[5558]: /tmp/dumps/assert_20240927101046_132.dmp
Sep 27 10:10:46 bubz assert_20240927101046_132.dmp[11123]: Uploading dump (out-of-process)#012/tmp/dumps/assert_20240927101046_132.dmp
Sep 27 10:10:46 bubz steam[5558]: src/tier1/fileio.cpp (2385) : cbToRead <= 1024 * 1024 * 1024
Sep 27 10:10:46 bubz steam[5558]: message repeated 7 times: [ src/tier1/fileio.cpp (2385) : cbToRead <= 1024 * 1024 * 1024]
Sep 27 10:10:47 bubz assert_20240927101046_132.dmp[11123]: Finished uploading minidump (out-of-process): success = yes
Sep 27 10:10:47 bubz assert_20240927101046_132.dmp[11123]: response: CrashID=bp-4f386166-614c-4640-bb2a-ba8bf2240927
Sep 27 10:10:47 bubz steam[5558]: assert_20240927101046_132.dmp[11123]: Finished uploading minidump (out-of-process): success = yes
Sep 27 10:10:47 bubz steam[5558]: assert_20240927101046_132.dmp[11123]: response: CrashID=bp-4f386166-614c-4640-bb2a-ba8bf2240927
Sep 27 10:10:47 bubz steam[5558]: assert_20240927101046_132.dmp[11123]: file ''/tmp/dumps/assert_20240927101046_132.dmp'', upload yes: ''CrashID=bp-4f386166-614c-4640-bb2a-ba8bf2240927''
Sep 27 10:10:47 bubz assert_20240927101046_132.dmp[11123]: file ''/tmp/dumps/assert_20240927101046_132.dmp'', upload yes: ''CrashID=bp-4f386166-614c-4640-bb2a-ba8bf2240927''
Sep 27 10:11:37 bubz systemd[2123]: Started VTE child process 11154 launched by gnome-terminal-server process 7684.
Sep 27 10:13:53 bubz systemd[1]: Starting Refresh fwupd metadata and update motd...
Sep 27 10:13:53 bubz systemd[1]: fwupd-refresh.service: Deactivated successfully.
Sep 27 10:13:53 bubz systemd[1]: Finished Refresh fwupd metadata and update motd.
Sep 27 10:15:22 bubz kernel: [ 3471.854231] xhci_hcd 0000:07:00.2: Unable to change power state from D3hot to D0, device inaccessible
Sep 27 10:15:23 bubz kernel: [ 3471.914676] xhci_hcd 0000:07:00.2: Unable to change power state from D3cold to D0, device inaccessible
Sep 27 10:15:23 bubz kernel: [ 3471.914687] xhci_hcd 0000:07:00.2: Controller not ready at resume -19
Sep 27 10:15:23 bubz kernel: [ 3471.914690] xhci_hcd 0000:07:00.2: PCI post-resume error -19!
Sep 27 10:15:23 bubz kernel: [ 3471.914693] xhci_hcd 0000:07:00.2: HC died; cleaning up
Sep 27 10:15:23 bubz kernel: [ 3472.430220] NVRM: GPU at PCI:0000:07:00: GPU-73994a87-f3a9-e97c-5add-f6a9813a6033
Sep 27 10:15:23 bubz kernel: [ 3472.430227] NVRM: Xid (PCI:0000:07:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 27 10:15:23 bubz kernel: [ 3472.430242] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus.
Sep 27 10:15:23 bubz kernel: [ 3472.430251] NVRM: A GPU crash dump has been created. If possible, please run
Sep 27 10:15:23 bubz kernel: [ 3472.430251] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 27 10:15:23 bubz kernel: [ 3472.430251] NVRM: the NVIDIA kernel module is unloaded.
Sep 27 10:15:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0801, 0x00013c88, 0x000655cc)
Sep 27 10:15:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0801, 0x00013c88, 0x000655cc)
Sep 27 10:15:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0800, 0x00013c88, 0x000655cc)
Sep 27 10:15:42 bubz steam[5558]: src/clientdll/steamengine.cpp (2764) : Assertion Failed: CSteamEngine::BMainLoop appears to have stalled > 15 seconds without event signalled
Sep 27 10:15:42 bubz steam[5558]: src/clientdll/steamengine.cpp (2764) : Assertion Failed: CSteamEngine::BMainLoop appears to have stalled > 15 seconds without event signalled
Sep 27 10:15:42 bubz steam[5558]: assert_20240927101542_135.dmp[11265]: Uploading dump (out-of-process)
Sep 27 10:15:42 bubz steam[5558]: /tmp/dumps/assert_20240927101542_135.dmp
Sep 27 10:15:42 bubz assert_20240927101542_135.dmp[11265]: Uploading dump (out-of-process)#012/tmp/dumps/assert_20240927101542_135.dmp
Sep 27 10:15:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0800, 0x00013c88, 0x000655cc)
Sep 27 10:15:44 bubz assert_20240927101542_135.dmp[11265]: Finished uploading minidump (out-of-process): success = yes
Sep 27 10:15:44 bubz assert_20240927101542_135.dmp[11265]: response: CrashID=bp-dda71263-0177-44dc-8ad7-9711c2240927
Sep 27 10:15:44 bubz steam[5558]: assert_20240927101542_135.dmp[11265]: Finished uploading minidump (out-of-process): success = yes
Sep 27 10:15:44 bubz steam[5558]: assert_20240927101542_135.dmp[11265]: response: CrashID=bp-dda71263-0177-44dc-8ad7-9711c2240927
Sep 27 10:15:44 bubz steam[5558]: assert_20240927101542_135.dmp[11265]: file ''/tmp/dumps/assert_20240927101542_135.dmp'', upload yes: ''CrashID=bp-dda71263-0177-44dc-8ad7-9711c2240927''
Sep 27 10:15:44 bubz assert_20240927101542_135.dmp[11265]: file ''/tmp/dumps/assert_20240927101542_135.dmp'', upload yes: ''CrashID=bp-dda71263-0177-44dc-8ad7-9711c2240927''
Sep 27 10:15:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0802, 0x00013c88, 0x00065678)
Sep 27 10:15:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0802, 0x00013c88, 0x00065678)
Sep 27 10:15:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0801, 0x00013c88, 0x00065678)
Sep 27 10:16:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0801, 0x00013c88, 0x00065678)
Sep 27 10:16:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0803, 0x00013c88, 0x0006571c)
Sep 27 10:16:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0803, 0x00013c88, 0x0006571c)
Sep 27 10:16:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0802, 0x00013c88, 0x0006571c)
Sep 27 10:16:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0802, 0x00013c88, 0x0006571c)
Sep 27 10:16:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0804, 0x00013c88, 0x000657c8)
Sep 27 10:16:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0804, 0x00013c88, 0x000657c8)
Sep 27 10:16:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0803, 0x00013c88, 0x000657c8)
Sep 27 10:16:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0803, 0x00013c88, 0x000657c8)
Sep 27 10:16:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0805, 0x00013c88, 0x0006586c)
Sep 27 10:16:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0805, 0x00013c88, 0x0006586c)
Sep 27 10:16:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0804, 0x00013c88, 0x0006586c)
Sep 27 10:17:01 bubz CRON[11287]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep 27 10:17:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0804, 0x00013c88, 0x0006586c)
Sep 27 10:17:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0806, 0x00013c88, 0x00065918)
Sep 27 10:17:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0806, 0x00013c88, 0x00065918)
Sep 27 10:17:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0805, 0x00013c88, 0x00065918)
Sep 27 10:17:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0805, 0x00013c88, 0x00065918)
Sep 27 10:17:23 bubz kernel: [ 3592.297336] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:17:23 bubz kernel: [ 3592.297351] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:17:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0807, 0x00013c88, 0x0006b1e4)
Sep 27 10:17:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0807, 0x00013c88, 0x0006b1e4)
Sep 27 10:17:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0806, 0x00013c88, 0x0006b1e4)
Sep 27 10:17:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0806, 0x00013c88, 0x0006b1e4)
Sep 27 10:17:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0808, 0x00013c88, 0x0006b290)
Sep 27 10:17:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0808, 0x00013c88, 0x0006b290)
Sep 27 10:17:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0807, 0x00013c88, 0x0006b290)
Sep 27 10:18:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0807, 0x00013c88, 0x0006b290)
Sep 27 10:18:03 bubz FFPWA-01J3VMRENE1PASWHSFQAGPMBTF.desktop[4725]: [GFX1-]: Detect DeviceReset DeviceResetReason::RESET DeviceResetDetectPlace::WR_POST_UPDATE in Parent process
Sep 27 10:18:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0809, 0x00013c88, 0x00075350)
Sep 27 10:18:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0809, 0x00013c88, 0x00075350)
Sep 27 10:18:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0808, 0x00013c88, 0x00075350)
Sep 27 10:18:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0808, 0x00013c88, 0x00075350)
Sep 27 10:18:23 bubz steam[5558]: src/common/pipes.cpp (883) : fatal stalled cross-thread pipe.
Sep 27 10:18:23 bubz steam[5558]: src/common/pipes.cpp (883) : fatal stalled cross-thread pipe.
Sep 27 10:18:23 bubz steam[5558]: src/common/pipes.cpp (883) : Fatal assert; application exiting
Sep 27 10:18:23 bubz steam[5558]: src/common/pipes.cpp (883) : Fatal assert; application exiting
Sep 27 10:18:23 bubz steam[5558]: 09/27 10:18:23 Init: Installing breakpad exception handler for appid(steam)/version(1726604483)/tid(5584)
Sep 27 10:18:24 bubz steam[5558]: assert_20240927101823_138.dmp[11337]: Uploading dump (out-of-process)
Sep 27 10:18:24 bubz steam[5558]: /tmp/dumps/assert_20240927101823_138.dmp
Sep 27 10:18:24 bubz assert_20240927101823_138.dmp[11337]: Uploading dump (out-of-process)#012/tmp/dumps/assert_20240927101823_138.dmp
Sep 27 10:18:25 bubz assert_20240927101823_138.dmp[11337]: Finished uploading minidump (out-of-process): success = yes
Sep 27 10:18:25 bubz assert_20240927101823_138.dmp[11337]: response: CrashID=bp-39a87953-c883-43f5-bce3-ca4342240927
Sep 27 10:18:25 bubz steam[5558]: assert_20240927101823_138.dmp[11337]: Finished uploading minidump (out-of-process): success = yes
Sep 27 10:18:25 bubz steam[5558]: assert_20240927101823_138.dmp[11337]: response: CrashID=bp-39a87953-c883-43f5-bce3-ca4342240927
Sep 27 10:18:25 bubz steam[5558]: assert_20240927101823_138.dmp[11337]: file ''/tmp/dumps/assert_20240927101823_138.dmp'', upload yes: ''CrashID=bp-39a87953-c883-43f5-bce3-ca4342240927''
Sep 27 10:18:25 bubz assert_20240927101823_138.dmp[11337]: file ''/tmp/dumps/assert_20240927101823_138.dmp'', upload yes: ''CrashID=bp-39a87953-c883-43f5-bce3-ca4342240927''
Sep 27 10:18:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080a, 0x00013c88, 0x000753fc)
Sep 27 10:18:30 bubz systemd[2123]: app-gnome-steam-5445.scope: Consumed 4h 58min 1.539s CPU time.
Sep 27 10:18:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080a, 0x00013c88, 0x000753fc)
Sep 27 10:18:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0809, 0x00013c88, 0x000753fc)
Sep 27 10:18:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0809, 0x00013c88, 0x000753fc)
Sep 27 10:18:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080b, 0x00013c88, 0x000754a0)
Sep 27 10:18:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080b, 0x00013c88, 0x000754a0)
Sep 27 10:18:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080a, 0x00013c88, 0x000754a0)
Sep 27 10:19:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080a, 0x00013c88, 0x000754a0)
Sep 27 10:19:03 bubz kernel: [ 3692.308881] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:03 bubz kernel: [ 3692.308897] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:06 bubz /usr/libexec/gdm-x-session[2164]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00013c88, 0x00078040)
Sep 27 10:19:13 bubz /usr/libexec/gdm-x-session[2164]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00013c88, 0x00078040)
Sep 27 10:19:13 bubz kernel: [ 3702.310009] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.310027] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.310995] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.311007] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.311347] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.311358] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.311563] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.311573] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.311748] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.311761] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.312134] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.312145] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.312324] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.312334] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.312591] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.312601] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.312921] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.312931] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.313188] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.313198] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.313988] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.313998] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.314546] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.314556] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.315204] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.315215] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.315488] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.315498] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.323956] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.323967] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.324058] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:19:13 bubz kernel: [ 3702.324067] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:19:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080c, 0x00013c88, 0x0007cd9c)
Sep 27 10:19:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080c, 0x00013c88, 0x0007cd9c)
Sep 27 10:19:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080b, 0x00013c88, 0x0007cd9c)
Sep 27 10:19:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080b, 0x00013c88, 0x0007cd9c)
Sep 27 10:19:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080d, 0x00013c88, 0x0007ce48)
Sep 27 10:19:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080d, 0x00013c88, 0x0007ce48)
Sep 27 10:19:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080c, 0x00013c88, 0x0007ce48)
Sep 27 10:19:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080c, 0x00013c88, 0x0007ce48)
Sep 27 10:19:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080e, 0x00013c88, 0x0007cef4)
Sep 27 10:20:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080e, 0x00013c88, 0x0007cef4)
Sep 27 10:20:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080d, 0x00013c88, 0x0007cef4)
Sep 27 10:20:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080d, 0x00013c88, 0x0007cef4)
Sep 27 10:20:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080f, 0x00013c88, 0x00003388)
Sep 27 10:20:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080f, 0x00013c88, 0x00003388)
Sep 27 10:20:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080e, 0x00013c88, 0x00003388)
Sep 27 10:20:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080e, 0x00013c88, 0x00003388)
Sep 27 10:20:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0810, 0x00013c88, 0x0000342c)
Sep 27 10:20:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0810, 0x00013c88, 0x0000342c)
Sep 27 10:20:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x080f, 0x00013c88, 0x0000342c)
Sep 27 10:20:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x080f, 0x00013c88, 0x0000342c)
Sep 27 10:20:53 bubz kernel: [ 3802.336137] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.336154] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.336326] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.336337] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.345740] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.345753] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.346496] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.346527] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.347138] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.347162] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.347475] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.347503] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:20:53 bubz FFPWA-01J3VMRENE1PASWHSFQAGPMBTF.desktop[4725]: Unflushed glGetGraphicsResetStatus: 0x92bb
Sep 27 10:20:53 bubz kernel: [ 3802.359133] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:20:53 bubz kernel: [ 3802.359148] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:20:53 bubz gnome-shell[2336]: Window manager warning: Failed to start restart helper: Failed to execute child process “/usr/libexec/mutter-restart-helper” (No such file or directory)
Sep 27 10:20:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0811, 0x00013c88, 0x000095a0)
Sep 27 10:21:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0811, 0x00013c88, 0x000095a0)
Sep 27 10:21:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0810, 0x00013c88, 0x000095a0)
Sep 27 10:21:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0810, 0x00013c88, 0x000095a0)
Sep 27 10:21:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0812, 0x00013c88, 0x00009644)
Sep 27 10:21:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0812, 0x00013c88, 0x00009644)
Sep 27 10:21:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0811, 0x00013c88, 0x00009644)
Sep 27 10:21:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0811, 0x00013c88, 0x00009644)
Sep 27 10:21:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0813, 0x00013c88, 0x000096e8)
Sep 27 10:21:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0813, 0x00013c88, 0x000096e8)
Sep 27 10:21:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0812, 0x00013c88, 0x000096e8)
Sep 27 10:21:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0812, 0x00013c88, 0x000096e8)
Sep 27 10:21:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0814, 0x00013c88, 0x0000978c)
Sep 27 10:22:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0814, 0x00013c88, 0x0000978c)
Sep 27 10:22:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0813, 0x00013c88, 0x000097c0)
Sep 27 10:22:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0813, 0x00013c88, 0x000097c0)
Sep 27 10:22:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0815, 0x00013c88, 0x000097f4)
Sep 27 10:22:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0815, 0x00013c88, 0x000097f4)
Sep 27 10:22:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0814, 0x00013c88, 0x00009828)
Sep 27 10:22:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0814, 0x00013c88, 0x00009828)
Sep 27 10:22:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0816, 0x00013c88, 0x0000985c)
Sep 27 10:22:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0816, 0x00013c88, 0x0000985c)
Sep 27 10:22:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0815, 0x00013c88, 0x00009890)
Sep 27 10:22:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0815, 0x00013c88, 0x00009890)
Sep 27 10:22:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0817, 0x00013c88, 0x000098c4)
Sep 27 10:23:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0817, 0x00013c88, 0x000098c4)
Sep 27 10:23:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0816, 0x00013c88, 0x000098f8)
Sep 27 10:23:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0816, 0x00013c88, 0x000098f8)
Sep 27 10:23:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0818, 0x00013c88, 0x0000992c)
Sep 27 10:23:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0818, 0x00013c88, 0x0000992c)
Sep 27 10:23:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0817, 0x00013c88, 0x00009960)
Sep 27 10:23:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0817, 0x00013c88, 0x00009960)
Sep 27 10:23:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0819, 0x00013c88, 0x00009994)
Sep 27 10:23:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0819, 0x00013c88, 0x00009994)
Sep 27 10:23:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0818, 0x00013c88, 0x000099c8)
Sep 27 10:23:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0818, 0x00013c88, 0x000099c8)
Sep 27 10:23:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081a, 0x00013c88, 0x000099fc)
Sep 27 10:24:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081a, 0x00013c88, 0x000099fc)
Sep 27 10:24:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0819, 0x00013c88, 0x00009a30)
Sep 27 10:24:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0819, 0x00013c88, 0x00009a30)
Sep 27 10:24:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081b, 0x00013c88, 0x00009a64)
Sep 27 10:24:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081b, 0x00013c88, 0x00009a64)
Sep 27 10:24:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081a, 0x00013c88, 0x00009aa0)
Sep 27 10:24:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081a, 0x00013c88, 0x00009aa0)
Sep 27 10:24:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081c, 0x00013c88, 0x00009aa0)
Sep 27 10:24:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081c, 0x00013c88, 0x00009aa0)
Sep 27 10:24:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081d, 0x00013c88, 0x00009b44)
Sep 27 10:24:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081d, 0x00013c88, 0x00009b44)
Sep 27 10:24:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081b, 0x00013c88, 0x00009b78)
Sep 27 10:25:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081b, 0x00013c88, 0x00009b78)
Sep 27 10:25:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081e, 0x00013c88, 0x00009bac)
Sep 27 10:25:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081e, 0x00013c88, 0x00009bac)
Sep 27 10:25:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081c, 0x00013c88, 0x00009be0)
Sep 27 10:25:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081c, 0x00013c88, 0x00009be0)
Sep 27 10:25:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081f, 0x00013c88, 0x00009c14)
Sep 27 10:25:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081f, 0x00013c88, 0x00009c14)
Sep 27 10:25:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081d, 0x00013c88, 0x00009c48)
Sep 27 10:25:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081d, 0x00013c88, 0x00009c48)
Sep 27 10:25:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0820, 0x00013c88, 0x00009c7c)
Sep 27 10:25:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0820, 0x00013c88, 0x00009c7c)
Sep 27 10:25:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081e, 0x00013c88, 0x00009cb0)
Sep 27 10:26:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081e, 0x00013c88, 0x00009cb0)
Sep 27 10:26:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0821, 0x00013c88, 0x00009ce4)
Sep 27 10:26:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0821, 0x00013c88, 0x00009ce4)
Sep 27 10:26:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x081f, 0x00013c88, 0x00009d18)
Sep 27 10:26:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x081f, 0x00013c88, 0x00009d18)
Sep 27 10:26:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0822, 0x00013c88, 0x00009d4c)
Sep 27 10:26:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0822, 0x00013c88, 0x00009d4c)
Sep 27 10:26:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0820, 0x00013c88, 0x00009d80)
Sep 27 10:26:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0820, 0x00013c88, 0x00009d80)
Sep 27 10:26:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0823, 0x00013c88, 0x00009db4)
Sep 27 10:26:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0823, 0x00013c88, 0x00009db4)
Sep 27 10:26:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0821, 0x00013c88, 0x00009de8)
Sep 27 10:27:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0821, 0x00013c88, 0x00009de8)
Sep 27 10:27:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0824, 0x00013c88, 0x00009e1c)
Sep 27 10:27:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0824, 0x00013c88, 0x00009e1c)
Sep 27 10:27:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0822, 0x00013c88, 0x00009e58)
Sep 27 10:27:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0822, 0x00013c88, 0x00009e58)
Sep 27 10:27:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0825, 0x00013c88, 0x00009e58)
Sep 27 10:27:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0825, 0x00013c88, 0x00009e58)
Sep 27 10:27:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0826, 0x00013c88, 0x00009efc)
Sep 27 10:27:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0826, 0x00013c88, 0x00009efc)
Sep 27 10:27:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0823, 0x00013c88, 0x00009efc)
Sep 27 10:27:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0823, 0x00013c88, 0x00009efc)
Sep 27 10:27:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0827, 0x00013c88, 0x00009fa0)
Sep 27 10:28:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0827, 0x00013c88, 0x00009fa0)
Sep 27 10:28:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0824, 0x00013c88, 0x00009fa0)
Sep 27 10:28:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0824, 0x00013c88, 0x00009fa0)
Sep 27 10:28:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0828, 0x00013c88, 0x0000a044)
Sep 27 10:28:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0828, 0x00013c88, 0x0000a044)
Sep 27 10:28:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0825, 0x00013c88, 0x0000a044)
Sep 27 10:28:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0825, 0x00013c88, 0x0000a044)
Sep 27 10:28:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0829, 0x00013c88, 0x0000a0e8)
Sep 27 10:28:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0829, 0x00013c88, 0x0000a0e8)
Sep 27 10:28:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0826, 0x00013c88, 0x0000a11c)
Sep 27 10:28:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0826, 0x00013c88, 0x0000a11c)
Sep 27 10:28:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x082a, 0x00013c88, 0x0000a150)
Sep 27 10:29:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x082a, 0x00013c88, 0x0000a150)
Sep 27 10:29:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0827, 0x00013c88, 0x0000a184)
Sep 27 10:29:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0827, 0x00013c88, 0x0000a184)
Sep 27 10:29:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x082b, 0x00013c88, 0x0000a1b8)
Sep 27 10:29:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x082b, 0x00013c88, 0x0000a1b8)
Sep 27 10:29:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0828, 0x00013c88, 0x0000a1ec)
Sep 27 10:29:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0828, 0x00013c88, 0x0000a1ec)
Sep 27 10:29:36 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x082c, 0x00013c88, 0x0000a220)
Sep 27 10:29:43 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x082c, 0x00013c88, 0x0000a220)
Sep 27 10:29:46 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0829, 0x00013c88, 0x0000a254)
Sep 27 10:29:53 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0829, 0x00013c88, 0x0000a254)
Sep 27 10:29:56 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x082d, 0x00013c88, 0x0000a288)
Sep 27 10:30:03 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x082d, 0x00013c88, 0x0000a288)
Sep 27 10:30:06 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x082a, 0x00013c88, 0x0000a2bc)
Sep 27 10:30:13 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x082a, 0x00013c88, 0x0000a2bc)
Sep 27 10:30:16 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x082e, 0x00013c88, 0x0000a2f8)
Sep 27 10:30:23 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x082e, 0x00013c88, 0x0000a2f8)
Sep 27 10:30:26 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x082b, 0x00013c88, 0x0000a2f8)
Sep 27 10:30:33 bubz /usr/libexec/gdm-x-session[2164]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x082b, 0x00013c88, 0x0000a2f8)
Sep 27 10:30:33 bubz kernel: [ 4382.420896] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:30:33 bubz kernel: [ 4382.420913] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:30:33 bubz kernel: [ 4382.421209] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:30:33 bubz kernel: [ 4382.421223] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:30:33 bubz kernel: [ 4382.424278] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:30:33 bubz kernel: [ 4382.424293] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:30:33 bubz FFPWA-01J3VMRENE1PASWHSFQAGPMBTF.desktop[4725]: [GFX1-]: Failed to create EGLSurface!: 0x3003
Sep 27 10:30:33 bubz FFPWA-01J3VMRENE1PASWHSFQAGPMBTF.desktop[4725]: [GFX1-]: Failed to create EGLSurface. 1 renderers, 1 active.
Sep 27 10:30:33 bubz kernel: [ 4382.433366] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:30:33 bubz kernel: [ 4382.433384] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:30:33 bubz FFPWA-01J3VMRENE1PASWHSFQAGPMBTF.desktop[4725]: [GFX1-]: Handling webrender error 3
Sep 27 10:30:33 bubz FFPWA-01J3VMRENE1PASWHSFQAGPMBTF.desktop[4725]: [GFX1-]: Fallback WR to SW-WR
Sep 27 10:30:33 bubz gnome-shell[2336]: Window manager warning: META_CURRENT_TIME used to choose focus window; focus window may not be correct.
Sep 27 10:30:33 bubz kernel: [ 4382.452944] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 27 10:30:33 bubz kernel: [ 4382.452968] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 27 10:30:36 bubz /usr/libexec/gdm-x-session[2164]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00013c88, 0x0000a644)
Sep 27 10:30:43 bubz /usr/libexec/gdm-x-session[2164]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00013c88, 0x0000a644)
Sep 27 10:33:04 bubz xdg-desktop-por[2447]: Failed to get application states: GDBus.Error:org.freedesktop.portal.Error.Failed: Could not get window list

GPU temps were normal (56 C or under), usage was at 3%-12%, and VRAM usage was 700 MB of 6 GB. System uptime was under 3 hours.
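For anyone comparing their own logs against the excerpt above, here is a rough Python sketch that counts the X driver `WAIT` messages and the `nvidia-modeset` errors in a saved syslog-style log. The function name `summarize_nvidia_log` and the two regular expressions are my own assumptions based only on the lines pasted in this comment; other driver versions may word these messages differently.

```python
import re

# Patterns are assumptions modelled on the lines quoted in this thread.
WAIT_RE = re.compile(r"\((WW|EE)\) NVIDIA\((?:GPU-)?\d+\): WAIT")
MODESET_ERR_RE = re.compile(r"nvidia-modeset: ERROR")


def summarize_nvidia_log(lines):
    """Count WAIT warnings/errors and nvidia-modeset errors in log lines."""
    summary = {
        "wait_warnings": 0,
        "wait_errors": 0,
        "modeset_errors": 0,
        "first_wait": None,
        "first_modeset_error": None,
    }
    for line in lines:
        stamp = line[:15]  # syslog-style prefix, e.g. "Sep 27 10:25:26"
        match = WAIT_RE.search(line)
        if match:
            key = "wait_warnings" if match.group(1) == "WW" else "wait_errors"
            summary[key] += 1
            if summary["first_wait"] is None:
                summary["first_wait"] = stamp
        elif MODESET_ERR_RE.search(line):
            summary["modeset_errors"] += 1
            if summary["first_modeset_error"] is None:
                summary["first_modeset_error"] = stamp
    return summary
```

Calling it on `open("saved.log").read().splitlines()` (the file name is hypothetical) shows how long the `WAIT` warnings ran before the first modeset error.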

I'll refresh Pop!_OS from an older ISO I found, hold back all updates, and see what happens.

Vetpetmon commented 1 month ago

I have reinstalled Pop!_OS 22.04 with a live USB image from July:

Linux bubz 6.9.3-76060903-generic #202405300957~1718348209~22.04~7817b67 SMP PREEMPT_DYNAMIC Mon J x86_64 x86_64 x86_64 GNU/Linux

My NVIDIA drivers are now back on v550.67.

I have preemptively held back the kernel and driver updates. I ran a GPU-intensive app for 1 hour twice today, with 15 minutes in between, and the PC stayed up for another hour afterwards (then a power outage had me rushing to turn it off). The GPU has not fallen off the bus so far. I know it happens semi-randomly, but the most reliable way to reproduce the error was to leave the PC alone after closing a 3D game. Sometimes the kernel log showed pcieport messages just before the failure; other times AER was not triggered, and the kernel instead just reported that xhci_hcd was unable to change power states before the GPU fell off the bus.
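To check which of those precursors shows up before a given failure, a small Python sketch like the one below could scan a saved kernel log. It is only an illustration: the function name `find_xid_precursors`, the `window` size, and the list of precursor patterns are assumptions drawn from the messages quoted in this thread, not an exhaustive set.

```python
import re

# Matches the NVRM Xid report, e.g. "NVRM: Xid (PCI:0000:01:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")

# Precursor hints seen in this thread (assumed, not exhaustive).
PRECURSOR_RE = re.compile(
    r"pcieport|\bAER\b|xhci_hcd|Unable to change power state"
)


def find_xid_precursors(lines, xid=79, window=20):
    """Return (xid_line, precursor_lines) for each matching Xid report.

    Only the `window` lines immediately before each report are inspected.
    """
    results = []
    for index, line in enumerate(lines):
        match = XID_RE.search(line)
        if match and int(match.group(2)) == xid:
            start = max(0, index - window)
            hits = [
                earlier for earlier in lines[start:index]
                if PRECURSOR_RE.search(earlier)
            ]
            results.append((line, hits))
    return results
```

The input can be the kernel messages of the previous boot saved to a file, for example from `journalctl -k -b -1`, read with `splitlines()`. An empty precursor list for a report would match the cases where nothing was logged before the GPU dropped.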

If I still cannot replicate the issue in 2 weeks, that should serve as confirmation that it's an issue with the kernel or the drivers provided by System76 as of late August, and the only workaround is to have an older ISO file lying around, which still isn't an ideal situation.

ziprasidone146939277 commented 1 month ago

Since you have System76 hardware, I recommend opening a support ticket: https://support.system76.com

I've been having several freeze problems with my oryxp9 system and hardware for quite some time, thinking it was some kind of kernel or nvidia driver problem. Now, after the release of 560.35.03, the problems persist.

I would like to open a support ticket, but I can't find how to do it at https://support.system76.com/ because the links create a "loop" in the search. Is there an email? In case it helps, I'm leaving the journal from the exact moment of the crash.

My scenario is very similar to the theories in Vetpetmon's last comment.

Greetings

Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: queue async flip during flip on CRTC 0 failed: Invalid argument
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): Present-flip: detected too frequent flip errors, disabling logs until frequency is reduced
Sep 28 23:34:22 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): flip queue retry
Sep 28 23:34:25 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): flip queue retry
Sep 28 23:34:25 oryx.wonder.boy /usr/libexec/gdm-x-session[2865]: (WW) modeset(0): flip queue retry
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM: GPU at PCI:0000:01:00: GPU-027fb08b-c156-d288-24dc-9faa6eefa32e
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM: GPU0 GSP RPC buffer contains function 78 (DUMP_PROTOBUF_COMPONENT) and data 0x0000000000000000 0x0000000000000000.
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:      0    76   GSP_RM_CONTROL        0x000000002080a0d1 0x0000000000000658 0x00062338ed47114c 0x0000000000000000          y
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -1    76   GSP_RM_CONTROL        0x000000002080a0d1 0x0000000000000658 0x00062338ed37b1ce 0x00062338ed37b831   1635us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -2    76   GSP_RM_CONTROL        0x000000002080e634 0x0000000000000188 0x00062338ed33e72d 0x00062338ed342142  14869us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -3    76   GSP_RM_CONTROL        0x000000002080a7d7 0x0000000000000002 0x00062338ed298a54 0x00062338ed298d5a    774us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -4    76   GSP_RM_CONTROL        0x000000002080a0d1 0x0000000000000658 0x00062338ed284cf1 0x00062338ed2851e8   1271us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -5    76   GSP_RM_CONTROL        0x000000002080a0c5 0x0000000000000510 0x00062338ed1a2c97 0x00062338ed1a3008    881us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -6    76   GSP_RM_CONTROL        0x000000002080a0d1 0x0000000000000658 0x00062338ed187905 0x00062338ed187a7f    378us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -7    76   GSP_RM_CONTROL        0x000000002080e634 0x0000000000000188 0x00062338ed0c1b00 0x00062338ed0c2c57   4439us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:      0    4099 POST_EVENT            0x0000000000000021 0x0000000000000020 0x00062338ed37b441 0x00062338ed37b45c     27us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -1    4099 POST_EVENT            0x0000000000000021 0x0000000000000100 0x00062338ed326151 0x00062338ed326168     23us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -2    4099 POST_EVENT            0x0000000000000021 0x0000000000000020 0x00062338ed284f16 0x00062338ed284f27     17us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -3    4099 POST_EVENT            0x0000000000000021 0x0000000000000001 0x00062338ed1a2d33 0x00062338ed1a2d47     20us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -4    4099 POST_EVENT            0x0000000000000021 0x0000000000000008 0x00062338ed1153f6 0x00062338ed115414     30us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -5    4099 POST_EVENT            0x0000000000000021 0x0000000000000001 0x00062338ed0c1a38 0x00062338ed0c1a56     30us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -6    4099 POST_EVENT            0x0000000000000021 0x0000000000000100 0x00062338baa62326 0x00062338baa62338     18us  
Sep 28 23:34:26 oryx.wonder.boy kernel: NVRM:     -7    4099 POST_EVENT            0x0000000000000021 0x0000000000000020 0x00062338ba9eb027 0x00062338ba9eb044     29us  
Sep 28 23:34:26 oryx.wonder.boy kernel: CPU: 4 PID: 14264 Comm: [vkps] Update Tainted: P           OE      6.9.3-76060903-generic #202405300957~1726766035~22.04~4092a0e
Sep 28 23:34:26 oryx.wonder.boy kernel: Hardware name: System76 Oryx Pro/Oryx Pro, BIOS 2023-09-08_42bf7a6 09/08/2023
Sep 28 23:34:26 oryx.wonder.boy kernel: Call Trace:
Sep 28 23:34:26 oryx.wonder.boy kernel:  <TASK>
Sep 28 23:34:26 oryx.wonder.boy kernel:  dump_stack_lvl+0x76/0xa0
Sep 28 23:34:26 oryx.wonder.boy kernel:  dump_stack+0x10/0x20
Sep 28 23:34:26 oryx.wonder.boy kernel:  os_dump_stack+0xe/0x20 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  _nv012948rm+0x2c5/0x590 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel: WARNING: kernel stack frame pointer at 000000009a629704 in [vkps] Update:14264 has bad value 00000000d67f6c87
Sep 28 23:34:26 oryx.wonder.boy kernel: unwind stack type:0 next_sp:0000000000000000 mask:0x2 graph_idx:0
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000078a74efc: ffffc031c2d7b978 (0xffffc031c2d7b978)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000005a9de2c8: ffffffffb185af39 (show_trace_log_lvl+0x269/0x420)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000002b3fa54: ffffffffb3393e31 (linux_banner+0x3f2db1/0x40b560)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000038a82ab8: ffff99dafb49d200 (0xffff99dafb49d200)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000e7058fa1: ffffffffb33c6821 (SIGMA2+0x19de1/0x142400)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c9b801b7: ffffc031c2d7b9d0 (0xffffc031c2d7b9d0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000040238f84: 000000000000003d (0x3d)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000005c2626f8: 0000000000000002 (0x2)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000059aa4c7f: 0000000000000001 (0x1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000802cea19: ffffc031c2d78000 (0xffffc031c2d78000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000224c2bfd: ffffc031c2d7c000 (0xffffc031c2d7c000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009c999318: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000087221429: ffffc031c2d78000 (0xffffc031c2d78000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000fd309b04: ffffc031c2d7c000 (0xffffc031c2d7c000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000619f8250: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000280e5718: 0000000000000002 (0x2)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000420bd599: ffff99dafb49d200 (0xffff99dafb49d200)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000000dafee0b: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000004071f621: 0000000000000001 (0x1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000000d22e0dd: ffffc031c2d7b9c8 (0xffffc031c2d7b9c8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c1bd72a9: ffffc031c2d7b878 (0xffffc031c2d7b878)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000049657040: ffffffffc1edfca5 (_nv012948rm+0x2c5/0x590 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000e06d6856: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000006bacd85b: ab98f14b84a55600 (0xab98f14b84a55600)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002274aec1: 0000000000000246 (0x246)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000d72fc501: ffffffffb33c6821 (SIGMA2+0x19de1/0x142400)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000001feb3c04: ffff99da16d84008 (0xffff99da16d84008)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000cad595e0: 000000000000004c (0x4c)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009c4fc9af: 000000000000003d (0x3d)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000008263a5c3: ffffc031c2d7b988 (0xffffc031c2d7b988)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009c7c8b09: ffffffffb185b210 (show_stack+0x20/0x70)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000006ffbb685: ffffc031c2d7b9a8 (0xffffc031c2d7b9a8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000008d2d6f31: ffffffffb2976ce6 (dump_stack_lvl+0x76/0xa0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000d2306626: ffff99da20840008 (0xffff99da20840008)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000502567ae: ffff99da19bf0008 (0xffff99da19bf0008)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000078d5325f: ffffc031c2d7b9b8 (0xffffc031c2d7b9b8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000419e5b57: ffffffffb2976d30 (dump_stack+0x10/0x20)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000dc11a421: ffffc031c2d7b9c8 (0xffffc031c2d7b9c8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000001eb4e7cc: ffffffffc188dace (os_dump_stack+0xe/0x20 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009a629704: ffff99daae455d70 (0xffff99daae455d70)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000069e4a8f4: ffffffffc1edfca5 (_nv012948rm+0x2c5/0x590 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ffac8517: 000000002080a0d1 (0x2080a0d1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000e4db212e: 0000000000000658 (0x658)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000d9dfa415: ffff99da16d84008 (0xffff99da16d84008)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000008cef75de: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000943b0fda: ffff99da19bf0008 (0xffff99da19bf0008)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000000978bfc: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bbbd3534: 000000000000004c (0x4c)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c7da3938: ffffffffc2202667 (_nv012865rm+0x77/0x330 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a2837ee0: ffff99da20861050 (0xffff99da20861050)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000004004cac6: ffff99daae455df0 (0xffff99daae455df0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000fed739cd: ffff99da19bf0008 (0xffff99da19bf0008)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009c354a4d: 000000002080a0d1 (0x2080a0d1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000d3f7eea7: ffff99da16d84008 (0xffff99da16d84008)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009b1203cd: ffffffffc221e7df (_nv048628rm+0x49f/0x7f0 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000052c5b5d4: ffff99da0162b160 (0xffff99da0162b160)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000dc332478: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000023f2a53b: ffffc031c2d7bac8 (0xffffc031c2d7bac8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bec6ce9e: ffff99daae464808 (0xffff99daae464808)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000b8f1cbf0: 0000000000000020 (0x20)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000730c6f8a: ffffffffc197df43 (_nv000720rm+0x173/0x320 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ec513abb: ffffffffc263cce0 (nv_kthread_q+0x40/0xfffffffffff16360 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c3e0ba0e: ffff99da0162b160 (0xffff99da0162b160)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000001f4da512: ffffffffc197ddd0 (_nv000691rm+0x1a0/0x1a0 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000008f4feb8a: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000f9874e2a: ffffffffc263cce0 (nv_kthread_q+0x40/0xfffffffffff16360 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000589ab263: ffffffffc1930cbd (_nv013137rm+0x3d/0xa0 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000007b4a2d0c: ffff99dbae3d6000 (0xffff99dbae3d6000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bb58bbe6: 000000000000002a (0x2a)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002617a440: ffff99da0162b160 (0xffff99da0162b160)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000042263253: ffffffffc2440262 (_nv000745rm+0x8d2/0xe00 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000004f84e337: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000073de9f61: ffff99dbae3d6000 (0xffff99dbae3d6000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000047c02aae: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bb53d8b1: ffff99daae453000 (0xffff99daae453000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000e6108fa3: 000000000000002a (0x2a)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a670c5c0: 0000000000000020 (0x20)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000da288e80: ffffc031c2d7bc70 (0xffffc031c2d7bc70)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a0d1da38: ffff99da0162b160 (0xffff99da0162b160)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000008d868161: ffffffffc2446dbf (rm_ioctl+0x7f/0x400 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000003c42bc13: ffffc031c2d7bb18 (0xffffc031c2d7bb18)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a39cbbee: ffffffffc263cce0 (nv_kthread_q+0x40/0xfffffffffff16360 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000feb9c5ff: ffff99dbae3d6000 (0xffff99dbae3d6000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000b4af9574: ffffc031c2d7bb28 (0xffffc031c2d7bb28)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a8be0e43: ffff99dafb49d200 (0xffff99dafb49d200)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000006c8ac656: ffff99da02811b44 (0xffff99da02811b44)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000004c8d5e7e: 00000000000037b8 (0x37b8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c32da55a: 00000001000a53a4 (0x1000a53a4)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000066a0621e: 000f499e9474c1c0 (0xf499e9474c1c0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000354f306e: 000f499f82dfe9c0 (0xf499f82dfe9c0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000020ed5885: 000f49a590986dc0 (0xf49a590986dc0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bd4b7a43: 000f499f0baa55c0 (0xf499f0baa55c0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a550cd06: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000041084e10: 0000012000000004 (0x12000000004)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000949aee72: 00000000000037b8 (0x37b8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000042bf9361: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ea78acbd: fffffff000000000 (0xfffffff000000000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a03181e1: ffffffffc2675450 (_nv046641rm+0x90/0xffffffffffeddc40 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000091e2f3b2: 0000000000000010 (0x10)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000086108207: 000000000000002a (0x2a)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000d33e3c5d: ffff99da0162b160 (0xffff99da0162b160)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000757b8425: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000044bd8de: 0000000000000020 (0x20)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000018adc8e5: ffff99dbae3d6000 (0xffff99dbae3d6000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000de77458e: ffffffffc187d9fa (nvidia_unlocked_ioctl+0x69a/0x910 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000008ee3b466: ffffffffc263cce0 (nv_kthread_q+0x40/0xfffffffffff16360 [nvidia])
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000006cdc1fd9: ffff99daae453000 (0xffff99daae453000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ecbd3630: 0000792410ffe300 (0x792410ffe300)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ac04496c: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000076080858: ab98f14b84a55600 (0xab98f14b84a55600)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000008244d196: 0000000000000026 (0x26)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000003db5b3b4: ffff99daaa74ca00 (0xffff99daaa74ca00)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ea7177d1: 00000000c020462a (0xc020462a)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000b2424361: 0000792410ffe300 (0x792410ffe300)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000dd975dc8: ffff99daaa74ca01 (0xffff99daaa74ca01)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000aad4106f: ffffc031c2d7bca8 (0xffffc031c2d7bca8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009b7cbbea: ffffffffb1d13310 (__x64_sys_ioctl+0xa0/0xf0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000798e7d3f: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a9906699: 0000000000000010 (0x10)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000b9849fe7: 0000000000000010 (0x10)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000740c902f: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bb112d11: ffffc031c2d7bcb8 (0xffffc031c2d7bcb8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a9734045: ffffffffb1805c18 (x64_sys_call+0xa68/0x24b0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002787874c: ffffc031c2d7bf48 (0xffffc031c2d7bf48)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bc34974c: ffffffffb2a1d5c0 (do_syscall_64+0x80/0x170)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000600ac25f: 0000000000000001 (0x1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000003c56ea4f: ffffc031c2d7bcc8 (0xffffc031c2d7bcc8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009310dbf2: ffff99da22611080 (0xffff99da22611080)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000031249dfb: 000079253d748000 (0x79253d748000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000f1ae0625: 0000000000000920 (0x920)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009dd4f913: ab98f14b84a55600 (0xab98f14b84a55600)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000d5e88418: 000000000000006e (0x6e)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000078b0fe1b: 0000000000000081 (0x81)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002d90301b: 0000000000000001 (0x1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000f418b291: 0000000000000081 (0x81)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000767487bd: 0000000000000001 (0x1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c6b9f643: ffffc031c2d7bd38 (0xffffc031c2d7bd38)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002444120d: ffffffffb1a28578 (do_futex+0x128/0x230)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000071a36648: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000035d957f9: ffffc031c2d7bdb8 (0xffffc031c2d7bdb8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000deb971a2: ffffffffb1a28d55 (__x64_sys_futex+0x95/0x200)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000cb1de821: 00000000ffffffff (0xffffffff)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000079ca0e3a: 0000000000000001 (0x1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000012db1bb2: 0000000000000081 (0x81)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000fdbfcf2b: 0000000000000001 (0x1)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000e34d0400: ffffc031c2d7bd80 (0xffffc031c2d7bd80)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bd6676ee: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000d653a54b: ab98f14b84a55600 (0xab98f14b84a55600)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000593d794e: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000044fefcdc: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bfc6e53b: ffffc031c2d7bdc8 (0xffffc031c2d7bdc8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000db1128bd: ffffffffb2a24bc1 (syscall_exit_to_user_mode+0x81/0x270)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000677691e7: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000fc75c292: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c630d4a7: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000df86cfe8: ffffc031c2d7bf48 (0xffffc031c2d7bf48)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000080154d86: ffffffffb2a1d5cc (do_syscall_64+0x8c/0x170)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000023d21365: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000003433188c: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000008273ce25: ffffc031c2d7be10 (0xffffc031c2d7be10)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000602c4416: ffffffffb2a24bc1 (syscall_exit_to_user_mode+0x81/0x270)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000004f8b60e2: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000012f7f0a9: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000004f4a0aa2: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000090bfbba6: ffffc031c2d7bf48 (0xffffc031c2d7bf48)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bcb47010: ffffffffb2a1d5cc (do_syscall_64+0x8c/0x170)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000004f82a0b4: ffff99dafb49d200 (0xffff99dafb49d200)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000058cabcc1: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000001fd487b8: ffffc031c2d7be58 (0xffffc031c2d7be58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ee0f5926: ffffffffb2a24bc1 (syscall_exit_to_user_mode+0x81/0x270)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002a455724: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a32d9c25: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000003ef6a1fd: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000001fa0bdf8: ffffc031c2d7bf48 (0xffffc031c2d7bf48)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000cfac601e: ffffffffb2a1d5cc (do_syscall_64+0x8c/0x170)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c3f41c62: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002acdbebd: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000fb24ecdf: ffffc031c2d7bea0 (0xffffc031c2d7bea0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000073b613f0: ffffffffb2a24bc1 (syscall_exit_to_user_mode+0x81/0x270)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000079cad60e: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000db2e7f95: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000003a6d55fc: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000038d0128e: ffffc031c2d7bf48 (0xffffc031c2d7bf48)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000004451441a: ffffffffb2a1d5cc (do_syscall_64+0x8c/0x170)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000028b14ef3: ffffc031c2d7bed8 (0xffffc031c2d7bed8)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002ba1ad5b: ffffffffb186a9a0 (switch_fpu_return+0x50/0xe0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ff44cc7a: 0000000000004000 (0x4000)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000ab69a865: ffff99dafb49d200 (0xffff99dafb49d200)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000007cc0d133: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000e0d4fd90: ffffc031c2d7bf00 (0xffffc031c2d7bf00)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000f9cfdc3c: ffffffffb2a24bc1 (syscall_exit_to_user_mode+0x81/0x270)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000028fda66a: ffffc031c2d7bf58 (0xffffc031c2d7bf58)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000a34b5501: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000e1cc5531: 00000000000000ca (0xca)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000d1157c5a: ffffc031c2d7bf48 (0xffffc031c2d7bf48)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000000e6b33d1: ffffffffb2a1d5cc (do_syscall_64+0x8c/0x170)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000087d0cf7a: ffffffffb2a1d5cc (do_syscall_64+0x8c/0x170)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000bd367644: ffffffffb2a1d5cc (do_syscall_64+0x8c/0x170)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000006accd62e: ffffc031c2d7bf48 (0xffffc031c2d7bf48)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000096f0cef2: ffffffffb2a238a7 (sysvec_apic_timer_interrupt+0x57/0xc0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000009426a63f: 0000000000000000 ...
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000006a44d8a8: ffffffffb2c0012b (entry_SYSCALL_64_after_hwframe+0x76/0x7e)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000087d5e195: 0000792410ffe150 (0x792410ffe150)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002c8deccb: 0000000066f8bcb2 (0x66f8bcb2)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000f85321ea: 0000792410ffe31c (0x792410ffe31c)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000b5eaface: 0000000000000026 (0x26)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000e27a1b8e: 00000000c020462a (0xc020462a)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000000a4a1b1c: 0000792410ffe300 (0x792410ffe300)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000020dde692: 0000000000000246 (0x246)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000038d9d468: 0000792410fff350 (0x792410fff350)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000b0279a19: 0000792410ffe31c (0x792410ffe31c)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000005da67bed: 0000792410ffe300 (0x792410ffe300)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000aa1c4f32: ffffffffffffffda (0xffffffffffffffda)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000288e8b33: 000079255f51a94f (0x79255f51a94f)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000eda0c7ba: 0000792410ffe300 (0x792410ffe300)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000c8f63e2d: 00000000c020462a (0xc020462a)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000040900093: 0000000000000026 (0x26)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000aa294582: 0000000000000010 (0x10)
Sep 28 23:34:26 oryx.wonder.boy kernel: 0000000069f7d4ca: 000079255f51a94f (0x79255f51a94f)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000340dac93: 0000000000000033 (0x33)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000005b4d02d6: 0000000000000246 (0x246)
Sep 28 23:34:26 oryx.wonder.boy kernel: 00000000afa954c3: 0000792410ffe0f0 (0x792410ffe0f0)
Sep 28 23:34:26 oryx.wonder.boy kernel: 000000002c994446: 000000000000002b (0x2b)
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? _nv012865rm+0x77/0x330 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? _nv048628rm+0x49f/0x7f0 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? _nv000720rm+0x173/0x320 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? _nv000691rm+0x1a0/0x1a0 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? _nv013137rm+0x3d/0xa0 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? _nv000745rm+0x8d2/0xe00 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? rm_ioctl+0x7f/0x400 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? nvidia_unlocked_ioctl+0x69a/0x910 [nvidia]
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? __x64_sys_ioctl+0xa0/0xf0
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? x64_sys_call+0xa68/0x24b0
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_syscall_64+0x80/0x170
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_futex+0x128/0x230
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? __x64_sys_futex+0x95/0x200
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? syscall_exit_to_user_mode+0x81/0x270
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_syscall_64+0x8c/0x170
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? syscall_exit_to_user_mode+0x81/0x270
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_syscall_64+0x8c/0x170
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? syscall_exit_to_user_mode+0x81/0x270
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_syscall_64+0x8c/0x170
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? syscall_exit_to_user_mode+0x81/0x270
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_syscall_64+0x8c/0x170
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? switch_fpu_return+0x50/0xe0
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? syscall_exit_to_user_mode+0x81/0x270
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_syscall_64+0x8c/0x170
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_syscall_64+0x8c/0x170
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? do_syscall_64+0x8c/0x170
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? sysvec_apic_timer_interrupt+0x57/0xc0
Sep 28 23:34:26 oryx.wonder.boy kernel:  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
Sep 28 23:34:26 oryx.wonder.boy kernel:  </TASK>
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Failed to get topology status f
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:27 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:29 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Failed to get topology status f
Sep 28 23:34:29 oryx.wonder.boy kernel: NVRM: Error in service of callback 
Sep 28 23:34:29 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:29 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:29 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:29 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:30 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Failed to get topology status f
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:31 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:32 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Failed to get topology status f
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:33 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: error setting power limit
Sep 28 23:34:34 oryx.wonder.boy /usr/bin/nvidia-powerd[813]: Error setting GPU limit: 140000.

Vetpetmon commented 1 month ago

So, reinstalling the OS did stabilize the issue, to a degree. It crashed again, this time while I was watching a YouTube video through Steam.

Kernlog: crash.log

Vetpetmon commented 1 month ago

@ziprasidone146939277 My modeset is set to 1, and I am on my own hardware. Run nvidia-smi and then uname -a in the terminal

I am running with NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 on Linux bubz 6.9.3-76060903-generic #202405300957~1718348209~22.04~7817b67 SMP PREEMPT_DYNAMIC Mon J x86_64 x86_64 x86_64 GNU/Linux

These errors before my latest crash may be related:

Oct  1 08:49:42 bubz kernel: [116386.643392] workqueue: nv_drm_handle_hotplug_event [nvidia_drm] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
Oct  1 08:49:42 bubz kernel: [116386.782393] workqueue: nv_drm_handle_hotplug_event [nvidia_drm] hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
Oct  1 14:35:39 bubz kernel: [137143.453584] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to grab modeset ownership
Oct  1 14:35:46 bubz kernel: [137150.369191] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to grab modeset ownership
Oct  1 15:00:06 bubz kernel: [138610.475590] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to grab modeset ownership
Oct  1 15:00:11 bubz kernel: [138614.514390] vlc[329586]: segfault at 20 ip 000077d50aaa9717 sp 00007ffdefe5a540 error 4 in libnvidia-glcore.so.550.67[77d50a400000+c00000] likely on CPU 10 (core 5, socket 0)
Oct  1 15:00:11 bubz kernel: [138614.514422] Code: 83 c5 10 45 31 e4 31 ed eb 19 66 0f 1f 84 00 00 00 00 00 83 c5 01 49 83 c4 70 41 39 ed 0f 84 4c ff ff ff 48 8b 83 e0 00 00 00 <4a> 83 7c 20 20 00 74 e1 31 d2 89 ee 48 89 df e8 15 5d ff ff eb d3
Oct  1 15:05:15 bubz kernel: [138918.663506] AppRun.wrapped[325914]: segfault at 0 ip 000000000057bf14 sp 00007ffcc1284100 error 4 in kdenlive[400000+a45000] likely on CPU 3 (core 4, socket 0)
Oct  1 15:05:15 bubz kernel: [138918.663525] Code: c9 31 d2 31 f6 e8 1c 76 00 00 48 8b 84 24 98 00 00 00 48 8d 54 24 7f 4c 89 e7 48 8d b4 24 88 00 00 00 48 8d ac 24 80 00 00 00 <f2> 0f 10 00 f2 0f 11 84 24 88 00 00 00 e8 6a 24 ff ff 80 7c 24 7f
Oct  1 15:09:30 bubz kernel: [139173.578932] AppRun.wrapped[424088]: segfault at 0 ip 000000000057bf14 sp 00007ffd1d4edb90 error 4 in kdenlive[400000+a45000] likely on CPU 3 (core 4, socket 0)
Oct  1 15:09:30 bubz kernel: [139173.578952] Code: c9 31 d2 31 f6 e8 1c 76 00 00 48 8b 84 24 98 00 00 00 48 8d 54 24 7f 4c 89 e7 48 8d b4 24 88 00 00 00 48 8d ac 24 80 00 00 00 <f2> 0f 10 00 f2 0f 11 84 24 88 00 00 00 e8 6a 24 ff ff 80 7c 24 7f
Oct  1 16:04:37 bubz kernel: [142481.081097] workqueue: nv_drm_handle_hotplug_event [nvidia_drm] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND

After that, the log devolved into hundreds of errors, still starting at pcieport 0000:00:03.1, which eventually caused the graphics card to freak out. I don't know if this is a bad config or bad drivers. Either way, before I reinstalled Pop in early August, I didn't have issues with nvidia 550, which leads me to believe an ISO file dating back to 2023 is the only way to fix it for sure.

There's possibly a bad config setting in the newer system images that isn't present in the older images. Configs do not typically change without a full reinstall. I'll see if I have a working Timeshift restore point, which is easier said than done: the HDD that held my oldest full system backup has decided it had enough.

mmstick commented 1 month ago

You can install nvidia-driver-550-server if necessary. It may be worth providing input on the NVIDIA forum's 560 driver bug thread. https://forums.developer.nvidia.com/t/560-release-feedback-discussion/300830/407

tedliosu commented 1 month ago

I just thought I'd add that I have a System76 Gazelle and have been dealing with this issue for about a year and a half now.

At some point not too long after I first got this laptop, I started to have the same issue you're describing. I opened a ticket and spent a while diagnosing it with System76 support, and eventually I sent my laptop in and they replaced the mainboard, and the whole process of sending it in and getting it back took a few weeks. After I got it back, I continued to have the same problem. I really couldn't afford to be without my work computer for a few more weeks, so I've just gotten used to having my laptop randomly lock up when I'm away from it.

I spent a while trying to debug it and found out that this issue happens specifically when the GPU wakes up from being in a low-power mode, and running a low-power process that's constantly touching the GPU (like glxgears) seems to alleviate the issue to a degree. Without it, my laptop often locks up at least once a day, sometimes more often, and on occasion even while I'm using it; if I just leave glxgears running in a corner, it will often be fine for several days at a time, sometimes over a week.

The interesting thing is that sometime recently, I realized my laptop had reached a point where it had been running over three weeks solid without freezing. I suspect that nvidia-driver-550 in specific may have done something to help, because I updated to nvidia-driver-555 a week ago and the problem suddenly resumed; now I'm getting freezes regularly again.

I just tried installing nvidia-driver-550-server, and I'm running on that now. No freezes yet, but it's only been 30 minutes, so it remains to be seen if that will work as well as nvidia-driver-550. I really wish System76 would preserve at least their last few releases on their apt server...

I'm running Ubuntu 22.04 here. Funnily enough, I previously got a "GPU has fallen off the bus" error on driver version 550, which prompted me to upgrade to 560, and I got the same error again today. Both drivers are from the NVIDIA CUDA compute repos, since I have to use CUDA for both personal and school projects. @pjreed, are you running an Intel or AMD CPU, by the way? I'm running an Intel 11400H with my 3050 laptop, and this unix stackexchange post also details someone with a similar issue, albeit on an Intel server rather than an Intel laptop. I'm simply trying to determine whether this issue is platform specific.

Vetpetmon commented 1 month ago

@tedliosu Can confirm it's not specific to a hardware platform. Stock Ubuntu 22.04 also having issues is very concerning, though, as it confirms this is a widespread issue for Ubuntu 22.04 and any of its flavors or forks. Ubuntu's kernels lag further behind than Pop!_OS's, so we can now eliminate the kernel from the list of possible points of failure. That leaves us with GNOME and the NVIDIA drivers.

Later this week, I will try out Pop!_OS COSMIC (Based on Ubuntu 24.04) after work. I will collect as much information as I can in the meantime.

CPU: Ryzen 5 2600
GPU: GTX 1660 (TU116 chipset)
Motherboard: ASUS PRIME A320M-K (Rev X.0x)
OS: Pop!_OS 22.04 LTS
DE: GNOME 42.9, X11 display server

tedliosu commented 1 month ago

@Vetpetmon Thank you for the information, but I am specifically curious whether the GPU falling off the bus while the system is idle is platform specific. You have run into the error while your system is under load, whereas my GPU has never fallen off the bus under load in the several years I've had the laptop. Here's the output of fastfetch --logo none on my end for reference:

tedliosu@victus-ted
-------------------
OS: Ubuntu jammy 22.04 x86_64
Host: Victus by HP Laptop 16-d0xxx
Kernel: Linux 6.8.0-40-generic
Uptime: 2 hours, 6 mins
Packages: 4092 (dpkg), 5 (flatpak)
Shell: bash 5.1.16
Display (CMN1606): 1920x1080 @ 60 Hz in 16″ [Built-in] *
Display (ASUS VP247): 1920x1080 @ 60 Hz in 24″ [External]
DE: Xfce4 4.16
WM: Xfwm4 (X11)
WM Theme: Greybird-dark-accessibility
Theme: Greybird-dark [GTK2/3/4]
Icons: elementary-xfce-darker [GTK2/3/4]
Font: Sans (12pt) [GTK2/3/4]
Cursor: DMZ-White
Terminal: xfce4-terminal 0.8.10
Terminal Font: Noto Mono (12pt)
CPU: 11th Gen Intel(R) Core(TM) i5-11400H (12) @ 4.50 GHz
GPU 1: NVIDIA GeForce RTX 3050 Mobile [Discrete]
GPU 2: Intel UHD Graphics @ 1.45 GHz [Integrated]
Memory: 4.30 GiB / 30.99 GiB (14%)
Swap: 0 B / 32.00 GiB (0%)
Disk (/): 400.64 GiB / 883.33 GiB (45%) - ext4
Local IP (wlo1): 192.168.50.155/24
Battery (Primary): 100% [AC Connected]
Locale: en_US.UTF-8

And output of sudo -H dmidecode | grep --after-context=4 "Base Board" with the serial number redacted:

Base Board Information
    Manufacturer: HP
    Product Name: 88F9
    Version: 88.58
    Serial Number: [redacted]
    Asset Tag: Base Board Asset Tag
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis: Base Board Chassis Location
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0

Vetpetmon commented 1 month ago

@tedliosu I've had it happen under heavy load, once shortly after boot, many times just after a heavy load, and a number of times under minimal load (idle).

I've also noticed in nvidia-smi -q that the PCIe link width, while it can reach 16x, sometimes starts at 8x and only goes up to 16x after some condition is met. I can't check directly because the whole system locks up, but after more research into PCIe links and hotplugging, I think the GPU might be falling off when it decreases its link width from 16x back to 8x. I don't know for sure, though: I can't collect the error log after it falls off, and the window is too short to collect the data on a PC with no external access set up.
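
One way to capture the link state right up to a lockup is to poll it and force every sample onto disk. The following is a minimal Python sketch, not a tested diagnostic tool. It assumes the kernel exposes the standard PCI sysfs attributes current_link_width and current_link_speed for the GPU. The PCI address and the function names are placeholders chosen for illustration.

```python
import os
import time
from pathlib import Path

# Placeholder PCI address (an assumption); find yours with `lspci | grep -i nvidia`.
GPU_SYSFS = Path("/sys/bus/pci/devices/0000:01:00.0")


def read_link_state(device_dir):
    """Return the current PCIe link width and speed for one device.

    A missing or unreadable attribute (for example once the device is
    gone) is reported as "unavailable" instead of raising.
    """
    state = {}
    for attr in ("current_link_width", "current_link_speed"):
        try:
            state[attr] = (Path(device_dir) / attr).read_text().strip()
        except OSError:
            state[attr] = "unavailable"
    return state


def log_link_state(device_dir, log_path, samples, interval=1.0):
    """Append one timestamped line per sample and sync each line to disk."""
    with open(log_path, "a") as log:
        for i in range(samples):
            state = read_link_state(device_dir)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} width={state['current_link_width']} "
                      f"speed={state['current_link_speed']}\n")
            log.flush()
            os.fsync(log.fileno())  # so the last line survives a hard lockup
            if i + 1 < samples:
                time.sleep(interval)
```

Calling log_link_state(GPU_SYSFS, "pcie_link.log", samples=3600) would record roughly an hour at one sample per second; after a forced reboot, the last lines of the file show the link state just before the freeze. Polling might itself influence power-state transitions, so the results should be read with care.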

tedliosu commented 1 month ago

@Vetpetmon Sorry if this sounds dumb, and I understand that you've already tried reseating the GPU in the PCIe slot, but have you checked to make sure the PCIe power cables are properly seated in your GPU? Some Gentoo user a while back discovered that the culprit for them was a loose power cable and I just want to make sure that it isn't the case for you.

Vetpetmon commented 1 month ago

@tedliosu Here's a list of everything I've done:

ziprasidone146939277 commented 1 month ago
  • Modified kernel boot parameters to disable ASPM, AER, and MMCONF to confirm it's not a BIOS issue

Along these lines, what I've tried, based on Chapter 9 (Known Issues), is the ibt=off kernel parameter, with no success after a few days. Furthermore, this should, in theory, already be covered by the kernel version.

My suspicion is that, in my case, this is an nvidia-powerd-related issue, but I can't provide evidence for that yet.

Vetpetmon commented 1 month ago

My PCIe link width switched from 8x to 16x after suspend and wake. There's a good chance that if it drops power levels, or (theoretically) drops to the normal 8x link width seen after boot, it could "fall off". [image]

I do think this is an issue with the DRM in the NVIDIA kernel drivers. I don't lose the image, which means the GPU is still actually powered, since both displays are connected to the card. What's weird, though, is that no USB devices are connected to the card directly. I'd guess that X11 and its input handling crash along with NVIDIA's kernel modules.

The DRM is the most direct connection to the GPU and VRAM, and kernel logs show that it has hung the CPU multiple times before ultimately crashing. These hangs present themselves like the crash when the GPU falls off the bus (including the user input failing), but are temporary (and threatening!).

I have reason to suspect a software bug while the GPU is dropping to lower power states may be responsible.

tedliosu commented 1 month ago

UPDATE - The GPU just fell off the bus again, but this time it wasn't fully idle: I was using the laptop for some Jupyter notebook work and web browsing, with no external monitors plugged in, as I am at school right now. I have attached the output of journalctl -x -b -1 | grep --before-context=40 --after-context=40 NVRM | tail -n+248, which covers the moments right before the error occurred and what happened right after. journalctl_crash_output_fallen_off_bus_10-3-24.txt
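For comparing crashes across boots, it can help to reduce the kernel log to a count per Xid code. The following is a rough sketch that assumes the NVRM line format shown in this thread; xid_summary is a made-up name, not part of any NVIDIA or systemd tool.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print how many
# times each NVRM Xid code appears.
xid_summary() {
    grep -E 'NVRM: Xid' \
        | sed -E 's/.*NVRM: Xid \([^)]*\): ([0-9]+),.*/\1/' \
        | sort -n \
        | uniq -c \
        | awk '{ print "Xid " $2 ": " $1 " event(s)" }'
}

# Example: summarize kernel messages from the previous boot
#   journalctl -k -b -1 | xid_summary
```

Xid 79 is the "fallen off the bus" event and Xid 119 is the GSP RPC timeout quoted in this thread, so the summary shows at a glance which of the two a given boot ended with.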

ziprasidone146939277 commented 1 month ago

Not 100% related, but because of this:

NVRM: Xid (PCI:0000:01:00): 119, pid=2855, name=Xorg, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL

What I am trying is this kernel parameter:

nvidia.NVreg_EnableGpuFirmware=0

And no hangs/freezes for now.

MiyeonLin commented 1 month ago

Easy fix, based on the Arch Wiki:

sudo nano /etc/modprobe.d/nvidia.conf

add options nvidia NVreg_PreserveVideoMemoryAllocations=1 to the file.

Reboot and now suspend.

pjreed commented 1 month ago

Pop!_OS already sets that option in /etc/modprobe.d/system76-power.conf, and that has not fixed this issue for me (nor, I presume, anybody else here who is also using Pop)
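To check which NVIDIA module options are already configured before adding another file, something like the sketch below can list them. The function name is hypothetical, and it only scans a single directory (defaulting to /etc/modprobe.d); distributions may also ship options under /usr/lib/modprobe.d.

```shell
#!/bin/sh
# Hypothetical helper: print every "options nvidia..." line found under a
# modprobe.d directory, ignoring comments, blacklist and alias entries.
nvidia_modprobe_options() {
    dir="${1:-/etc/modprobe.d}"
    grep -rhE '^[[:space:]]*options[[:space:]]+nvidia' "$dir" 2>/dev/null
}

# Examples:
#   nvidia_modprobe_options
#   nvidia_modprobe_options /usr/lib/modprobe.d
# With the driver loaded, the values actually in effect should also be
# readable from /proc/driver/nvidia/params, for example:
#   grep PreserveVideoMemoryAllocations /proc/driver/nvidia/params
```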

Vetpetmon commented 3 weeks ago

NVIDIA themselves, in the 565 release:

Fixed a bug that could cause kernel crashes upon attempting KMS operations through DRM when nvidia_drm was loaded with modeset=0.

nvidia-graphics-drivers-kms.conf

# This file was generated by nvidia-driver-550
# Set value to 0 to disable modesetting
options nvidia-drm modeset=1

system76-power.conf

# Automatically generated by iso
blacklist i2c_nvidia_gpu
alias i2c_nvidia_gpu off

Do I need to set modeset from 1 to 0 to fix this, whenever System76 updates the OS to use 565?

One thing to note is that 565 has an issue with vertex explosions in DX12, and possible GSP firmware issues. My GTX 1660, despite being a Turing chip, reports N/A for the GSP Firmware Version. I'm still on the proprietary 550 drivers.

EDIT 2: I noticed that Pop!_OS doesn't specify fbdev=1 for driver versions above 545, as seen in various posts around the Arch Linux communities. They state it should be paired with modeset=1.

If this happens to be the issue the whole time...

EDIT 3: So, modeset=1 is not even necessary at all for Pop!_OS 22.04, as modeset=1 is only necessary if you are running Wayland as your Display Server, not X11. Pop!_OS 22.04, by default, is running X11.
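Rather than inferring the active settings from the config files, the loaded module's parameters can be read from sysfs. This is a small sketch with a hypothetical function name; it assumes nvidia_drm exposes modeset (and, on newer drivers, fbdev) under /sys/module, which is the usual location for kernel module parameters.

```shell
#!/bin/sh
# Hypothetical helper: print the current value of a loaded module's parameter.
# $1 = module name, $2 = parameter name, $3 = sysfs root (override for testing)
module_param() {
    f="${3:-/sys/module}/$1/parameters/$2"
    if [ -r "$f" ]; then
        cat "$f"
    else
        # Module not loaded, or this driver version has no such parameter
        echo "unavailable"
    fi
}

# Examples:
#   module_param nvidia_drm modeset
#   module_param nvidia_drm fbdev
```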

May I also request that System76 provide an option to use nvidia-open if we have Turing (GTX 1650) or newer GPUs, instead of forcing the proprietary drivers that not even NVIDIA recommends for newer GPUs?

Vetpetmon commented 3 weeks ago

Update: I am beginning to undertake a project to fully switch to Windows 11 on this machine. If no GPU crashes appear on that OS for over 2 weeks of usage, I'll fully confirm it's a driver issue with the Linux OS.

ziprasidone146939277 commented 3 weeks ago

@Vetpetmon have you tried this?

nvidia.NVreg_EnableGpuFirmware=0 It has been working with no issues for me for the last few weeks. I've set it via kernelstub.
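For anyone trying the same thing, the sketch below shows an assumed kernelstub workflow in comments, plus a check that the running kernel really booted with the parameter. booted_with is a made-up helper name, and the cmdline path argument exists only so the check can be exercised against a sample file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the running kernel was booted with the
# exact parameter given.
# $1 = parameter, e.g. nvidia.NVreg_EnableGpuFirmware=0
# $2 = cmdline file (defaults to /proc/cmdline; override for testing)
booted_with() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}

# Assumed workflow on Pop!_OS, where kernelstub manages the boot entry:
#   sudo kernelstub -a "nvidia.NVreg_EnableGpuFirmware=0"   # add, then reboot
#   booted_with "nvidia.NVreg_EnableGpuFirmware=0" && echo "parameter active"
#   sudo kernelstub -d "nvidia.NVreg_EnableGpuFirmware=0"   # remove it again
```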

Vetpetmon commented 3 weeks ago

@ziprasidone146939277 here's what my system says about GSP: GSP Firmware Version : N/A from nvidia-smi -q | grep GSP, which suggests the GSP firmware isn't enabled.

I am on the TU116, aka Turing chip. It should have GSP, so I will try enabling it with sudo kernelstub -a "nvidia.NVreg_EnableGpuFirmware=1"

I will reboot and give results after attempting 4 times to induce the kernel crash. Still on 550.

Quick update: the system booted up perfectly fine after enabling GSP. Result from grep: GSP Firmware Version : 550.67. GSP is now enabled, so I will see whether having none of the firmware accessible was the cause all along.

Update: As of 4:20 PM, there has been no such instance of the GPU falling off of the bus. Maybe not enabling the firmware on a GPU that is very much capable of GSP may have been the issue all along! Jinxed.
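The GSP check above can be wrapped up so the value is easy to log alongside crash reports. This is only a sketch: gsp_firmware_version is a hypothetical name, and it assumes the "GSP Firmware Version : ..." line format quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read `nvidia-smi -q` text on stdin and print only the
# value of the "GSP Firmware Version" field (for example "550.67" or "N/A").
gsp_firmware_version() {
    sed -n 's/^[[:space:]]*GSP Firmware Version[[:space:]]*:[[:space:]]*//p'
}

# Example:
#   nvidia-smi -q | gsp_firmware_version
# "N/A" suggests the GSP firmware is not in use; a version number suggests it is.
```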

jacobcalvert commented 3 weeks ago

Other commenters have noted the same, but I wanted to throw my experience in here in case it helps narrow things down. I'm running Linux Mint 21, kernel 6.9.3, NVIDIA driver v555, on a System76 Gazelle 17. I am using the laptop display and an external monitor via a Thunderbolt dock.

I was frequently and regularly (daily at least) having the GPU fall off the bus causing me to have to hard reboot the box. I noticed it was always when the external monitor went to sleep that this was happening. I saw another comment suggest it might be due to the power state, so I changed the settings so that the external monitor never sleeps and only locks the screen. It's been 6 weeks and the machine has not been rebooted nor has the GPU fallen off the bus, so I feel it is related to the power states.

Vetpetmon commented 1 week ago

I have switched to Windows 11. There have been no instances of the NVIDIA GPU falling off the bus or crashing, even after updating the drivers, unplugging and re-plugging displays, and stopping a screenshare and waiting 10 minutes to test dropping the power states. After more than a week of my usual activities (constantly opening and closing GPU-using programs), there have been no issues involving the GPU at all. Going back to Linux, however, the GPU starts falling off the bus again almost immediately, even with the monitor set to never turn off.

This is confirmed to be a driver issue for the Linux NVIDIA drivers. This is in no way a hardware issue.

cstrahan commented 2 days ago

After I got it back, I continued to have the same problem. I really couldn't afford to be without my work computer for a few more weeks, so I've just gotten used to having my laptop randomly lock up when I'm away from it.

This resonates too well.

Bought a Gazelle ([gaze16]) about 3 years ago. Got used to restarting the laptop each morning.

Now it's usually freezing within 5-30 minutes of starting up. Previous boot log shows the "GPU has fallen off the bus" message at the end.

NVIDIA Driver Version: 560.35.03