tomsom / yoga-linux

Run Linux on the Lenovo Yoga 7 14 (14ARB7) with AMD Ryzen 6800U (Rembrandt).
https://github.com/tomsom/yoga-linux/wiki

Random GPU timeouts #38

Closed by 0x9fff00 1 year ago

0x9fff00 commented 1 year ago

CPU: AMD Ryzen 5 6600U
RAM: 16 GB
Display: 2.2k IPS
BIOS: K5CN40WWT66
Kernel: 6.3.4-arch1 with workaround patch for #9
Distribution: Arch Linux
Desktop environment: KDE Plasma 5.27.5

I have a problem where all open GUI programs randomly freeze, but I can still move the mouse cursor. When this happens, I get errors like this in dmesg:

amdgpu 0000:33:00.0: [drm] *ERROR* [CRTC:72:crtc-0] flip_done timed out

or

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=121570, emitted seq=121572
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
amdgpu 0000:33:00.0: amdgpu: GPU reset begin!
amdgpu 0000:33:00.0: amdgpu: MODE2 reset
amdgpu 0000:33:00.0: amdgpu: GPU reset succeeded, trying to resume
[drm] PCIE GART of 1024M enabled (table at 0x000000F47FC00000).
[drm] PSP is resuming...
[drm] reserve 0xa00000 from 0xf47e000000 for PSP TMR
amdgpu 0000:33:00.0: amdgpu: RAS: optional ras ta ucode is not available
amdgpu 0000:33:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:33:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
amdgpu 0000:33:00.0: amdgpu: SMU is resuming...
amdgpu 0000:33:00.0: amdgpu: SMU is resumed successfully!
[drm] DMUB hardware initialized: version=0x0400002E
[drm] Watermarks table not configured properly by SMU
[drm] kiq ring mec 2 pipe 1 q 0
[drm] VCN decode and encode initialized successfully(under DPG Mode).
[drm] JPEG decode initialized successfully.
amdgpu 0000:33:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
amdgpu 0000:33:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
amdgpu 0000:33:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
amdgpu 0000:33:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
amdgpu 0000:33:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
amdgpu 0000:33:00.0: amdgpu: recover vram bo from shadow start
amdgpu 0000:33:00.0: amdgpu: recover vram bo from shadow done
amdgpu 0000:33:00.0: amdgpu: GPU reset(1) succeeded!

I suspect this may be related to the DPC watchdog violation BSODs on Windows, as both problems seem to come in similar waves for me. Sometimes switching to another TTY and back makes everything work normally again; other times I need to reboot. I also get the following warnings in dmesg, but the freeze doesn't happen until later, sometimes not for several hours:

------------[ cut here ]------------
WARNING: CPU: 6 PID: 173 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_psr.c:123 dmub_psr_get_state+0xc6/0xd0 [amdgpu]
Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device rfcomm xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat br_netfilter bridge stp llc wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel overlay cmac algif_hash algif_skcipher af_alg bnep snd_acp6x_pdm_dma snd_soc_acp6x_mach snd_soc_dmic snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_hda_codec_realtek snd_sof snd_hda_codec_generic snd_sof_utils ledtrig_audio snd_soc_core snd_hda_codec_hdmi snd_compress amdgpu hid_sensor_accel_3d ac97_bus snd_hda_intel mt7921e snd_pcm_dmaengine hid_sensor_trigger snd_intel_dspcfg mt7921_common snd_pci_ps industrialio_triggered_buffer snd_intel_sdw_acpi mt76_connac_lib uvcvideo kfifo_buf snd_rpl_pci_acp6x intel_rapl_msr drm_buddy snd_hda_codec mt76 intel_rapl_common gpu_sched videobuf2_vmalloc hid_sensor_iio_common snd_acp_pci uvc snd_pci_acp6x i2c_algo_bit
 snd_hda_core industrialio joydev videobuf2_memops hid_sensor_custom drm_ttm_helper btusb snd_pci_acp5x snd_hwdep videobuf2_v4l2 edac_mce_amd btrtl ttm snd_pcm snd_rn_pci_acp3x mac80211 btbcm ucsi_acpi snd_acp_config mousedev btintel wacom snd_timer videodev typec_ucsi hid_sensor_hub kvm_amd drm_display_helper snd_soc_acpi ideapad_laptop sp5100_tco vfat btmtk libarc4 usbhid hid_multitouch bluetooth kvm videobuf2_common sparse_keymap ecdh_generic irqbypass mc fat crc16 rapl typec thunderbolt wmi_bmof platform_profile k10temp snd cfg80211 cec snd_pci_acp3x i2c_piix4 soundcore roles rfkill i2c_hid_acpi i2c_hid amd_pmc acpi_cpufreq acpi_tad mac_hid ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_recent xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) pkcs8_key_parser dm_multipath crypto_user fuse loop ip_tables x_tables btrfs blake2b_generic xor
 raid6_pq libcrc32c crc32c_generic dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod serio_raw atkbd crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni libps2 polyval_generic nvme vivaldi_fmap gf128mul ghash_clmulni_intel sha512_ssse3 sdhci_pci aesni_intel cqhci crypto_simd sdhci cryptd nvme_core xhci_pci mmc_core i8042 ccp video xhci_pci_renesas nvme_common serio wmi
CPU: 6 PID: 173 Comm: kworker/6:1H Tainted: G           OE      6.3.4-arch1-1-14arb7 #1 8d2cc948bb10b89d339051b4201f5d88832cfc8c
Hardware name: LENOVO 82QF/LNVNB161216, BIOS K5CN40WWT66 05/04/2023
Workqueue: events_highpri dm_irq_work_func [amdgpu]
RIP: 0010:dmub_psr_get_state+0xc6/0xd0 [amdgpu]
Code: 00 00 74 b4 48 8b 44 24 08 65 48 2b 04 25 28 00 00 00 75 1a 48 83 c4 10 5b 5d 41 5c 41 5d c3 cc cc cc cc 3d ff 00 00 00 75 da <0f> 0b eb d6 e8 81 0c 99 eb 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffff9e3081dc7ca0 EFLAGS: 00010246
RAX: 00000000000000ff RBX: 00000000000003e9 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000001681 RDI: ffff8acb45280000
RBP: ffff8acb00a73800 R08: 0000000000000000 R09: ffff9e313fbe3900
R10: 0000000000000000 R11: fefefefefefefeff R12: 0000000000000000
R13: ffff9e3081dc7cdc R14: ffff8acb115f6b40 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8acdeff80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4de4590000 CR3: 00000001fb4c8000 CR4: 0000000000750ee0
PKRU: 55555554
Call Trace:
 <TASK>
 dmub_psr_enable+0xd4/0x120 [amdgpu ee83c059d2ae5508d03c4d6f9b6ec7d440e8becc]
 dc_link_set_psr_allow_active+0x27e/0x3b0 [amdgpu ee83c059d2ae5508d03c4d6f9b6ec7d440e8becc]
 dc_link_handle_hpd_rx_irq+0x318/0x350 [amdgpu ee83c059d2ae5508d03c4d6f9b6ec7d440e8becc]
 handle_hpd_rx_irq+0xca/0x490 [amdgpu ee83c059d2ae5508d03c4d6f9b6ec7d440e8becc]
 ? __schedule+0x44b/0x1400
 ? blk_mq_run_hw_queues+0x8a/0x110
 process_one_work+0x1c7/0x3d0
 worker_thread+0x51/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xde/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>
---[ end trace 0000000000000000 ]---

and, about 30 minutes later:

------------[ cut here ]------------
WARNING: CPU: 6 PID: 4984 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_psr.c:223 dmub_psr_enable+0x10a/0x120 [amdgpu]
Modules linked in: udp_diag tcp_diag inet_diag ccm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device rfcomm xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat br_netfilter bridge stp llc wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel overlay cmac algif_hash algif_skcipher af_alg bnep snd_acp6x_pdm_dma snd_soc_acp6x_mach snd_soc_dmic snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_hda_codec_realtek snd_sof snd_hda_codec_generic snd_sof_utils ledtrig_audio snd_soc_core snd_hda_codec_hdmi snd_compress amdgpu hid_sensor_accel_3d ac97_bus snd_hda_intel mt7921e snd_pcm_dmaengine hid_sensor_trigger snd_intel_dspcfg mt7921_common snd_pci_ps industrialio_triggered_buffer snd_intel_sdw_acpi mt76_connac_lib uvcvideo kfifo_buf snd_rpl_pci_acp6x intel_rapl_msr drm_buddy snd_hda_codec mt76 intel_rapl_common gpu_sched videobuf2_vmalloc hid_sensor_iio_common snd_acp_pci uvc
 snd_pci_acp6x i2c_algo_bit snd_hda_core industrialio joydev videobuf2_memops hid_sensor_custom drm_ttm_helper btusb snd_pci_acp5x snd_hwdep videobuf2_v4l2 edac_mce_amd btrtl ttm snd_pcm snd_rn_pci_acp3x mac80211 btbcm ucsi_acpi snd_acp_config mousedev btintel wacom snd_timer videodev typec_ucsi hid_sensor_hub kvm_amd drm_display_helper snd_soc_acpi ideapad_laptop sp5100_tco vfat btmtk libarc4 usbhid hid_multitouch bluetooth kvm videobuf2_common sparse_keymap ecdh_generic irqbypass mc fat crc16 rapl typec thunderbolt wmi_bmof platform_profile k10temp snd cfg80211 cec snd_pci_acp3x i2c_piix4 soundcore roles rfkill i2c_hid_acpi i2c_hid amd_pmc acpi_cpufreq acpi_tad mac_hid ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_recent xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) pkcs8_key_parser dm_multipath crypto_user fuse loop ip_tables x_tables
 btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod serio_raw atkbd crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni libps2 polyval_generic nvme vivaldi_fmap gf128mul ghash_clmulni_intel sha512_ssse3 sdhci_pci aesni_intel cqhci crypto_simd sdhci cryptd nvme_core xhci_pci mmc_core i8042 ccp video xhci_pci_renesas nvme_common serio wmi
CPU: 6 PID: 4984 Comm: kworker/6:0H Tainted: G        W  OE      6.3.4-arch1-1-14arb7 #1 8d2cc948bb10b89d339051b4201f5d88832cfc8c
Hardware name: LENOVO 82QF/LNVNB161216, BIOS K5CN40WWT66 05/04/2023
Workqueue: events_highpri dm_irq_work_func [amdgpu]
RIP: 0010:dmub_psr_enable+0x10a/0x120 [amdgpu]
Code: cf 81 fb e8 03 00 00 74 21 48 8b 44 24 48 65 48 2b 04 25 28 00 00 00 75 15 48 83 c4 50 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b eb db e8 4d 0a 99 eb 66 66 2e 0f 1f 84 00 00 00 00 00 66 90
RSP: 0018:ffff9e3082783cd8 EFLAGS: 00010246
RAX: 0000059c384333ec RBX: 00000000000003e9 RCX: 0000000000000006
RDX: 0000000000161d92 RSI: 0000000000161518 RDI: 0000059c382d165a
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9e313fbe3900
R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8acb45739870
R13: 0000000000000000 R14: ffff8acb115f6b40 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8acdeff80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f626f082000 CR3: 00000003cda20000 CR4: 0000000000750ee0
PKRU: 55555554
Call Trace:
 <TASK>
 dc_link_set_psr_allow_active+0x27e/0x3b0 [amdgpu ee83c059d2ae5508d03c4d6f9b6ec7d440e8becc]
 dc_link_handle_hpd_rx_irq+0x318/0x350 [amdgpu ee83c059d2ae5508d03c4d6f9b6ec7d440e8becc]
 handle_hpd_rx_irq+0xca/0x490 [amdgpu ee83c059d2ae5508d03c4d6f9b6ec7d440e8becc]
 ? __schedule+0x44b/0x1400
 process_one_work+0x1c7/0x3d0
 worker_thread+0x51/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xde/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>
---[ end trace 0000000000000000 ]---

If anyone else has this problem and also uses Windows, do you have the BSOD issue?
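
For anyone comparing their own logs against the ones above, here is a rough Python sketch that counts the same signatures in saved dmesg output. The pattern names and the helper functions `classify` and `ring_timeouts` are made up for illustration, and the regular expressions are only derived from the lines quoted in this comment, so other kernel versions may word these messages differently.

```python
import re

# Signatures taken from the dmesg excerpts quoted above.
# The dictionary keys are hypothetical labels, not kernel terminology.
PATTERNS = {
    "flip_done_timeout": re.compile(r"\[CRTC:\d+:[^\]]+\] flip_done timed out"),
    "ring_timeout": re.compile(
        r"ring (?P<ring>\S+) timeout, "
        r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
    ),
    "gpu_reset": re.compile(r"GPU reset begin!"),
    "psr_warning": re.compile(r"WARNING: .*dmub_psr\.c:\d+"),
}


def classify(log_text):
    """Count how many lines of log_text match each known signature."""
    counts = {name: 0 for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def ring_timeouts(log_text):
    """Return (ring, signaled_seq, emitted_seq) for every ring timeout line."""
    results = []
    for match in PATTERNS["ring_timeout"].finditer(log_text):
        results.append(
            (match["ring"], int(match["signaled"]), int(match["emitted"]))
        )
    return results
```

Feeding it the output of `dmesg` (or `journalctl -k`) saved to a file should be enough to tell whether a freeze came with the PSR warnings or only with the timeouts.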

stuarthayhurst commented 1 year ago

It's PSR (Panel Self Refresh) again.

https://gitlab.freedesktop.org/drm/amd/-/issues/2443

0x9fff00 commented 1 year ago

@stuarthayhurst Thanks, that does indeed look like the same issue (and maybe also https://gitlab.freedesktop.org/drm/amd/-/issues/2220). I'll try the patch from https://gitlab.freedesktop.org/drm/amd/-/issues/2443#note_1926743

stuarthayhurst commented 1 year ago

To summarise the upstream thread for anyone else with this issue: update to the latest release (20230625) of the linux-firmware git repository. Alternatively, newer kernels should include a patch that disables PSR-SU on systems with older firmware.
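
A small Python sketch for checking whether an installed firmware package is at least as new as the 20230625 release mentioned above. It assumes the package version string starts with the release date as YYYYMMDD, which is how some distributions version linux-firmware but is not guaranteed everywhere; the function names and the example version strings are hypothetical.

```python
import re
from datetime import date

# Release named in the comment above as containing the fix.
FIXED_RELEASE = date(2023, 6, 25)


def firmware_release_date(version):
    """Parse a leading YYYYMMDD from a package version string.

    Assumption: the version begins with the linux-firmware release date.
    Returns None if the string does not start with a valid date.
    """
    match = re.match(r"(\d{4})(\d{2})(\d{2})", version)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None


def has_fixed_firmware(version):
    """True if the version's date is on or after the 20230625 release."""
    released = firmware_release_date(version)
    return released is not None and released >= FIXED_RELEASE
```

The version string itself would come from the package manager, for example the version field that `pacman -Q linux-firmware` prints on Arch.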

If updates aren't an option, use amdgpu.dcdebugmask=0x10 as a kernel parameter to disable PSR entirely.
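
To confirm the parameter actually took effect after a reboot, something like the following Python sketch can be run against the kernel command line. The function names are hypothetical; the only facts taken from this thread are the parameter name `amdgpu.dcdebugmask` and the value `0x10`. It only inspects the command line, so it will not notice the option being set another way (for example through a modprobe configuration file).

```python
PSR_DISABLE_BIT = 0x10  # value quoted in the comment above


def psr_disabled_by_cmdline(cmdline):
    """Return True if amdgpu.dcdebugmask on the command line sets bit 0x10.

    If the parameter appears more than once, the last parseable
    occurrence is used.
    """
    disabled = False
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key != "amdgpu.dcdebugmask" or not sep:
            continue
        try:
            mask = int(value, 0)  # base 0 accepts both "0x10" and "16"
        except ValueError:
            continue  # ignore values that are not valid integers
        disabled = bool(mask & PSR_DISABLE_BIT)
    return disabled


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On the affected machine, `psr_disabled_by_cmdline(read_kernel_cmdline())` should return `True` once the bootloader entry has been updated and the system rebooted.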

If the firmware doesn't solve it, it's a different issue that needs to be reported upstream.

stuarthayhurst commented 1 year ago

Can this be closed, as a fix is available with newer firmware?

EDIT: A patch to disable PSR-SU on old firmware made it into kernel 6.5, and the upstream issue was closed.

0x9fff00 commented 1 year ago

> Can this be closed, as a fix is available with newer firmware?

Probably. I've installed the new firmware and will reopen this issue if the crashes happen again.