tuxedocomputers / linux

This is a read-only mirror of the GitLab repository https://gitlab.com/tuxedocomputers/development/packages/linux. For contributions and bug reports, please head over to GitLab.

AMD integrated gpu crashing for 6.2 #6

Closed robcxyz closed 12 months ago

robcxyz commented 1 year ago

I have a major problem with my Pulse Gen 2 (5700U): the whole system crashes whenever anything makes heavy use of the GPU. It mainly hits me when I have multiple YouTube tabs open and try to close one, but it now happens multiple times a day, and each time requires a hard restart. I disabled hardware acceleration in Chrome but still get the issue.

The same bug is documented here -> https://gitlab.freedesktop.org/drm/amd/-/issues/2447

Some users are reporting better performance with 6.5, which I was hoping to try out.

Mesa version 23.1.3

Logs:

Sep 11 18:27:45 kernel: [14042.303706] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=532423, emitted seq=532426
Sep 11 18:27:45 kernel: [14042.304545] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 4925 thread gnome-shel:cs0 pid 4971
Sep 11 18:27:45 kernel: [14042.305223] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
Sep 11 18:27:45 kernel: [14042.426391] ------------[ cut here ]------------
Sep 11 18:27:45 kernel: [14042.426396] WARNING: CPU: 7 PID: 55200 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:600 amdgpu_irq_put+0xa4/0xc0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.426753] Modules linked in: xt_nat veth wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm dummy nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nvme_fabrics ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc cmac overlay algif_hash algif_skcipher af_alg bnep snd_hda_codec_realtek snd_sof_amd_rembrandt snd_sof_amd_renoir snd_hda_codec_generic snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp binfmt_misc snd_hda_codec_hdmi snd_sof snd_hda_intel snd_sof_utils snd_intel_dspcfg snd_intel_sdw_acpi nls_iso8859_1 snd_soc_core snd_hda_codec snd_compress joydev ac97_bus iwlmvm snd_hda_core snd_pcm_dmaengine intel_rapl_msr snd_hwdep snd_pci_ps btusb intel_rapl_common snd_rpl_pci_acp6x btrtl snd_acp_pci
Sep 11 18:27:45 kernel: [14042.426803]  snd_seq_midi snd_pci_acp6x uvcvideo edac_mce_amd btbcm snd_seq_midi_event videobuf2_vmalloc btintel uniwill_wmi(OE) mac80211 snd_rawmidi snd_pcm videobuf2_memops kvm_amd btmtk videobuf2_v4l2 libarc4 tuxedo_io(OE) snd_seq clevo_wmi(OE) snd_pci_acp5x input_leds bluetooth asus_wmi videodev kvm tuxedo_keyboard(OE) iwlwifi videobuf2_common snd_seq_device led_class_multicolor snd_rn_pci_acp3x ledtrig_audio ecdh_generic irqbypass ecc mc rapl hid_multitouch serio_raw platform_profile wmi_bmof snd_timer sparse_keymap snd_acp_config cfg80211 snd_soc_acpi snd k10temp snd_pci_acp3x soundcore ccp mac_hid amd_pmc sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c dm_crypt dm_mirror dm_region_hash dm_log amdgpu iommu_v2 drm_buddy gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec rc_core nvme drm_kms_helper syscopyarea r8169 sysfillrect nvme_core
Sep 11 18:27:45 kernel: [14042.426862]  hid_generic sysimgblt ucsi_acpi crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd drm cryptd typec_ucsi xhci_pci i2c_piix4 xhci_pci_renesas nvme_common realtek typec video i2c_hid_acpi i2c_hid wmi hid
Sep 11 18:27:45 kernel: [14042.426881] CPU: 7 PID: 55200 Comm: kworker/u32:41 Tainted: G        W  OE      6.2.0-10018-tuxedo #23
Sep 11 18:27:45 kernel: [14042.426883] Hardware name: TUXEDO TUXEDO Pulse 15 Gen2/PF5LUXG, BIOS N.1.06A06 06/16/2022
Sep 11 18:27:45 kernel: [14042.426886] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Sep 11 18:27:45 kernel: [14042.426893] RIP: 0010:amdgpu_irq_put+0xa4/0xc0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.427178] Code: 89 d6 89 d7 e9 3d cf 9e d4 44 89 ea 4c 89 e6 4c 89 f7 e8 8f fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 89 d6 89 d7 e9 1c cf 9e d4 <0f> 0b b8 ea ff ff ff eb c3 b8 ea ff ff ff eb bc b8 fe ff ff ff eb
Sep 11 18:27:45 kernel: [14042.427179] RSP: 0018:ffffbb17848dfbf8 EFLAGS: 00010246
Sep 11 18:27:45 kernel: [14042.427181] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Sep 11 18:27:45 kernel: [14042.427182] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Sep 11 18:27:45 kernel: [14042.427183] RBP: ffffbb17848dfc18 R08: 0000000000000000 R09: 0000000000000000
Sep 11 18:27:45 kernel: [14042.427184] R10: 0000000000000000 R11: 0000000000000000 R12: ffff918be6730370
Sep 11 18:27:45 kernel: [14042.427185] R13: 0000000000000000 R14: ffff918be6720000 R15: ffff918be6720000
Sep 11 18:27:45 kernel: [14042.427187] FS:  0000000000000000(0000) GS:ffff919a8e5c0000(0000) knlGS:0000000000000000
Sep 11 18:27:45 kernel: [14042.427188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 18:27:45 kernel: [14042.427189] CR2: 000007f6008e5000 CR3: 00000000bb010000 CR4: 0000000000350ee0
Sep 11 18:27:45 kernel: [14042.427190] Call Trace:
Sep 11 18:27:45 kernel: [14042.427192]  <TASK>
Sep 11 18:27:45 kernel: [14042.427196]  sdma_v4_0_hw_fini+0x41/0xc0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.427453]  sdma_v4_0_suspend+0x2c/0x60 [amdgpu]
Sep 11 18:27:45 kernel: [14042.427669]  amdgpu_device_ip_suspend_phase2+0x25d/0x490 [amdgpu]
Sep 11 18:27:45 kernel: [14042.427872]  amdgpu_device_ip_suspend+0x41/0x80 [amdgpu]
Sep 11 18:27:45 kernel: [14042.428075]  amdgpu_device_pre_asic_reset+0xd6/0x4a0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.428279]  amdgpu_device_gpu_recover+0x49f/0xa20 [amdgpu]
Sep 11 18:27:45 kernel: [14042.428469]  amdgpu_job_timedout+0x13a/0x200 [amdgpu]
Sep 11 18:27:45 kernel: [14042.428686]  drm_sched_job_timedout+0x6d/0x120 [gpu_sched]
Sep 11 18:27:45 kernel: [14042.428691]  process_one_work+0x21f/0x440
Sep 11 18:27:45 kernel: [14042.428697]  worker_thread+0x50/0x3f0
Sep 11 18:27:45 kernel: [14042.428699]  ? __pfx_worker_thread+0x10/0x10
Sep 11 18:27:45 kernel: [14042.428701]  kthread+0xee/0x120
Sep 11 18:27:45 kernel: [14042.428704]  ? __pfx_kthread+0x10/0x10
Sep 11 18:27:45 kernel: [14042.428706]  ret_from_fork+0x2c/0x50
Sep 11 18:27:45 kernel: [14042.428710]  </TASK>
Sep 11 18:27:45 kernel: [14042.428710] ---[ end trace 0000000000000000 ]---
Sep 11 18:27:45 kernel: [14042.429420] ------------[ cut here ]------------
Sep 11 18:27:45 kernel: [14042.429421] WARNING: CPU: 7 PID: 55200 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:600 amdgpu_irq_put+0xa4/0xc0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.429629] Modules linked in: xt_nat veth wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm dummy nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nvme_fabrics ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc cmac overlay algif_hash algif_skcipher af_alg bnep snd_hda_codec_realtek snd_sof_amd_rembrandt snd_sof_amd_renoir snd_hda_codec_generic snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp binfmt_misc snd_hda_codec_hdmi snd_sof snd_hda_intel snd_sof_utils snd_intel_dspcfg snd_intel_sdw_acpi nls_iso8859_1 snd_soc_core snd_hda_codec snd_compress joydev ac97_bus iwlmvm snd_hda_core snd_pcm_dmaengine intel_rapl_msr snd_hwdep snd_pci_ps btusb intel_rapl_common snd_rpl_pci_acp6x btrtl snd_acp_pci
Sep 11 18:27:45 kernel: [14042.429662]  snd_seq_midi snd_pci_acp6x uvcvideo edac_mce_amd btbcm snd_seq_midi_event videobuf2_vmalloc btintel uniwill_wmi(OE) mac80211 snd_rawmidi snd_pcm videobuf2_memops kvm_amd btmtk videobuf2_v4l2 libarc4 tuxedo_io(OE) snd_seq clevo_wmi(OE) snd_pci_acp5x input_leds bluetooth asus_wmi videodev kvm tuxedo_keyboard(OE) iwlwifi videobuf2_common snd_seq_device led_class_multicolor snd_rn_pci_acp3x ledtrig_audio ecdh_generic irqbypass ecc mc rapl hid_multitouch serio_raw platform_profile wmi_bmof snd_timer sparse_keymap snd_acp_config cfg80211 snd_soc_acpi snd k10temp snd_pci_acp3x soundcore ccp mac_hid amd_pmc sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c dm_crypt dm_mirror dm_region_hash dm_log amdgpu iommu_v2 drm_buddy gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec rc_core nvme drm_kms_helper syscopyarea r8169 sysfillrect nvme_core
Sep 11 18:27:45 kernel: [14042.429699]  hid_generic sysimgblt ucsi_acpi crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd drm cryptd typec_ucsi xhci_pci i2c_piix4 xhci_pci_renesas nvme_common realtek typec video i2c_hid_acpi i2c_hid wmi hid
Sep 11 18:27:45 kernel: [14042.429711] CPU: 7 PID: 55200 Comm: kworker/u32:41 Tainted: G        W  OE      6.2.0-10018-tuxedo #23
Sep 11 18:27:45 kernel: [14042.429713] Hardware name: TUXEDO TUXEDO Pulse 15 Gen2/PF5LUXG, BIOS N.1.06A06 06/16/2022
Sep 11 18:27:45 kernel: [14042.429714] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Sep 11 18:27:45 kernel: [14042.429719] RIP: 0010:amdgpu_irq_put+0xa4/0xc0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.429915] Code: 89 d6 89 d7 e9 3d cf 9e d4 44 89 ea 4c 89 e6 4c 89 f7 e8 8f fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 89 d6 89 d7 e9 1c cf 9e d4 <0f> 0b b8 ea ff ff ff eb c3 b8 ea ff ff ff eb bc b8 fe ff ff ff eb
Sep 11 18:27:45 kernel: [14042.429917] RSP: 0018:ffffbb17848dfc08 EFLAGS: 00010246
Sep 11 18:27:45 kernel: [14042.429919] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Sep 11 18:27:45 kernel: [14042.429920] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Sep 11 18:27:45 kernel: [14042.429921] RBP: ffffbb17848dfc28 R08: 0000000000000000 R09: 0000000000000000
Sep 11 18:27:45 kernel: [14042.429922] R10: 0000000000000000 R11: 0000000000000000 R12: ffff918be672bef0
Sep 11 18:27:45 kernel: [14042.429923] R13: 0000000000000000 R14: ffff918be6720000 R15: ffff918be6720000
Sep 11 18:27:45 kernel: [14042.429925] FS:  0000000000000000(0000) GS:ffff919a8e5c0000(0000) knlGS:0000000000000000
Sep 11 18:27:45 kernel: [14042.429926] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 18:27:45 kernel: [14042.429927] CR2: 000007f6008e5000 CR3: 00000000bb010000 CR4: 0000000000350ee0
Sep 11 18:27:45 kernel: [14042.429929] Call Trace:
Sep 11 18:27:45 kernel: [14042.429930]  <TASK>
Sep 11 18:27:45 kernel: [14042.429931]  gfx_v9_0_hw_fini+0x1f/0x350 [amdgpu]
Sep 11 18:27:45 kernel: [14042.430129]  gfx_v9_0_suspend+0xe/0x20 [amdgpu]
Sep 11 18:27:45 kernel: [14042.430362]  amdgpu_device_ip_suspend_phase2+0x25d/0x490 [amdgpu]
Sep 11 18:27:45 kernel: [14042.430609]  amdgpu_device_ip_suspend+0x41/0x80 [amdgpu]
Sep 11 18:27:45 kernel: [14042.430864]  amdgpu_device_pre_asic_reset+0xd6/0x4a0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.431124]  amdgpu_device_gpu_recover+0x49f/0xa20 [amdgpu]
Sep 11 18:27:45 kernel: [14042.431392]  amdgpu_job_timedout+0x13a/0x200 [amdgpu]
Sep 11 18:27:45 kernel: [14042.431657]  drm_sched_job_timedout+0x6d/0x120 [gpu_sched]
Sep 11 18:27:45 kernel: [14042.431663]  process_one_work+0x21f/0x440
Sep 11 18:27:45 kernel: [14042.431667]  worker_thread+0x50/0x3f0
Sep 11 18:27:45 kernel: [14042.431669]  ? __pfx_worker_thread+0x10/0x10
Sep 11 18:27:45 kernel: [14042.431671]  kthread+0xee/0x120
Sep 11 18:27:45 kernel: [14042.431674]  ? __pfx_kthread+0x10/0x10
Sep 11 18:27:45 kernel: [14042.431676]  ret_from_fork+0x2c/0x50
Sep 11 18:27:45 kernel: [14042.431681]  </TASK>
Sep 11 18:27:45 kernel: [14042.431681] ---[ end trace 0000000000000000 ]---
Sep 11 18:27:45 kernel: [14042.440809] [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
Sep 11 18:27:45 kernel: [14042.467566] ------------[ cut here ]------------
Sep 11 18:27:45 kernel: [14042.467573] WARNING: CPU: 7 PID: 55200 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:600 amdgpu_irq_put+0xa4/0xc0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.468191] Modules linked in: xt_nat veth wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm dummy nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nvme_fabrics ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc cmac overlay algif_hash algif_skcipher af_alg bnep snd_hda_codec_realtek snd_sof_amd_rembrandt snd_sof_amd_renoir snd_hda_codec_generic snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp binfmt_misc snd_hda_codec_hdmi snd_sof snd_hda_intel snd_sof_utils snd_intel_dspcfg snd_intel_sdw_acpi nls_iso8859_1 snd_soc_core snd_hda_codec snd_compress joydev ac97_bus iwlmvm snd_hda_core snd_pcm_dmaengine intel_rapl_msr snd_hwdep snd_pci_ps btusb intel_rapl_common snd_rpl_pci_acp6x btrtl snd_acp_pci
Sep 11 18:27:45 kernel: [14042.468290]  snd_seq_midi snd_pci_acp6x uvcvideo edac_mce_amd btbcm snd_seq_midi_event videobuf2_vmalloc btintel uniwill_wmi(OE) mac80211 snd_rawmidi snd_pcm videobuf2_memops kvm_amd btmtk videobuf2_v4l2 libarc4 tuxedo_io(OE) snd_seq clevo_wmi(OE) snd_pci_acp5x input_leds bluetooth asus_wmi videodev kvm tuxedo_keyboard(OE) iwlwifi videobuf2_common snd_seq_device led_class_multicolor snd_rn_pci_acp3x ledtrig_audio ecdh_generic irqbypass ecc mc rapl hid_multitouch serio_raw platform_profile wmi_bmof snd_timer sparse_keymap snd_acp_config cfg80211 snd_soc_acpi snd k10temp snd_pci_acp3x soundcore ccp mac_hid amd_pmc sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c dm_crypt dm_mirror dm_region_hash dm_log amdgpu iommu_v2 drm_buddy gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec rc_core nvme drm_kms_helper syscopyarea r8169 sysfillrect nvme_core
Sep 11 18:27:45 kernel: [14042.468404]  hid_generic sysimgblt ucsi_acpi crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd drm cryptd typec_ucsi xhci_pci i2c_piix4 xhci_pci_renesas nvme_common realtek typec video i2c_hid_acpi i2c_hid wmi hid
Sep 11 18:27:45 kernel: [14042.468438] CPU: 7 PID: 55200 Comm: kworker/u32:41 Tainted: G        W  OE      6.2.0-10018-tuxedo #23
Sep 11 18:27:45 kernel: [14042.468443] Hardware name: TUXEDO TUXEDO Pulse 15 Gen2/PF5LUXG, BIOS N.1.06A06 06/16/2022
Sep 11 18:27:45 kernel: [14042.468448] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Sep 11 18:27:45 kernel: [14042.468463] RIP: 0010:amdgpu_irq_put+0xa4/0xc0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.469054] Code: 89 d6 89 d7 e9 3d cf 9e d4 44 89 ea 4c 89 e6 4c 89 f7 e8 8f fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 89 d6 89 d7 e9 1c cf 9e d4 <0f> 0b b8 ea ff ff ff eb c3 b8 ea ff ff ff eb bc b8 fe ff ff ff eb
Sep 11 18:27:45 kernel: [14042.469058] RSP: 0018:ffffbb17848dfc18 EFLAGS: 00010246
Sep 11 18:27:45 kernel: [14042.469062] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Sep 11 18:27:45 kernel: [14042.469065] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Sep 11 18:27:45 kernel: [14042.469067] RBP: ffffbb17848dfc38 R08: 0000000000000000 R09: 0000000000000000
Sep 11 18:27:45 kernel: [14042.469069] R10: 0000000000000000 R11: 0000000000000000 R12: ffff918be67224d8
Sep 11 18:27:45 kernel: [14042.469071] R13: 0000000000000000 R14: ffff918be6720000 R15: ffff918be6720000
Sep 11 18:27:45 kernel: [14042.469074] FS:  0000000000000000(0000) GS:ffff919a8e5c0000(0000) knlGS:0000000000000000
Sep 11 18:27:45 kernel: [14042.469078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 18:27:45 kernel: [14042.469080] CR2: 000007f6008e5000 CR3: 00000000bb010000 CR4: 0000000000350ee0
Sep 11 18:27:45 kernel: [14042.469084] Call Trace:
Sep 11 18:27:45 kernel: [14042.469086]  <TASK>
Sep 11 18:27:45 kernel: [14042.469092]  gmc_v9_0_hw_fini+0x6a/0xb0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.469691]  gmc_v9_0_suspend+0xe/0x20 [amdgpu]
Sep 11 18:27:45 kernel: [14042.470276]  amdgpu_device_ip_suspend_phase2+0x25d/0x490 [amdgpu]
Sep 11 18:27:45 kernel: [14042.470859]  amdgpu_device_ip_suspend+0x41/0x80 [amdgpu]
Sep 11 18:27:45 kernel: [14042.471407]  amdgpu_device_pre_asic_reset+0xd6/0x4a0 [amdgpu]
Sep 11 18:27:45 kernel: [14042.471955]  amdgpu_device_gpu_recover+0x49f/0xa20 [amdgpu]
Sep 11 18:27:45 kernel: [14042.472501]  amdgpu_job_timedout+0x13a/0x200 [amdgpu]
Sep 11 18:27:45 kernel: [14042.473174]  drm_sched_job_timedout+0x6d/0x120 [gpu_sched]
Sep 11 18:27:45 kernel: [14042.473190]  process_one_work+0x21f/0x440
Sep 11 18:27:45 kernel: [14042.473201]  worker_thread+0x50/0x3f0
Sep 11 18:27:45 kernel: [14042.473206]  ? __pfx_worker_thread+0x10/0x10
Sep 11 18:27:45 kernel: [14042.473211]  kthread+0xee/0x120
Sep 11 18:27:45 kernel: [14042.473218]  ? __pfx_kthread+0x10/0x10
Sep 11 18:27:45 kernel: [14042.473224]  ret_from_fork+0x2c/0x50
Sep 11 18:27:45 kernel: [14042.473233]  </TASK>
Sep 11 18:27:45 kernel: [14042.473235] ---[ end trace 0000000000000000 ]---
Sep 11 18:27:45 kernel: [14042.473291] amdgpu 0000:05:00.0: amdgpu: MODE2 reset
Sep 11 18:27:45 kernel: [14042.473801] amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
Sep 11 18:27:45 kernel: [14042.474175] [drm] PCIE GART of 1024M enabled.
Sep 11 18:27:45 kernel: [14042.474179] [drm] PTB located at 0x000000F41FC00000
Sep 11 18:27:45 kernel: [14042.474304] [drm] PSP is resuming...
Sep 11 18:27:46 kernel: [14043.317949] [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
Sep 11 18:27:46 kernel: [14043.596040] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Sep 11 18:27:46 kernel: [14043.604886] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Sep 11 18:27:46 kernel: [14043.609498] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
Sep 11 18:27:46 kernel: [14043.609720] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
Sep 11 18:27:46 kernel: [14043.609725] amdgpu 0000:05:00.0: amdgpu: Secure display: Generic Failure.
Sep 11 18:27:46 kernel: [14043.609731] amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
Sep 11 18:27:46 kernel: [14043.609738] amdgpu 0000:05:00.0: amdgpu: SMU is resuming...
Sep 11 18:27:46 kernel: [14043.611810] amdgpu 0000:05:00.0: amdgpu: SMU is resumed successfully!
Sep 11 18:27:46 kernel: [14043.612265] [drm] DMUB hardware initialized: version=0x01010026
Sep 11 18:27:47 kernel: [14044.146259] [drm] kiq ring mec 2 pipe 1 q 0
Sep 11 18:27:47 kernel: [14044.387558] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Sep 11 18:27:47 kernel: [14044.388143] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
Sep 11 18:27:47 kernel: [14044.388763] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
Sep 11 18:27:47 kernel: [14044.389347] amdgpu 0000:05:00.0: amdgpu: GPU reset(2) failed
Sep 11 18:27:47 kernel: [14044.389637] amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
Sep 11 18:27:47 kernel: [14044.389648] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110

robcxyz commented 1 year ago

Hi, just an update on this issue: it is still not resolved, but it is not nearly as bad as before since I opened the laptop (Pulse Gen 2) and removed and cleaned the RAM / SSD connections. Originally the computer was crashing multiple times a day; now it is down to roughly once, or at most twice, a day.

I would still keep this open, since there seems to be an error in the AMD drivers per the linked issue (kernel devs are actively working on it), but it is not as severe as I initially reported.

robcxyz commented 1 year ago

Another update: the system is now basically unusable under heavy load. For some reason alt+tab is always the trigger, and since that is fully baked into my muscle memory and part of my second-by-second workflow, it is pretty hard not to crash the system.

Please, any suggestions here?

Matombo commented 1 year ago

Hi, this might sound a bit unrelated, but can you try unplugging and replugging your NVMe and RAM a couple of times and see if that makes a difference?

robcxyz commented 1 year ago

@Matombo - Thank you so much for getting back to me. Really appreciated, since this has been a multiple-times-a-day issue...

So I actually did that already, and indeed it seemed to help initially, but then the bug came back. I did it because I saw some other errors relating to losing write permissions to my drive, so I thought the problem was the drive's contacts, not the GPU itself. It was quite annoying, though, as I'd lose about 10 min of work since my IDE was not saving changes.

I'll give that another try, though, and report back on this thread. Thanks again for the help.

robcxyz commented 1 year ago

@Matombo - Really jiggled everything around and firmly set the RAM / NVMe in their places. Still have the issue.

robcxyz commented 12 months ago

FYI - I think the new 6.5 kernel fixed this. The system has been stable today, with no crashes since. I will close this issue if I survive another couple of days without a crash.

robcxyz commented 12 months ago

Yeah, I am basically sure the 6.5 kernel fixed this issue now. Closing.
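
For anyone landing here later, a small Python sketch for checking whether the running kernel is at least 6.5. The 6.5 threshold comes only from the experience reported in this thread, not from a confirmed fix commit, and the helper names `kernel_major_minor` and `is_at_least` are made up for illustration.

```python
import platform
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '6.2.0-10018-tuxedo'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_at_least(release, minimum=(6, 5)):
    """Return True if the release is at or above the (major, minor) minimum."""
    return kernel_major_minor(release) >= minimum


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r` on Linux.
    running = platform.release()
    print(running, "->", "6.5 or newer" if is_at_least(running) else "older than 6.5")
```

Comparing tuples rather than strings avoids the trap where "6.10" would sort before "6.5" lexically.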