Closed codebam closed 2 years ago
Normally, driver crashes are a clear indicator of driver and sometimes firmware issues; upp does nothing other than change some values in the table. It is the driver that interprets these changes and tells the SMU/PMU firmware what to do. Can you isolate a particular value change that is causing the crash?
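For context, "changing some values in the table" is literally an in-place byte patch of the binary pp_table blob. A minimal sketch of that idea; the offset and field here are made up for illustration and are not the real SocketPowerLimitAc location:

```python
import struct

def patch_u16(table: bytes, offset: int, value: int) -> bytes:
    """Return a copy of the table with a little-endian u16 patched at offset."""
    return table[:offset] + struct.pack("<H", value) + table[offset + 2:]

# Dummy 8-byte blob; set a hypothetical power-limit field at offset 4 to 303.
blob = bytes(8)
patched = patch_u16(blob, 4, 303)
assert struct.unpack_from("<H", patched, 4)[0] == 303
```

The driver re-parses the table on write, so which patched field (if any) upsets the SMU is exactly what bisecting value changes would reveal.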
Also, kernel, driver and GPU firmware versions? Dmesg output on crash?
It doesn't crash every time; this time it was on the 2nd run, but typically it crashes within 5 runs (with 2 seconds in between).
Kernel: 5.18.9-200.fc36.x86_64
OpenGL renderer string: AMD Radeon RX 6900 XT (sienna_cichlid, LLVM 14.0.0, DRM 3.46, 5.18.9-200.fc36.x86_64)
OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.1.3
sudo chmod o+w /sys/class/drm/card0/device/pp_table
upp -d set --write \
smc_pptable/SocketPowerLimitAc/0=303
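While experimenting like this, it is worth saving the stock table first so it can be restored without a reboot. A minimal sketch; the real sysfs path is the card0 one above, but a throwaway file stands in here so the snippet runs without the GPU:

```python
import tempfile
from pathlib import Path

# Real path on this system: /sys/class/drm/card0/device/pp_table
# (a throwaway file stands in so the sketch runs anywhere).
table = Path(tempfile.mkdtemp()) / "pp_table"
table.write_bytes(b"\x01\x02\x03\x04")  # pretend stock table contents

backup = table.read_bytes()             # 1. back up before running upp
table.write_bytes(b"\x01\x02\xff\x04")  # 2. upp patches a value in place
table.write_bytes(backup)               # 3. restore the saved table on trouble
assert table.read_bytes() == backup
```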
Dmesg (crashed on 2nd run):
[ 111.554219] amdgpu 0000:0b:00.0: amdgpu: use vbios provided pptable
[ 111.561128] amdgpu 0000:0b:00.0: amdgpu: SMU is initialized successfully!
[ 115.878745] amdgpu 0000:0b:00.0: amdgpu: use vbios provided pptable
[ 120.478046] amdgpu 0000:0b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000006 SMN_C2PMSG_82:0x00000000
[ 120.478049] amdgpu 0000:0b:00.0: amdgpu: Failed to enable requested dpm features!
[ 120.478050] amdgpu 0000:0b:00.0: amdgpu: Failed to setup smc hw!
[ 120.478051] amdgpu 0000:0b:00.0: amdgpu: smu reset failed, ret = -62
[ 121.267338] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 121.523495] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 126.397379] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3054, emitted seq=3056
[ 126.397655] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 2157 thread kwin_wayla:cs0 pid 2191
[ 126.397914] amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
[ 126.398414] ------------[ cut here ]------------
[ 126.398416] amdgpu 0000:0b:00.0: SMU uninitialized but power ungate requested for 6!
[ 126.398450] WARNING: CPU: 2 PID: 590 at drivers/gpu/drm/amd/amdgpu/../pm/swsmu/amdgpu_smu.c:224 smu_dpm_set_power_gate+0x188/0x1a0 [amdgpu]
[ 126.398729] Modules linked in: uinput snd_seq_dummy rfcomm snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr sunrpc bnep nct6775 hwmon_vid vfat fat snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr intel_rapl_common ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi edac_mce_amd snd_usb_audio btusb snd_hda_codec btrtl btbcm snd_usbmidi_lib asus_ec_sensors snd_hda_core btintel snd_rawmidi snd_hwdep snd_seq btmtk snd_seq_device kvm eeepc_wmi bluetooth snd_pcm asus_wmi sparse_keymap platform_profile irqbypass rapl video wmi_bmof pcspkr igb snd_timer k10temp snd ecdh_generic joydev rfkill soundcore dca acpi_cpufreq v4l2loopback(OE) videodev mc i2c_piix4 i2c_dev zram dm_crypt dm_multipath amdgpu iommu_v2 crct10dif_pclmul
[ 126.398783] gpu_sched crc32_pclmul crc32c_intel ccp mxm_wmi drm_dp_helper ghash_clmulni_intel nvme drm_ttm_helper ttm sp5100_tco nvme_core wmi uas usb_storage ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse
[ 126.398797] CPU: 2 PID: 590 Comm: kworker/u64:5 Tainted: G OE 5.18.9-200.fc36.x86_64 #1
[ 126.398801] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 4403 04/27/2022
[ 126.398803] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[ 126.398810] RIP: 0010:smu_dpm_set_power_gate+0x188/0x1a0 [amdgpu]
[ 126.399086] Code: 85 ed 75 03 48 8b 2f 89 74 24 04 e8 92 1d 26 f3 44 8b 44 24 04 48 89 d9 48 89 ea 48 89 c6 48 c7 c7 e8 c8 9e c0 e8 b3 5f 66 f3 <0f> 0b b8 a1 ff ff ff e9 e4 fe ff ff e9 2f 92 21 00 e9 2a 92 21 00
[ 126.399088] RSP: 0018:ffffbc05c1247c08 EFLAGS: 00010296
[ 126.399091] RAX: 0000000000000048 RBX: ffffffffc0a31570 RCX: 0000000000000000
[ 126.399092] RDX: 0000000000000001 RSI: ffffffffb466ee24 RDI: 00000000ffffffff
[ 126.399094] RBP: ffffa0e101924440 R08: 0000000000000000 R09: ffffbc05c1247a40
[ 126.399096] R10: 0000000000000003 R11: ffffffffb4f453e8 R12: 0000000000000000
[ 126.399097] R13: ffffa0e118d87ba0 R14: ffffa0e118d88d40 R15: 0000000000000001
[ 126.399099] FS: 0000000000000000(0000) GS:ffffa0e80ea80000(0000) knlGS:0000000000000000
[ 126.399101] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 126.399103] CR2: 00007f7d877f9f50 CR3: 00000001a72bc000 CR4: 0000000000350ee0
[ 126.399105] Call Trace:
[ 126.399109] <TASK>
[ 126.399112] amdgpu_dpm_set_powergating_by_smu+0x84/0xe0 [amdgpu]
[ 126.399395] amdgpu_gfx_off_ctrl+0xc5/0x120 [amdgpu]
[ 126.399623] gfx_v10_0_set_powergating_state+0x53/0x200 [amdgpu]
[ 126.399845] amdgpu_device_set_pg_state+0x92/0xe0 [amdgpu]
[ 126.400060] ? wait_task_inactive+0x119/0x170
[ 126.400065] amdgpu_device_ip_suspend_phase1+0x1a/0xc0 [amdgpu]
[ 126.400273] ? drm_sched_increase_karma_ext+0x88/0xc0 [gpu_sched]
[ 126.400278] amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[ 126.400487] amdgpu_device_pre_asic_reset+0xbe/0x260 [amdgpu]
[ 126.400696] amdgpu_device_gpu_recover_imp.cold+0x585/0x8ae [amdgpu]
[ 126.400981] amdgpu_job_timedout+0x153/0x190 [amdgpu]
[ 126.401222] ? __switch_to+0x106/0x420
[ 126.401228] drm_sched_job_timedout+0x72/0x100 [gpu_sched]
[ 126.401234] process_one_work+0x1c7/0x380
[ 126.401238] worker_thread+0x4d/0x380
[ 126.401241] ? _raw_spin_lock_irqsave+0x23/0x50
[ 126.401245] ? process_one_work+0x380/0x380
[ 126.401247] kthread+0xe9/0x110
[ 126.401250] ? kthread_complete_and_exit+0x20/0x20
[ 126.401253] ret_from_fork+0x22/0x30
[ 126.401259] </TASK>
[ 126.401260] ---[ end trace 0000000000000000 ]---
[ 126.908899] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 129.500639] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 138.833893] amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 138.834007] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[ 139.091014] amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 139.091123] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 139.347199] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
> It doesn't crash every time; this time it was on the 2nd run, but typically it crashes within 5 runs (with 2 seconds in between).
Does it ever crash on 1st attempt after cold-boot? Have you tried longer intervals between runs, like, say, 30s? Are you running any other apps or tools that are modifying the amdgpu power states?
Judging by
[ 115.878745] amdgpu 0000:0b:00.0: amdgpu: use vbios provided pptable
[ 120.478046] amdgpu 0000:0b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000006 SMN_C2PMSG_82:0x00000000
[ 120.478049] amdgpu 0000:0b:00.0: amdgpu: Failed to enable requested dpm features!
[ 120.478050] amdgpu 0000:0b:00.0: amdgpu: Failed to setup smc hw!
[ 120.478051] amdgpu 0000:0b:00.0: amdgpu: smu reset failed, ret = -62
upp replaced the table successfully, but the SMU got stuck for ~5 s applying the change, and that ultimately crashed the GPU:
[ 126.397379] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3054, emitted seq=3056
[ 126.397655] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 2157 thread kwin_wayla:cs0 pid 2191
[ 126.397914] amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
[ 126.398414] ------------[ cut here ]------------
[ 126.398416] amdgpu 0000:0b:00.0: SMU uninitialized but power ungate requested for 6!
[ 126.398450] WARNING: CPU: 2 PID: 590 at drivers/gpu/drm/amd/amdgpu/../pm/swsmu/amdgpu_smu.c:224 smu_dpm_set_power_gate+0x188/0x1a0 [amdgpu]
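As a side note for anyone searching these logs: the ret = -62 above is -ETIME ("Timer expired", i.e. the SMU message timed out), and the later ring tests fail with -110, which is -ETIMEDOUT. A quick check of the Linux error codes with Python's errno module:

```python
import errno
import os

# Linux error codes seen in the dmesg above:
print(errno.ETIME, os.strerror(errno.ETIME))          # 62 Timer expired
print(errno.ETIMEDOUT, os.strerror(errno.ETIMEDOUT))  # 110 Connection timed out
```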
Likely to be a driver or SMU firmware issue. Not much I can do to fix this; if you feel adventurous, please report a kernel bug or look for similar ones already reported.
> Does it ever crash on 1st attempt after cold-boot?
Doesn't appear to, no
> Have you tried longer intervals between runs, like, say, 30s?
I tried 1 or 2 minutes, but it still crashed on the 2nd run.
I see, okay. Thank you.
> Does it ever crash on 1st attempt after cold-boot? Have you tried longer intervals between runs, like, say, 30s? Are you running any other apps or tools that are modifying the amdgpu power states?
Kind of replying to a closed issue, but for me it crashes no matter when. Trying to apply an upp table even right after booting results in a black screen for me, with a similar error message.
It is very likely a kernel bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2060
Applying any settings to my 6900 XT, even the same power limit and TDP that were already set, causes graphics to become unresponsive and crash, requiring a hard reboot.