Closed codebam closed 2 years ago
Normally, driver crashes are a clear indicator of driver and sometimes firmware issues; upp does nothing other than change some values in the table. It is the driver that interprets these changes and tells the SMU/PMU firmware what to do. Can you isolate a particular value change that is causing the crash?
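For context, "changing some values in the table" is literally an in-place byte patch of the binary pp_table blob. A minimal sketch of that idea; the offset and field here are made up for illustration and are not the real SocketPowerLimitAc location:

```python
import struct

def patch_u16(table: bytes, offset: int, value: int) -> bytes:
    """Return a copy of the table with a little-endian u16 patched at offset."""
    return table[:offset] + struct.pack("<H", value) + table[offset + 2:]

# Dummy 8-byte blob; set a hypothetical power-limit field at offset 4 to 303.
blob = bytes(8)
patched = patch_u16(blob, 4, 303)
assert struct.unpack_from("<H", patched, 4)[0] == 303
```

The driver re-parses the table on write, so which patched field (if any) upsets the SMU is exactly what bisecting value changes would reveal.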
Also, kernel, driver and GPU firmware versions? Dmesg output on crash?
It doesn't crash every time; this time it was on the 2nd run, but typically it crashes within 5 runs (with 2 seconds in between).
Kernel: 5.18.9-200.fc36.x86_64
OpenGL renderer string: AMD Radeon RX 6900 XT (sienna_cichlid, LLVM 14.0.0, DRM 3.46, 5.18.9-200.fc36.x86_64)
OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.1.3
sudo chmod o+w /sys/class/drm/card0/device/pp_table
upp -d set --write \
smc_pptable/SocketPowerLimitAc/0=303
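While experimenting like this, it is worth saving the stock table first so it can be restored without a reboot. A minimal sketch; the real sysfs path is the card0 one above, but a throwaway file stands in here so the snippet runs without the GPU:

```python
import tempfile
from pathlib import Path

# Real path on this system: /sys/class/drm/card0/device/pp_table
# (a throwaway file stands in so the sketch runs anywhere).
table = Path(tempfile.mkdtemp()) / "pp_table"
table.write_bytes(b"\x01\x02\x03\x04")  # pretend stock table contents

backup = table.read_bytes()             # 1. back up before running upp
table.write_bytes(b"\x01\x02\xff\x04")  # 2. upp patches a value in place
table.write_bytes(backup)               # 3. restore the saved table on trouble
assert table.read_bytes() == backup
```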
Dmesg (crashed on 2nd run):
[ 111.554219] amdgpu 0000:0b:00.0: amdgpu: use vbios provided pptable
[ 111.561128] amdgpu 0000:0b:00.0: amdgpu: SMU is initialized successfully!
[ 115.878745] amdgpu 0000:0b:00.0: amdgpu: use vbios provided pptable
[ 120.478046] amdgpu 0000:0b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000006 SMN_C2PMSG_82:0x00000000
[ 120.478049] amdgpu 0000:0b:00.0: amdgpu: Failed to enable requested dpm features!
[ 120.478050] amdgpu 0000:0b:00.0: amdgpu: Failed to setup smc hw!
[ 120.478051] amdgpu 0000:0b:00.0: amdgpu: smu reset failed, ret = -62
[ 121.267338] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 121.523495] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 126.397379] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3054, emitted seq=3056
[ 126.397655] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 2157 thread kwin_wayla:cs0 pid 2191
[ 126.397914] amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
[ 126.398414] ------------[ cut here ]------------
[ 126.398416] amdgpu 0000:0b:00.0: SMU uninitialized but power ungate requested for 6!
[ 126.398450] WARNING: CPU: 2 PID: 590 at drivers/gpu/drm/amd/amdgpu/../pm/swsmu/amdgpu_smu.c:224 smu_dpm_set_power_gate+0x188/0x1a0 [amdgpu]
[ 126.398729] Modules linked in: uinput snd_seq_dummy rfcomm snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr sunrpc bnep nct6775 hwmon_vid vfat fat snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr intel_rapl_common ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi edac_mce_amd snd_usb_audio btusb snd_hda_codec btrtl btbcm snd_usbmidi_lib asus_ec_sensors snd_hda_core btintel snd_rawmidi snd_hwdep snd_seq btmtk snd_seq_device kvm eeepc_wmi bluetooth snd_pcm asus_wmi sparse_keymap platform_profile irqbypass rapl video wmi_bmof pcspkr igb snd_timer k10temp snd ecdh_generic joydev rfkill soundcore dca acpi_cpufreq v4l2loopback(OE) videodev mc i2c_piix4 i2c_dev zram dm_crypt dm_multipath amdgpu iommu_v2 crct10dif_pclmul
[ 126.398783] gpu_sched crc32_pclmul crc32c_intel ccp mxm_wmi drm_dp_helper ghash_clmulni_intel nvme drm_ttm_helper ttm sp5100_tco nvme_core wmi uas usb_storage ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse
[ 126.398797] CPU: 2 PID: 590 Comm: kworker/u64:5 Tainted: G OE 5.18.9-200.fc36.x86_64 #1
[ 126.398801] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 4403 04/27/2022
[ 126.398803] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[ 126.398810] RIP: 0010:smu_dpm_set_power_gate+0x188/0x1a0 [amdgpu]
[ 126.399086] Code: 85 ed 75 03 48 8b 2f 89 74 24 04 e8 92 1d 26 f3 44 8b 44 24 04 48 89 d9 48 89 ea 48 89 c6 48 c7 c7 e8 c8 9e c0 e8 b3 5f 66 f3 <0f> 0b b8 a1 ff ff ff e9 e4 fe ff ff e9 2f 92 21 00 e9 2a 92 21 00
[ 126.399088] RSP: 0018:ffffbc05c1247c08 EFLAGS: 00010296
[ 126.399091] RAX: 0000000000000048 RBX: ffffffffc0a31570 RCX: 0000000000000000
[ 126.399092] RDX: 0000000000000001 RSI: ffffffffb466ee24 RDI: 00000000ffffffff
[ 126.399094] RBP: ffffa0e101924440 R08: 0000000000000000 R09: ffffbc05c1247a40
[ 126.399096] R10: 0000000000000003 R11: ffffffffb4f453e8 R12: 0000000000000000
[ 126.399097] R13: ffffa0e118d87ba0 R14: ffffa0e118d88d40 R15: 0000000000000001
[ 126.399099] FS: 0000000000000000(0000) GS:ffffa0e80ea80000(0000) knlGS:0000000000000000
[ 126.399101] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 126.399103] CR2: 00007f7d877f9f50 CR3: 00000001a72bc000 CR4: 0000000000350ee0
[ 126.399105] Call Trace:
[ 126.399109] <TASK>
[ 126.399112] amdgpu_dpm_set_powergating_by_smu+0x84/0xe0 [amdgpu]
[ 126.399395] amdgpu_gfx_off_ctrl+0xc5/0x120 [amdgpu]
[ 126.399623] gfx_v10_0_set_powergating_state+0x53/0x200 [amdgpu]
[ 126.399845] amdgpu_device_set_pg_state+0x92/0xe0 [amdgpu]
[ 126.400060] ? wait_task_inactive+0x119/0x170
[ 126.400065] amdgpu_device_ip_suspend_phase1+0x1a/0xc0 [amdgpu]
[ 126.400273] ? drm_sched_increase_karma_ext+0x88/0xc0 [gpu_sched]
[ 126.400278] amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[ 126.400487] amdgpu_device_pre_asic_reset+0xbe/0x260 [amdgpu]
[ 126.400696] amdgpu_device_gpu_recover_imp.cold+0x585/0x8ae [amdgpu]
[ 126.400981] amdgpu_job_timedout+0x153/0x190 [amdgpu]
[ 126.401222] ? __switch_to+0x106/0x420
[ 126.401228] drm_sched_job_timedout+0x72/0x100 [gpu_sched]
[ 126.401234] process_one_work+0x1c7/0x380
[ 126.401238] worker_thread+0x4d/0x380
[ 126.401241] ? _raw_spin_lock_irqsave+0x23/0x50
[ 126.401245] ? process_one_work+0x380/0x380
[ 126.401247] kthread+0xe9/0x110
[ 126.401250] ? kthread_complete_and_exit+0x20/0x20
[ 126.401253] ret_from_fork+0x22/0x30
[ 126.401259] </TASK>
[ 126.401260] ---[ end trace 0000000000000000 ]---
[ 126.908899] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 129.500639] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 138.833893] amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 138.834007] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[ 139.091014] amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 139.091123] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 139.347199] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
> It doesn't crash every time; this time it was on the 2nd run, but typically it crashes within 5 runs (with 2 seconds in between).
Does it ever crash on 1st attempt after cold-boot? Have you tried longer intervals between runs, like, say, 30s? Are you running any other apps or tools that are modifying the amdgpu power states?
Judging by
[ 115.878745] amdgpu 0000:0b:00.0: amdgpu: use vbios provided pptable
[ 120.478046] amdgpu 0000:0b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000006 SMN_C2PMSG_82:0x00000000
[ 120.478049] amdgpu 0000:0b:00.0: amdgpu: Failed to enable requested dpm features!
[ 120.478050] amdgpu 0000:0b:00.0: amdgpu: Failed to setup smc hw!
[ 120.478051] amdgpu 0000:0b:00.0: amdgpu: smu reset failed, ret = -62
upp replaced the table successfully, but the SMU got stuck for ~5 s applying the change, and that ultimately crashed the GPU:
[ 126.397379] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3054, emitted seq=3056
[ 126.397655] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 2157 thread kwin_wayla:cs0 pid 2191
[ 126.397914] amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
[ 126.398414] ------------[ cut here ]------------
[ 126.398416] amdgpu 0000:0b:00.0: SMU uninitialized but power ungate requested for 6!
[ 126.398450] WARNING: CPU: 2 PID: 590 at drivers/gpu/drm/amd/amdgpu/../pm/swsmu/amdgpu_smu.c:224 smu_dpm_set_power_gate+0x188/0x1a0 [amdgpu]
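As a side note for anyone searching these logs: the ret = -62 above is -ETIME ("Timer expired", i.e. the SMU message timed out), and the later ring tests fail with -110, which is -ETIMEDOUT. A quick check of the Linux error codes with Python's errno module:

```python
import errno
import os

# Linux error codes seen in the dmesg above:
print(errno.ETIME, os.strerror(errno.ETIME))          # 62 Timer expired
print(errno.ETIMEDOUT, os.strerror(errno.ETIMEDOUT))  # 110 Connection timed out
```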
Likely to be a driver or SMU firmware issue. Not much I can do to fix this; if you feel adventurous, please report a kernel bug or look for similar ones already reported.
> Does it ever crash on 1st attempt after cold-boot?
Doesn't appear to, no
> Have you tried longer intervals between runs, like, say, 30s?
I tried 1 or 2 minutes, but it still crashed on the 2nd run.
I see, okay. Thank you.
> Does it ever crash on 1st attempt after cold-boot? Have you tried longer intervals between runs, like, say, 30s? Are you running any other apps or tools that are modifying the amdgpu power states?
Kind of replying to a closed issue, but for me it crashes no matter when. Trying to apply an upp table even right after booting results in a black screen for me, with a similar error message.
It is very likely a kernel bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2060
Applying any settings to my 6900 XT, even the same power limit and TDP that were already set, causes graphics to become unresponsive and crash, requiring a hard reboot.