sibradzic / amdgpu-clocks

Simple script to control power states of amdgpu driven GPUs
GNU General Public License v2.0
390 stars 43 forks source link

Stopping amdgpu-clocks service causes system crash #24

Closed adolfintel closed 3 years ago

adolfintel commented 3 years ago

When the service is stopped, the system locks up, as shown in this video: https://user-images.githubusercontent.com/16307689/104839896-7f72db00-58c4-11eb-9039-2270bbb25d9c.mp4

Sometimes a loud buzzing noise is emitted from the HDMI/DP output.

This makes it impossible to properly restart or shut down the machine.

This is on Manjaro, Linux 5.10, Polaris GPU, AMDGPU driver

sibradzic commented 3 years ago

Service stop does nothing but amdgpu-clocks restore (please confirm), which just sets the states to the values before amdgpu-clocks is run for the first time. Crash is very likely caused by particular values in the "restore" file, so please share the contents of /tmp/amdgpu-custom-state.card0.initial (case your Polaris card is card0).

adolfintel commented 3 years ago

Yes, amdgpu-clocks restore does the same thing.

The clocks and voltages in the restore file match the ones in the VBIOS:

cat /tmp/amdgpu-custom-states.card0.initial 
OD_SCLK:
0:        214MHz        700mV
1:        551MHz        750mV
2:        734MHz        750mV
3:        980MHz        875mV
4:       1046MHz        900mV
5:       1098MHz        950mV
6:       1124MHz       1000mV
7:       1206MHz       1050mV
OD_MCLK:
0:        300MHz        700mV
1:        625MHz        700mV
2:       1500MHz        875mV
OD_RANGE:
SCLK:     214MHz       1800MHz
MCLK:     300MHz       2000MHz
VDDC:     700mV        1075mV
FORCE_PERF_LEVEL: auto
FORCE_POWER_CAP: 36000000

This is my custom states file, which is perfectly stable:

# Set custom GPU states:
OD_SCLK:
7:       1600MHz        1075mV
# Set custom memory states:
OD_MCLK:
2:       2000MHz        875mV
# Only allow SCLK state 7:
FORCE_SCLK: 7
# Force fixed memory state 2:
FORCE_MCLK: 2
# Force power limit (in micro watts):
FORCE_POWER_CAP: 50000000
# In order to allow FORCE_SCLK & FORCE_MCLK:
FORCE_PERF_LEVEL: manual
sibradzic commented 3 years ago

It looks like the driver or your card really does not like the transition of your custom states to defaults found at /tmp/amdgpu-custom-states.card0.initial. Try commenting out all MCLK & FORCE_MCLK parts, first in custom states and then in initial states and report back.

There is probably nothing wrong with the amdgpu-clocks, it is your particular settings and the driver's or HW's inability to transition back to original state...

adolfintel commented 3 years ago

Apparently the problem is caused by FORCE_SCLK and FORCE_MCLK, if I force them both to 0 before resetting, it works as intended. Would it be possible to add this to the script?

adolfintel commented 3 years ago

Yeah, it looks like this is it, I can replicate the problem in Windows. If I force it to the highest power state with the radeon software and press the reset button in afterburner it does the same thing, apparently you can't write a power state while it's in use

sibradzic commented 3 years ago

OK, what seems to be happening in your case is that the dirver/card crashes when transitioning from manual to auto FORCE_PERF_LEVEL when manual is restricted to highest states only. Your patch would break other things (on Navi one can't set SYSDPM{S,M}CLK when not in manual, doing it blindly in set_custom_states brings risks of further breakage and commit is not necessary when setting SYSDPM{S,M}CLK). Can you test this slightly different approach:

@@ -124,6 +124,13 @@ function parse_states() {
 }

 function set_custom_states() {
+  if [ "${RESTORE}" == "true" ] && [ "${FORCE_LEVEL}" == "auto" ]; then
+    echo manual > ${PWR_LEVEL}
+    sleep 0.25
+    echo 0 > ${SYS_DPM_SCLK} 2>&1
+    echo 0 > ${SYS_DPM_MCLK} 2>&1
+    sleep 0.25
+  fi
   if [ "${FORCE_LEVEL}" ]; then
     echo ${FORCE_LEVEL} > ${PWR_LEVEL}
   fi

?

adolfintel commented 3 years ago

Yes, that works. Those extra commits were there because I noticed that they removed the forced state, but your solution works too so I'll add that to my PR.