sibradzic / upp

A tool for parsing, dumping and modifying data in Radeon PowerPlay tables
GNU General Public License v3.0

6800XT changing UCLK locks SCLK to 500Mhz #38

Closed by DiabeticCrab 1 year ago

DiabeticCrab commented 1 year ago

I tried to raise both the maximum and the applied memory clock (MCLK) of my card (reference edition, 1075 MHz by default) like this:

upp --pp-file=/sys/class/drm/card0/device/pp_table set --write \
overdrive_table/max/6=1100 \
overdrive_table/max/7=1100 \
smc_pptable/FreqTableUclk/3=1100

The memory clock does get applied under load, but SCLK is now stuck at 500 MHz. pp_od_clk_voltage lists this:

OD_SCLK:
0: 500Mhz
1: 2444Mhz

When I then use upp to restore the default values:

upp --pp-file=/sys/class/drm/card0/device/pp_table set --write \
overdrive_table/max/6=1075 \
overdrive_table/max/7=1075 \
smc_pptable/FreqTableUclk/3=1000

SCLK is still stuck at 500 MHz, and pp_od_clk_voltage now also lists 500 MHz for OD_SCLK level 1, i.e. as the maximum possible value:

OD_SCLK:
0: 500Mhz
1: 500Mhz
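To compare the two listings above without reading them by eye, here is a minimal Python sketch that parses the OD_SCLK section of pp_od_clk_voltage output. The function names (`parse_od_section`, `sclk_is_pinned`) are made up for illustration, and the parser assumes only the line shapes shown in this report (a bare `NAME:` header followed by `index: valueMhz` lines); other sections of the real file are simply skipped.

```python
import re

# A section header is a bare name followed by a colon, e.g. "OD_SCLK:".
HEADER = re.compile(r"^([A-Z_]+):$")
# A level line looks like "1: 2444Mhz" (unit spelling as printed by the driver).
LEVEL = re.compile(r"^(\d+):\s*(\d+)\s*mhz$", re.IGNORECASE)


def parse_od_section(text, section="OD_SCLK"):
    """Return {level_index: clock_in_mhz} for one section of the listing."""
    levels = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        level = LEVEL.match(line)
        if level and current == section:
            levels[int(level.group(1))] = int(level.group(2))
    return levels


def sclk_is_pinned(levels):
    """True when min and max levels report the same clock (the stuck state)."""
    return len(levels) >= 2 and len(set(levels.values())) == 1


# The two listings quoted in this report.
after_change = "OD_SCLK:\n0: 500Mhz\n1: 2444Mhz\n"
after_revert = "OD_SCLK:\n0: 500Mhz\n1: 500Mhz\n"

print(parse_od_section(after_change))                  # {0: 500, 1: 2444}
print(sclk_is_pinned(parse_od_section(after_revert)))  # True
```

On a live system the text would come from reading `pp_od_clk_voltage` in the same sysfs directory as `pp_table`.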

When I set the default values with upp, the sha256 sums of the unmodified and the "modified" pp_tables DO match.
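For anyone wanting to repeat that check, a small Python sketch of the comparison follows. The helper names and the backup file name are hypothetical; the only assumption is that a copy of the unmodified pp_table was saved before any change was written.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tables_match(original_path, current_path):
    """True when both pp_table files are byte-for-byte identical."""
    return sha256_of(original_path) == sha256_of(current_path)
```

A call such as `tables_match("pp_table.orig", "/sys/class/drm/card0/device/pp_table")` would reproduce the check, where `pp_table.orig` is an assumed backup name. A match only shows that the table contents are identical again; it says nothing about state the driver or firmware may keep elsewhere.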

Do you have any idea on how to circumvent this behavior and force the card into behaving properly? I assume that the driver may be doing funky stuff (kernel 6.1.0).

sibradzic commented 1 year ago

Hi @DiabeticCrab, please see #25, it seems to be related.

When I then use upp to restore the default values ... SCLK is still stuck at 500Mhz. When I set the default values with upp, the sha256 sums of the unmodified and the "modified" pp_tables DO match. I assume that the driver may be doing funky stuff (kernel 6.1.0).

Making certain changes in pp_table can totally screw up the driver's power/clock/voltage management, and just reverting the pp_table to default normally does not "unscrew" it (more often than not it makes it even worse). This is all about the driver and by no means upp's fault. From what I've seen, problems like these happen more often with non-mainline or amdgpu-pro drivers, especially if built against not-so-recent kernels, but sometimes even a new mainline kernel release can screw things up... Try experimenting with more conservative values, or apply one change at a time to see how it goes and whether you can isolate which particular value change makes the driver go nuts.
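The "one change at a time" approach can be sketched in Python as below. It only builds the command lines, using the same `upp --pp-file=... set --write key=value` form as the report above, and does not run anything. The function name `single_change_commands` is made up for illustration. Since reverting the table reportedly does not restore a broken state, it is assumed here that the card is brought back to a known-good state (for example by rebooting) between steps.

```python
import shlex


def single_change_commands(pp_file, changes):
    """Build one upp command per key=value pair so each can be tested alone."""
    commands = []
    for key, value in changes.items():
        commands.append(
            ["upp", f"--pp-file={pp_file}", "set", "--write", f"{key}={value}"]
        )
    return commands


# The three assignments from the original report, split into separate steps.
changes = {
    "overdrive_table/max/6": 1100,
    "overdrive_table/max/7": 1100,
    "smc_pptable/FreqTableUclk/3": 1100,
}

for argv in single_change_commands("/sys/class/drm/card0/device/pp_table", changes):
    # Print only; run each by hand and check pp_od_clk_voltage afterwards.
    print(shlex.join(argv))
```

Printing rather than executing keeps the sketch harmless: each line can be pasted and run manually, with a look at OD_SCLK after every step to see which assignment triggers the lock.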

DiabeticCrab commented 1 year ago

So I've been digging a bit and reading through parts of the amdgpu driver code. Unlike on GCN, it seems that clock speed control, overdrive and applying pp_tables have been handled by the SMU firmware since RDNA2.

When setting a pp_table or using the overdrive controls, the driver does only minimal input and value range checking, puts the data into a buffer and sends the SMU a message. The SMU then copies the data over to a private VRAM area, probably performs some more checks, and applies it.

The behaviour I've been seeing is likely the card entering some kind of "safe mode", i.e. an artificial lockdown :(

For anyone interested in looking further into the matter or experimenting, I'll leave my findings here. I have not yet tested lifting the value range restrictions and relaxing the checks within the driver, so there may be a chance to get this working. Here's what the call chain looks like for the RX 6800 and RX 69X0 series from a file-level perspective:

Relevant function names:

amdgpu_smu.c (Sets up SCPM handling (PSP verified pp_table). It seems like this could be turned off for RDNA2, since the functionality was added to the driver later during the product life. RDNA3 likely demands this to be on at all times.)

sienna_cichlid_ppt.c

smu_v11_0.c

smu_cmn.c

Here is a rough overview of where the data structure definitions can be found: