sibradzic / upp

A tool for parsing, dumping and modifying data in Radeon PowerPlay tables
GNU General Public License v3.0

Memory clock and documentation #20

Closed nanom1t closed 2 years ago

nanom1t commented 3 years ago

Hello.

I'm trying to use Uplift Power Play with AMD Navi GPUs for overclocking and undervolting. It looks good, but I can't find any detailed documentation about the options/parameters and their values. I've found a few examples, but it looks like they are specific to each GPU (RX5500/5600/5700/etc).

I need to set the following options:

1) Core clock - smc_pptable/FreqTableGfx/1=1500. I found examples that write directly to pp_od_clk_voltage, but the command just freezes in the console.

echo "s 1 1500" > /sys/class/drm/card1/device/pp_od_clk_voltage
echo "s 1 1200 900" > /sys/class/drm/card1/device/pp_od_clk_voltage

2) Core voltage - smc_pptable/MinVoltageGfx and smc_pptable/MaxVoltageGfx.

3) Memory clock - I can't find any examples for UPP. I found an example with pp_od_clk_voltage, but it does not work for my GPU.

4) Memory voltage:

smc_pptable/MemMvddVoltage/1=4800
smc_pptable/MemMvddVoltage/2=4800
smc_pptable/MemMvddVoltage/3=4800

5) Memory controller voltage:

smc_pptable/MemVddciVoltage/1=2800
smc_pptable/MemVddciVoltage/2=2800
smc_pptable/MemVddciVoltage/3=2800

6) SOC clock and voltage

There are also a lot of other params, but how do I figure out what they do?

Could you please help with this?

sibradzic commented 3 years ago

> I'm trying to use Uplift Power Play with AMD Navi GPUs for overclocking and undervolting. It looks good, but I can't find any detailed documentation about the options/parameters and their values. I've found a few examples, but it looks like they are specific to each GPU (RX5500/5600/5700/etc).

That is because it is true: the PP table parameters are VERY specific to each card. On the Polaris and Vega 56/64 generations it used to be a bit simpler, and from Radeon VII onwards the structure and parameters have been progressively gaining in complexity. This has become complex to the extent that even the very same GPU silicon can have different "capabilities" enabled, depending on the VBIOS, so while changing some PP parameter on one card has a meaningful effect, it may have no effect at all on another card.

The purpose of this project is to be able to analyse, extract and modify any PP table parameter, even for future cards. Interpreting or documenting each and every possible PP table parameter is practically impossible, because even Linux kernel developers are not sure what most of them represent, and AMD doesn't really bother to provide any meaningful documentation.

First of all, which card and which kernel are you using?

> 1. Core clock - smc_pptable/FreqTableGfx/1=1500. I found examples that write directly to pp_od_clk_voltage, but the command just freezes in the console.

Not sure what you are trying to show here; your example uses the pp_od_clk_voltage interface, not upp. Which command exactly "froze your console"? Please elaborate...

> 2. Core voltage - smc_pptable/MinVoltageGfx and smc_pptable/MaxVoltageGfx.

Have you tried changing these? Best would be to check the default values (dump command) and proceed from there, in conservative steps...

> 3. Memory clock - I can't find any examples for UPP. I found an example with pp_od_clk_voltage, but it does not work for my GPU.

On my RX5700 these are controlled by smc_pptable/FreqTableUclk/{0,1,2,3}, but be careful there: despite what people say, the memory controller on Navi cards only accepts a very limited set of VRAM clock values. On my card most random clock values crash the driver and freeze the system, but these (and possibly some others) work: 430, 500, 525, 625, 700, 725... Perhaps changing the VRAM clock works better on Navi2? I guess the only way to know for sure which clocks your VRAM can run at is trial and error.

> 4. Memory voltage
> 5. Memory controller voltage

I guess you're right; a random Google search points to this: https://mineros.info/en/faq/25

VDDCI (mV) - is the I/O bus voltage (between memory and GPU core) and comes from the PCI-Express slot
MVDD (mV) - is the memory voltage

Note that most voltages in RadeonVII/Navi/Navi2 PP tables are stored as four times the voltage in millivolts, so a value like 4800 means 4800 / 4 = 1200 mV.
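The divide-by-four rule can be written down as a tiny Python helper. This is only a sketch: the function names are made up for illustration, and the factor of 4 is assumed to apply to the voltage fields mentioned in this thread (MemMvddVoltage, MemVddciVoltage, MinVoltageGfx, MaxVoltageGfx), not necessarily to every field of every card.

```python
# Assumption (from the comment above): PP table voltage fields hold
# the voltage in millivolts multiplied by 4.
PP_VOLTAGE_SCALE = 4


def pp_to_millivolts(raw: int) -> float:
    """Convert a raw PP table voltage value to millivolts."""
    return raw / PP_VOLTAGE_SCALE


def millivolts_to_pp(millivolts: float) -> int:
    """Convert millivolts to the raw value stored in the PP table."""
    return round(millivolts * PP_VOLTAGE_SCALE)


print(pp_to_millivolts(4800))  # 1200.0 (the MemMvddVoltage example)
print(pp_to_millivolts(2800))  # 700.0  (the MemVddciVoltage example)
print(millivolts_to_pp(850))   # 3400
```

Always check the defaults with the dump command first, and convert those, before picking new values.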

> 6. SOC clock and voltage
>
> There are also a lot of other params, but how do I figure out what they do?

smc_pptable/SocketPowerLimitAc/0=100
smc_pptable/SocketPowerLimitDc/0=100
smc_pptable/VcBtcEnabled=0
smc_pptable/dBtcGbGfxDfllModelSelect=2
smc_pptable/DpmDescriptor/0/VoltageMode=2
smc_pptable/FreqTableUclk/3

As you can see, these things are very chip and even VBIOS specific. I cannot possibly document these values for all cards out there, especially for cards I don't own. I could try to document a lot of it if someone were kind enough to provide some hardware for testing...

People like you could test some of the significant parameters and contribute to improving this project's documentation.

nanom1t commented 3 years ago

Thank you for the reply and this project. I'm using a Gigabyte RX 5500 XT GPU on Ubuntu 20.04 LTS with AMD drivers v20.40.

I understand that it is very difficult to document every possible PP table parameter and that it would take a lot of time and effort. I only meant documenting a few main parameters (for example, core clock, core voltage, memory clock, memory voltage, SOC clock/voltage, power limit, etc).

> Not sure what you are trying to show here; your example uses the pp_od_clk_voltage interface, not upp. Which command exactly "froze your console"? Please elaborate...

I'm trying to set the GPU core clock here, and I found an example using the pp_od_clk_voltage interface:

echo "s 1 1500" > /sys/class/drm/card1/device/pp_od_clk_voltage
echo "s 1 1200 900" > /sys/class/drm/card1/device/pp_od_clk_voltage

But it does not work in my case. Right after entering the first command the console stops responding and I can't interrupt it even with ^C. So I'm looking for a UPP alternative, and it looks like that is smc_pptable/FreqTableGfx/1=1500, but there is no documentation and I don't know whether this is the right parameter for the core clock or whether I should set a few other parameters as well.

> > Core voltage - smc_pptable/MinVoltageGfx and smc_pptable/MaxVoltageGfx.
>
> Have you tried changing these? Best would be to check the default values (dump command) and proceed from there, in conservative steps...

The same case here... I need to set the core voltage to undervolt the GPU, and I tried the smc_pptable/MaxVoltageGfx and smc_pptable/MinVoltageGfx params. They are set without any errors, but I don't see any serious impact on power usage like there was on Polaris GPUs, so I'm not sure whether I'm using them right or whether I should set a few other parameters.

> On my RX5700 these are controlled by smc_pptable/FreqTableUclk/{0,1,2,3}

Thanks for this memory clock example. I had only found examples with overdrive_table/max/8=XXX, but that only changes OD_RANGE and doesn't actually change the memory clock.

> SocketPowerLimitAc & SocketPowerLimitDc - should be obvious, it is the total power limit of the GPU silicon. Set both values to be sure. Note that on my RX5700 setting PP table Ac & Dc limits was not enough, the driver was still observing the power limit set by hwmon, which I had to override as well: cat /sys/class/hwmon/$HWMON_ID/power1_cap

Yes, I'm using power1_cap to set the power limit and it works. Will I get the same result if I set SocketPowerLimitAc/SocketPowerLimitDc equal to power1_cap?

sibradzic commented 3 years ago

> But it does not work in my case. Right after entering the first command the console stops responding and I can't interrupt it even with ^C. So I'm looking for a UPP alternative, and it looks like that is smc_pptable/FreqTableGfx/1=1500, but there is no documentation and I don't know whether this is the right parameter for the core clock or whether I should set a few other parameters as well.

Yes, smc_pptable/FreqTableGfx/1 should be the correct way to control the clock part of the "top-right point" of the clock-voltage curve, check it out and report back if it works for you...

> The same case here... I need to set the core voltage to undervolt the GPU, and I tried the smc_pptable/MaxVoltageGfx and smc_pptable/MinVoltageGfx params. They are set without any errors, but I don't see any serious impact on power usage like there was on Polaris GPUs, so I'm not sure whether I'm using them right or whether I should set a few other parameters.

How do you observe the actual power usage? Setting MaxVoltageGfx in tandem with smc_pptable/FreqTableGfx/1 should give you good control of the upper voltage & clock limit (these two values combined should define the max GFX power).

Another way to control the GFX clock-voltage curve is through the parameters in smc_pptable/qStaticVoltageOffset/0/c/{a,b,c}, which should default to 0. These are actually the parameters a, b and c of a quadratic function, with which you can control the offset in a very precise way.
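To make the "quadratic function" part concrete, here is a minimal Python sketch that evaluates offset(x) = a*x^2 + b*x + c. The helper name is made up, and what the input x represents and which units the firmware uses for the inputs and the result are assumptions that this thread does not settle, so treat the numbers purely as arithmetic.

```python
def quadratic_offset(x: float, a: float, b: float, c: float) -> float:
    """Evaluate offset(x) = a*x^2 + b*x + c.

    x, a, b and c are unitless here; the real units used by the
    firmware are not documented in this thread.
    """
    return a * x * x + b * x + c


# With the defaults a = b = c = 0 the offset is zero everywhere.
print(quadratic_offset(1.5, 0.0, 0.0, 0.0))     # 0.0

# A pure constant term shifts the whole curve by the same amount.
print(quadratic_offset(1.5, 0.0, 0.0, -0.025))  # -0.025
```

Because only c is independent of x, changing c alone moves the whole curve, while a and b make the offset vary along the curve.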

> Yes, I'm using power1_cap to set the power limit and it works. Will I get the same result if I set SocketPowerLimitAc/SocketPowerLimitDc equal to power1_cap?

Not on my RX5700, but it may on your card. Let me know if it does; if it does not, it is a driver limitation, if not a bug.

nanom1t commented 3 years ago

> How do you observe the actual power usage? Setting MaxVoltageGfx in tandem with smc_pptable/FreqTableGfx/1 should give you good control of the upper voltage & clock limit (these two values combined should define the max GFX power).

I have a power meter plugged into the wall socket and I also get information from /sys/kernel/debug/dri/XXX/amdgpu_pm_info. Yes, I'm using smc_pptable/FreqTableGfx/1 to set the core clock and smc_pptable/MinVoltageGfx and smc_pptable/MaxVoltageGfx for the core voltage. It looks like it works, but the core clock/voltage is not set to the exact value specified. For example, if I set the core clock to 1500 MHz and the core voltage to 850 mV, the core voltage shown in pp_od_clk_voltage is 860mV. Navi probably works in a different way than Polaris.

Changing smc_pptable.FreqTableGfx.1 from 1900 to 1500 at 0x330
Changing smc_pptable.MinVoltageGfx from 2800 to 2400 at 0x24a
Changing smc_pptable.MaxVoltageGfx from 4600 to 3400 at 0x24e
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
OD_SCLK:
0: 500Mhz
1: 1450Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 500MHz 720mV
1: 975MHz 727mV
2: 1450MHz 860mV
OD_RANGE:
SCLK:     500Mhz       2000Mhz
MCLK:     625Mhz        930Mhz

Changing smc_pptable.FreqTableGfx.1 from 1900 to 1900 at 0x330
Changing smc_pptable.MinVoltageGfx from 2800 to 2400 at 0x24a
Changing smc_pptable.MaxVoltageGfx from 4600 to 4400 at 0x24e
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
OD_SCLK:
0: 500Mhz
1: 1835Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 500MHz 720mV
1: 1167MHz 762mV
2: 1835MHz 1080mV
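
The gap between the requested and the reported voltage can be checked with a few lines of Python. This is only a sketch: the parser is a made-up helper written against the pp_od_clk_voltage output pasted above, and it assumes MaxVoltageGfx is stored as millivolts times 4, as mentioned earlier in the thread.

```python
import re

# First pp_od_clk_voltage listing from the log above.
SAMPLE = """\
OD_SCLK:
0: 500Mhz
1: 1450Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 500MHz 720mV
1: 975MHz 727mV
2: 1450MHz 860mV
OD_RANGE:
SCLK:     500Mhz       2000Mhz
MCLK:     625Mhz        930Mhz
"""


def parse_vddc_curve(text):
    """Return [(mhz, mv), ...] from the OD_VDDC_CURVE section."""
    points = []
    in_curve = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Section headers look like "OD_SCLK:" (nothing after the colon).
        if line.endswith(":"):
            in_curve = line == "OD_VDDC_CURVE:"
            continue
        match = re.match(r"\d+:\s*(\d+)MHz\s+(\d+)mV$", line, re.IGNORECASE)
        if in_curve and match:
            points.append((int(match.group(1)), int(match.group(2))))
    return points


curve = parse_vddc_curve(SAMPLE)
max_voltage_gfx = 3400              # raw value written in the first run
requested_mv = max_voltage_gfx / 4  # assumed scale: millivolts * 4
top_mhz, top_mv = curve[-1]
print(curve)  # [(500, 720), (975, 727), (1450, 860)]
print(f"requested {requested_mv:.0f} mV, reported {top_mv} mV at {top_mhz} MHz")
```

For the second run the same arithmetic gives 4400 / 4 = 1100 mV requested against 1080 mV reported, so the difference is not a constant offset.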
nanom1t commented 3 years ago

> Yes, I'm using power1_cap to set the power limit and it works. Will I get the same result if I set SocketPowerLimitAc/SocketPowerLimitDc equal to power1_cap?

The SocketPowerLimitAc/SocketPowerLimitDc params do not work for me either:

Changing smc_pptable.SocketPowerLimitAc.0 from 130 to 80 at 0x1ee
Changing smc_pptable.SocketPowerLimitDc.0 from 120 to 80 at 0x1fe
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
GFX Clocks and Power:
    875 MHz (MCLK)
    1895 MHz (SCLK)
    1300 MHz (PSTATE_SCLK)
    625 MHz (PSTATE_MCLK)
    1087 mV (VDDGFX)
    **133.0 W (average GPU)**

Setting power1_cap works well.

sibradzic commented 3 years ago

> It looks like it works, but the core clock/voltage is not set to the exact value specified. For example, if I set the core clock to 1500 MHz and the core voltage to 850 mV, the core voltage shown in pp_od_clk_voltage is 860mV.

It could be that some voltage offset is set somewhere, maybe even in the hardware itself. On my Navi RX5700 the readings from amdgpu_pm_info are quite consistent with the smc_pptable/{Min,Max}VoltageGfx settings.

> The SocketPowerLimitAc/SocketPowerLimitDc params do not work for me either... Setting power1_cap works well.

OK, good to know, same as RX5700.

nanom1t commented 3 years ago

> On my RX5700 these are controlled by smc_pptable/FreqTableUclk/{0,1,2,3}, but be careful there: despite what people say, the memory controller on Navi cards only accepts a very limited set of VRAM clock values. On my card most random clock values crash the driver and freeze the system, but these (and possibly some others) work: 430, 500, 525, 625, 700, 725... Perhaps changing the VRAM clock works better on Navi2? I guess the only way to know for sure which clocks your VRAM can run at is trial and error.

I've tried using smc_pptable/FreqTableUclk/{0,1,2,3} to set the memory clock on my RX5500 XT, but it behaves very strangely, or I'm doing something wrong. According to pp_od_clk_voltage the memory clock range for my GPU is 625-930 MHz:

OD_SCLK:
0: 500Mhz
1: 1500Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 500MHz 721mV
1: 1000MHz 732mV
2: 1500MHz 883mV
OD_RANGE:
SCLK:     500Mhz       2000Mhz
MCLK:     625Mhz        930Mhz

The current state for memory clock is 875MHz in pp_dpm_mclk:

0: 100Mhz 
1: 500Mhz 
2: 625Mhz 
3: 875Mhz *

UPP FreqTableUclk:

  FreqTableUclk:
    FreqTableUclk 0: 100
    FreqTableUclk 1: 500
    FreqTableUclk 2: 625
    FreqTableUclk 3: 875

So I tried setting smc_pptable/FreqTableUclk/3 to 900 MHz. It is set without any errors and OD_MCLK now shows 900 MHz:

Changing smc_pptable.FreqTableGfx.1 from 1900 to 1500 at 0x330
Changing smc_pptable.FreqTableUclk.3 from 875 to 900 at 0x384
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
OD_SCLK:
0: 500Mhz
1: 1500Mhz
OD_MCLK:
1: 900MHz
OD_VDDC_CURVE:
0: 500MHz 720mV
1: 1000MHz 730mV
2: 1500MHz 883mV
OD_RANGE:
SCLK:     500Mhz       2000Mhz
MCLK:     625Mhz        930Mhz

However, pp_dpm_mclk shows that the GPU memory actually runs at 625 MHz:

cat /sys/class/drm/card0/device/pp_dpm_mclk 
0: 100Mhz 
1: 500Mhz 
2: 625Mhz *
3: 900Mhz

OK. Next I tried changing smc_pptable.FreqTableUclk.2 from 625 to 900 as well.

Changing smc_pptable.FreqTableGfx.1 from 1900 to 1500 at 0x330
Changing smc_pptable.FreqTableUclk.2 from 625 to 900 at 0x382
Changing smc_pptable.FreqTableUclk.3 from 875 to 900 at 0x384
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
OD_SCLK:
0: 500Mhz
1: 1500Mhz
OD_MCLK:
1: 900MHz

Now the memory runs at 500 MHz:

cat /sys/class/drm/card0/device/pp_dpm_mclk 
0: 100Mhz 
1: 500Mhz *
2: 900Mhz 
3: 900Mhz

Setting smc_pptable.FreqTableUclk.1 from 500 to 900 shows the same result:

0: 100Mhz *
1: 900Mhz 
2: 900Mhz 
3: 900Mhz

Setting smc_pptable.FreqTableUclk.0 to 900 MHz just crashed the driver and I saw artifacts on the monitor. Setting the memory clock to 925/930/625 MHz for all 4 states shows the same results.

What am I doing wrong?

Thanks

sibradzic commented 3 years ago

> What am I doing wrong?

Probably nothing wrong at all. Have you checked if there is any sort of limit in the PP table limits section that defaults to 875? If not, it's probably some sort of driver or firmware limitation. Have you perhaps tried changing VRAM clock in Windows?

nanom1t commented 3 years ago

Only these PP table values are equal to 875:

upp -p /sys/class/drm/card0/device/pp_table dump | grep 875

FreqTableUclk 3: 875
DcModeMaxFreq 2: 875
power_saving_clock:
  ...
  max:
    ...
    max 5: 875

I've tried to set power_saving_clock/max/5 and smc_pptable/DcModeMaxFreq/2 to 930, but it does not help.

upp -p /sys/class/drm/card0/device/pp_table set power_saving_clock/max/5=930 --write
Changing power_saving_clock.max.5 from 875 to 930 at 0x04a
Commiting changes to '/sys/class/drm/card0/device/pp_table'.

upp -p /sys/class/drm/card0/device/pp_table set smc_pptable/DcModeMaxFreq/2=930 --write
Changing smc_pptable.DcModeMaxFreq.2 from 875 to 930 at 0x40a
Commiting changes to '/sys/class/drm/card0/device/pp_table'.

upp -p /sys/class/drm/card0/device/pp_table set smc_pptable/FreqTableUclk/3=930 --write
Changing smc_pptable.FreqTableUclk.3 from 875 to 930 at 0x384
Commiting changes to '/sys/class/drm/card0/device/pp_table'.

Also, I don't know why pp_od_clk_voltage does not work in my case. It looks like it should fix this.

> Have you perhaps tried changing VRAM clock in Windows?

No. Will try to apply VRAM clock in Windows later.

wunderbar78 commented 3 years ago

I'm having the same VRAM issue with an RX 5700 XT. When overclocking the VRAM (by 1 or 100 MHz, it doesn't matter), the VRAM falls back into the DPM2 state. Setting the PPTable in Windows works fine; I've used MorePowerTool to set the VRAM to 950 MHz without issues.

I'm running Linux Mint 20.1 (Ubuntu 20.04) with Mainline Kernel (5.12.4) and Mesa 21.1 RC5 (Mesa Almost Stable PPA)

At least I now have a way to run the VRAM below 875 MHz, otherwise it won't be downclocked with a 144Hz dual monitor setup... :) :|

sibradzic commented 3 years ago

hi @wunderbar78

You can actually use upp to check which PowerPlay data MorePowerTool modifies, using the --from-registry option:

pip3 install --user python-registry

# Assuming your windows installation is on /dev/sdb:
sudo mount /dev/sdb2 /mnt/
upp --from-registry /mnt/Windows/System32/config/SYSTEM dump

Now you can use tools like Meld to visually compare this dump with the PowerPlay settings of your Linux system, which you get with upp dump. Please let me know what the PowerPlay table differences are between the unmodified Linux state and Windows after you "used MorePowerTool to set the VRAM to 950 MHz without issues" (please provide both dumps or a unified diff)...

Another possibility is that the Linux kernel driver is blocking the MEM clock change for whatever reason. On my Ubuntu 21.04 (as well as 20.04) with kernel 5.11 and Mesa 21.0+ I have no issue setting MEM clocks, but only certain clocks in certain states work without crashing.

BTW, when you say "I've used MorePowerTool to set the VRAM to 950 MHz without issues", what does that really mean? Which particular MCLK state did you set to 950MHz in Windows?

wunderbar78 commented 3 years ago

Hello sibradzic! Thanks for the quick response! :)

I've booted Windows, reverted all my settings with MPT, loaded the running video BIOS into MPT and only changed the DPM3 VRAM clock to 950. Then I booted Mint, dumped the running PPTable and the Windows PPTable, and compared them.

Right after booting Mint, SocketPowerLimitDc 0 is set to 180W, even though the VBIOS and Windows run with 190W. SocketPowerLimitAc is set to 190, which is correct.

Then there are 3 more differences in the files:

FreqTableUclk 3: 875
DcModeMaxFreq 2: 875

and in the header of the PPTable there is power_saving_clock > max > max 5: 875

I can change SocketPowerLimitDc, FreqTableUclk 3 and DcModeMaxFreq 2 to match the Windows settings, but I don't know how to adjust power_saving_clock. Except for power_saving_clock, the settings are the same now and the VRAM clock is nailed to DPM2 @ 625MHz.

Here are the dumps from Windows, after booting Linux and after setting the mentioned values with UPP...

upp-ppt-windows-mpt-dump.txt upp-ppt-linux-after-boot.txt upp-ppt-linux-ram950-upp-cmdline.txt

And here is a screenshot of MPT and GPU-Z. The only thing I've changed is Memory DPM 3.

The saved settings from MorePowerTool are attached as a .mpt file, in case you want to have a look: MPT_DPM3_MEM950.mpt.tar.gz

Gonna test different kernels now...

sibradzic commented 3 years ago

Here is the PowerPlay table diff between your Linux default one and the MPT mods, only 4 changes:

diff -u upp-ppt-linux-after-boot.txt upp-ppt-windows-mpt-dump.txt
--- upp-ppt-linux-after-boot.txt    2021-05-22 21:05:19.393891542 +0900
+++ upp-ppt-windows-mpt-dump.txt    2021-05-22 21:05:13.013872145 +0900
@@ -1,3 +1,6 @@
+Successfully loaded Soft PowerPlay data from /media/user/Windows/Windows/System32/config/SYSTEM
+  key:value > HKLM\SYSTEM\ControlSet001\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318}\0000:PP_PhmSoftPowerPlayTable
+
 header:
   structuresize: 1674
   format_revision: 12
@@ -35,7 +38,7 @@
     max 2: 1086
     max 3: 1267
     max 4: 1267
-    max 5: 875
+    max 5: 950
     max 6: 1267
     max 7: 1284
     max 8: 1284
@@ -186,7 +189,7 @@
     SocketPowerLimitAcTau 2: 0
     SocketPowerLimitAcTau 3: 0
   SocketPowerLimitDc:
-    SocketPowerLimitDc 0: 180
+    SocketPowerLimitDc 0: 190
     SocketPowerLimitDc 1: 0
     SocketPowerLimitDc 2: 0
     SocketPowerLimitDc 3: 0
@@ -390,7 +393,7 @@
     FreqTableUclk 0: 100
     FreqTableUclk 1: 500
     FreqTableUclk 2: 625
-    FreqTableUclk 3: 875
+    FreqTableUclk 3: 950
   FreqTableDcefclk:
     FreqTableDcefclk 0: 507
     FreqTableDcefclk 1: 1267
@@ -447,7 +450,7 @@
   DcModeMaxFreq:
     DcModeMaxFreq 0: 2100
     DcModeMaxFreq 1: 1267
-    DcModeMaxFreq 2: 875
+    DcModeMaxFreq 2: 950
     DcModeMaxFreq 3: 1086
     DcModeMaxFreq 4: 1267
     DcModeMaxFreq 5: 1267

You should be able to get the same MEM clock as on Windoze with this command:

sudo -E env "PATH=$PATH" upp set        \
  power_saving_clock/max/5=950          \
  smc_pptable/SocketPowerLimitDc/0=190  \
  smc_pptable/FreqTableUclk/3=950       \
  smc_pptable/DcModeMaxFreq/2=950       \
  --write

Note that SocketPowerLimitDc/0 alone is not enough to raise the power limit in the kernel driver; you need to change the limit using the sysfs interface as well:

CARD_ID=card0
HWMON_ID=$(ls -1 /sys/class/drm/${CARD_ID}/device/hwmon)
echo 190000000 | sudo tee /sys/class/hwmon/${HWMON_ID}/power1_cap

This should all work in Mint, especially on kernel 5.11 which is supposed to be based on Ubuntu's config...
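
The value written to power1_cap above is in microwatts, which is why 190 W becomes 190000000. A small Python sketch of that conversion and of the hwmon lookup follows; the function names are made up for illustration, and the lookup assumes the same sysfs layout as the shell snippet above.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def watts_to_power1_cap(watts: float) -> int:
    """Convert watts to the microwatt value that power1_cap expects."""
    return round(watts * MICROWATTS_PER_WATT)


def find_power1_cap(card: str = "card0", sysfs_root: str = "/sys"):
    """Return the power1_cap path for a card, or None if it is not found."""
    hwmon_dir = Path(sysfs_root) / "class" / "drm" / card / "device" / "hwmon"
    if not hwmon_dir.is_dir():
        return None
    candidates = sorted(hwmon_dir.glob("hwmon*/power1_cap"))
    return candidates[0] if candidates else None


print(watts_to_power1_cap(190))  # 190000000
print(find_power1_cap())         # a path on a system with such a card, else None
```

Writing to the returned path still needs root, just like the sudo tee above.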

wunderbar78 commented 3 years ago

It's working... somehow... but it's a little weird.

After boot, I start a terminal to monitor the GPU with watch cat /sys/kernel/debug/dri/0/amdgpu_pm_info and another one to run this script:

#!/bin/bash
CARD_ID=card0
HWMON_ID=$(ls -1 /sys/class/drm/${CARD_ID}/device/hwmon)
echo 190000000 | sudo tee /sys/class/hwmon/${HWMON_ID}/power1_cap

sudo -E env "PATH=$PATH" upp set power_saving_clock/max/5=900 smc_pptable/SocketPowerLimitDc/0=190 smc_pptable/FreqTableUclk/3=900 smc_pptable/DcModeMaxFreq/2=900 --write

Result: VRAM stuck at DPM2.

BTW, thanks for the hint with the PowerLimit! :)

Then I force the card to a high performance level:

echo "high" | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

No change. Then I tried manual and different power profiles (/sys/class/drm/card0/device/pp_power_profile_mode), but nothing happened (except that when running at 625MHz I have to use manual and powersave, otherwise I get artifacts because of the low clock speed or power level).

Then I start CoreCtrl with nothing activated, and magically the VRAM clock is at 900MHz. All settings/profiles are disabled and set to Auto.

No idea what magic CoreCtrl does here, but it's working this way. It doesn't matter if I first set the Performance Level or start CoreCtrl. Setting CoreCtrl > Global Profile to Fixed High Performance Level works too. But in the end I have to do both (CoreCtrl & Perf.Level) to get the desired VRAM clock. (btw when I now overclock in CoreCtrl above the speed I set in the script initially, the clock drops down to 625 again)

For now I can run the script and launch CoreCtrl, no worries. But maybe you have an idea what I can do to get it working with a script!

Many thanks!

PS: Not sure if my issue is still related to OP. Please let me know if I should have made my own topic, for the next time...

sibradzic commented 3 years ago

Something is definitely strange with your setup, could be firmware (which linux-firmware version are you using?) or even card's VBIOS (are you using a stock one?). Do you have amdgpu.ppfeaturemask=0xffffffff set in the kernel boot params? Is RX5700XT the only GPU in your system, no iGPUs and such?

I have no idea what CoreCtrl is doing to your card, never tried that one, perhaps you can check some of its logs or source itself... Under "normal" circumstances (recent upstream kernel driver, recent firmware, stock VBIOS) MEM clocks DO CHANGE with just applying changes to the PowerPlay table, the driver picks up the changes on runtime and driver's PM applies them just fine.

If you really want to get to the bottom of this, I suggest you go 100% stock, VBIOS included, with say Ubuntu 21.04, which has all the most recent amdgpu bits out-of-the-box, no amdgpu-pro or similar drivers, no overclocking tools of any sort in the system, and just fire the upp command I posted above and report back.

wunderbar78 commented 3 years ago

Yes, that's what I've thought, too. It's not the freshest installation and probably a good idea to start over. Maybe I'll look at some other distros (not Ubuntu) that are more up to date out of the box. I've tried several gpu tools with this install and I think I didn't reinstall when I switched from Nvidia to AMD, and that was with Mint 19.x or so. So the installation can definitely be messed up.... To answer your questions: I use the featuremask 0xfffd7fff in grub, but have temporarily changed it to 0xffffff without anything changing. Otherwise there is nothing special in the kernel parameters: quiet loglevel=2 iommu=soft amdgpu.ppfeaturemask=0xFFFD7FFF amdgpu.audio=0 mitigations=off.
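
As a side note on the two masks, a quick Python check shows which bits differ between 0xFFFD7FFF and 0xFFFFFFFF. The helper name is made up, and which driver features those bits control is not stated in this thread, so this is only the bit arithmetic.

```python
def cleared_bits(mask: int, width: int = 32):
    """Return the bit positions (0 = least significant) that are 0 in mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


print(cleared_bits(0xFFFD7FFF))      # [15, 17]
print(hex(0xFFFFFFFF ^ 0xFFFD7FFF))  # 0x28000

# Written with six hex digits, the mask leaves the top 8 bits cleared.
print(cleared_bits(0xFFFFFF))        # [24, 25, 26, 27, 28, 29, 30, 31]
```

So the two 32-bit masks differ only in bits 15 and 17, while a value typed as 0xffffff would not be the same as 0xffffffff.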

The VBIOS is the default OC BIOS of my PowerColor Red Dragon and I only tinkered with the silent BIOS, so the BIOS in use is untouched. The Linux firmware is 1.190-2 from the Mesa Almost Stable PPA. I tried 2 other firmware versions available in the repos, but they are too old and didn't work well... I never used the amdgpu-pro driver from AMD.

I'll let you know how it goes with a fresh install, but no idea when that will happen. Cheers mate!

wunderbar78 commented 3 years ago

I'm back with Ubuntu Cinnamon Remix 21.04 and I'm not happy to report that it still doesn't work (as intended).

I've tested the default 5.11.0-17 Ubuntu kernel and Mesa 21.0.1 from the official repo. I reflashed the powersave VBIOS from TechPowerUp database, loaded the default Bios settings, turned rBAR on and off, PCIe Gen 3 and 4, tried the OC VBIOS again and the 5.12 Mainline kernel. The kernel parameter is stripped down to amdgpu.ppfeaturemask=0xFFFFFFFF. No other tools have been installed. The only difference I see is when using the Mainline kernel; the SCLK clock is not going up when setting the performance level to high. But it's working with the normal Ubuntu kernel!

the latest command I used:

sudo -E env "PATH=$PATH" upp set  --write \
smc_pptable/FreqTableUclk/3=900 \
smc_pptable/DcModeMaxFreq/2=900 \
power_saving_clock/max/5=900 \
smc_pptable/SocketPowerLimitDc/0=190 \
smc_pptable/MaxVoltageGfx=4700 \
power_saving_clock/max/0=2050 \
smc_pptable/DcModeMaxFreq/0=2050 \
smc_pptable/SocketPowerLimitAc/0=190 \
overdrive_table/cap/9=1

I added overdrive_table/cap/9=1 cause it was different between OC and powersave VBIOS. I also tried ordering the different parameters/values and/or applying them one by one with a script instead of one long command. Getting Windows vibes here...

Then I had a look at /var/log/kern.log (better late than never) and it shows the following:

May 29 17:21:11 unit-01 kernel: [ 5818.225297] amdgpu 0000:2d:00.0: amdgpu: smu driver if version = 0x00000036, smu fw if version = 0x00000037, smu fw version = 0x002a3e00 (42.62.0)
May 29 17:21:11 unit-01 kernel: [ 5818.225301] amdgpu 0000:2d:00.0: amdgpu: SMU driver if version not matched
May 29 17:21:11 unit-01 kernel: [ 5818.225353] amdgpu 0000:2d:00.0: amdgpu: use vbios provided pptable
May 29 17:21:11 unit-01 kernel: [ 5818.225354] amdgpu 0000:2d:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
May 29 17:21:11 unit-01 kernel: [ 5818.227724] amdgpu 0000:2d:00.0: amdgpu: SMU is initialized successfully!

When I see the 3rd line, I guess there is nothing applied at all.

Next I tried the Mesa Almost Stable PPA and the Oibaf PPA, but nothing changed, so I reverted to Ubuntu's Mesa. (At this point I wasn't sure where the smu driver comes from.) Then I looked at linux-firmware, as the smu driver version mismatch above was still present. The first try was with linux-firmware-187, but then the smu fw version was too old. linux-firmware-190 was a match, but the only thing that disappeared was the mismatch notice. It still says amdgpu: use vbios provided pptable. Another try with Mesa Almost Stable was also not successful.

Searching around the web got me the idea of unplugging one of my monitors, cause people had problems with Dual-DisplayPort setups, but no. Right now I'm writing on a single HDMI connection with 60Hz.

The last interesting bit I found is this amdgpu kernel parameter, but not sure how to apply a pptable manually. I've set the parameter to 1 and tested, but that's not how it works. Not sure where to place a file or whatever that I specify with the id. https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

smu_pptable_id (int)
Used to override pptable id. id = 0 use VBIOS pptable. id > 0 use the soft pptable with specified id.

So for now, I'm a little clueless and done. No idea what to do next except to hunt some pixels to relax...

sibradzic commented 3 years ago

OK, I've run some tests again on my Ubuntu 21.04, stock 5.11 kernel, stock linux-firmware 1.197, kisak-mesa-fresh (not really relevant for this issue)...

I've never ever tried to overclock my VRAM beyond the default 875MHz limit, because I'm running my 5700 completely fan-less, so there is no point in overclocking; upp worked for me just fine when I tried going below the default VRAM clock limits. I suggest you try it out, for the sake of "science": if it works for you too, it means there is absolutely nothing wrong with upp and it sets the PowerPlay values just fine.

My system boots with simple defaults; I've disconnected my secondary monitor just in case, everything else is at stock, other than ppfeaturemask:

# everything at defaults at boot time
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-17-generic root=UUID=143ecccc-4545-4fc3-ac85-2cc91134ffff ro amdgpu.ppfeaturemask=0xffffffff
 . . .
[    7.477960] amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x00000036, smu fw if version = 0x00000037, smu fw version = 0x002a3e00 (42.62.0)
[    7.477962] amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
[    7.478016] amdgpu 0000:0c:00.0: amdgpu: use vbios provided pptable
[    7.478017] amdgpu 0000:0c:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[    7.512794] amdgpu 0000:0c:00.0: amdgpu: SMU is initialized successfully!

This is just to verify the default states:

# Get the default PP table values:
upp get        \
  power_saving_clock/max/5          \
  smc_pptable/SocketPowerLimitDc/0  \
  smc_pptable/FreqTableUclk/3       \
  smc_pptable/DcModeMaxFreq/2
875
100
875
875

# root terminal:
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz *
1: 500Mhz 
2: 625Mhz 
3: 875Mhz

To get to the bottom of this I've tried setting the VRAM clock above default, and I think I could reproduce your issue by trying to set the VRAM clocks to 900:

sudo -E env "PATH=$PATH" upp set        \
  power_saving_clock/max/5=900         \
  smc_pptable/SocketPowerLimitDc/0=120  \
  smc_pptable/FreqTableUclk/3=900       \
  smc_pptable/DcModeMaxFreq/2=900       \
  --write
[sudo] password for user: 
Changing power_saving_clock.max.5 from 875 to 900 at 0x04a
Changing smc_pptable.SocketPowerLimitDc.0 from 100 to 120 at 0x1fe
Changing smc_pptable.FreqTableUclk.3 from 875 to 900 at 0x384
Changing smc_pptable.DcModeMaxFreq.2 from 875 to 900 at 0x40a
Commiting changes to '/sys/class/drm/card0/device/pp_table'.

# root terminal:
dmesg | tail | grep amdgpu
[  163.033552] amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x00000036, smu fw if version = 0x00000037, smu fw version = 0x002a3e00 (42.62.0)
[  163.033557] amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
[  163.033616] amdgpu 0000:0c:00.0: amdgpu: use vbios provided pptable
[  163.033617] amdgpu 0000:0c:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[  163.035860] amdgpu 0000:0c:00.0: amdgpu: SMU is initialized successfully!
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz *
1: 500Mhz 
2: 625Mhz 
3: 900Mhz

This means that upp sets the PP table state just fine: "state 3" above correctly shows 900MHz. Do not worry about "use vbios provided pptable" in dmesg. It is misleading; the pptable is actually loaded from the updated /sys/class/drm/card0/device/pp_table. Note that on Radeon VII/Navi/Navi 2 cards there aren't really discrete clock states like on Polaris; pp_dpm_mclk just represents the clock limits as "states". Even when the * sits at some numbered state, it only means that the actual clock is approximately at the level shown for that state.
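For scripting around the pp_dpm_mclk output shown above, here is a minimal sketch of two helpers that pick out the state marked with * and the highest listed limit. The function names are made up for illustration, and the parsing assumes the "N: <value>Mhz [*]" layout seen in this thread.

```shell
#!/bin/sh
# Print the clock value (e.g. "875Mhz") of the state currently marked with "*".
# The file path is a parameter so the helper can also be tried on a saved copy.
active_state() {
  awk '/\*/ { print $2 }' "$1"
}

# Print the value of the last listed state, i.e. the current upper clock limit.
max_state() {
  awk '{ last = $2 } END { print last }' "$1"
}
```

Usage would be something like `active_state /sys/class/drm/card0/device/pp_dpm_mclk`. As noted above, the marked state is only an approximation of the real clock on these cards.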

Now I started some GPU load in another terminal with ./start_furmark_windowed_1024x640.sh, and:

cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz 
1: 500Mhz 
2: 625Mhz *
3: 900Mhz

As you can see, pp_dpm_mclk reports 625MHz, but in reality the VRAM clock is probably at the default 875MHz (I wonder if there is a better way to observe the actual VRAM clock than the pp_dpm_mclk interface). Forcing power_dpm_force_performance_level to high did not change anything at all:

echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz 
1: 500Mhz 
2: 625Mhz *
3: 900Mhz
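As an aside on observing the actual VRAM clock: the amdgpu kernel documentation describes a hwmon file, freq2_input, that reports the memory clock in hertz. The sketch below reads it and converts to MHz. Whether it reflects reality better than pp_dpm_mclk on this card was not tested in this thread, and the function name is hypothetical.

```shell
#!/bin/sh
# Print the memory clock in MHz from an hwmon directory.
# Assumption: freq2_input exists there and holds the memory clock in Hz.
mclk_mhz() {
  local hz
  hz=$(cat "$1/freq2_input") || return 1
  echo $(( hz / 1000000 ))
}
```

The hwmon directory can be found under /sys/class/drm/card0/device/hwmon/, for example `mclk_mhz /sys/class/drm/card0/device/hwmon/hwmon*`.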

However, "committing" to the completely different pp_od_clk_voltage driver interface (even though no actual changes were made on that interface!) after changing the VRAM clock through the PP table seems to do the trick:

echo c > /sys/class/drm/card0/device/pp_od_clk_voltage
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz 
1: 500Mhz 
2: 625Mhz 
3: 900Mhz *

This is totally weird and very likely an actual amdgpu driver bug (please report it if you feel adventurous). It is also the likely reason why things kind of worked for you when you fired up CoreCtrl after changing the PP tables with upp. For most PP table modifications this extra "commit" is not necessary: one can change GFX clocks, voltages, limits, etc. by simply writing into /sys/class/drm/card0/device/pp_table; the driver detects the changes and its power subsystem applies them at runtime. However, when the VRAM clock is increased above the default max limit, this reload logic fails to raise the max clock state for some reason, and this probably happens on both Navi and Navi 2. A similar issue exists when changing the socket power limits (such as SocketPowerLimitDc) through the PP table: the driver does not consider that to be enough and also requires a change in /sys/class/hwmon/${HWMON_ID}/power1_cap.
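The two follow-up writes described above can be sketched as one helper. This is only an illustration under the assumptions of this comment: the function name is invented, the hwmon directory is looked up under the device directory, and power1_cap is taken to be in microwatts.

```shell
#!/bin/sh
# Re-apply what the driver does not pick up from a pp_table write alone.
# $1: device directory, e.g. /sys/class/drm/card0/device
# $2: optional power cap in watts (power1_cap is assumed to take microwatts)
post_pp_table_commit() {
  local dev watts hwmon
  dev=$1
  watts=${2:-}
  if [ -n "$watts" ]; then
    for hwmon in "$dev"/hwmon/hwmon*; do
      if [ -e "$hwmon/power1_cap" ]; then
        echo $(( watts * 1000000 )) > "$hwmon/power1_cap"
      fi
    done
  fi
  # Write "c" to the overdrive interface, even with no pending edits there.
  echo c > "$dev/pp_od_clk_voltage"
}
```

Run as root after `upp set ... --write`, for example `post_pp_table_commit /sys/class/drm/card0/device 120`.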

@nanom1t @wunderbar78 Although the Linux open-source AMD driver stack (kernel, drm, amdgpu, mesa, RADV) is pretty awesome, there are always little quirks like this. None of them has anything to do with upp, so there is no issue I can fix in this project. Please consider contributing the findings in these comments as a "Navi(2) overclocking tips & quirks guide" readme file, because I feel like closing this "issue" at this point.

sibradzic commented 3 years ago

@azeam check the comment above, may be relevant for your project...

wunderbar78 commented 3 years ago

Hey this does the trick! Great!

echo c > /sys/class/drm/card0/device/pp_od_clk_voltage

While trying things out and looking for a way to automate everything without entering passwords, I looked for documentation of the command and found this in the kernel documentation: https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

**pp_table**
The amdgpu driver provides a sysfs API for uploading new powerplay tables. The file pp_table is used for this. Reading the file will dump the current power play table. Writing to the file will attempt to upload a new powerplay table and re-initialize powerplay using that new table.

**pp_od_clk_voltage**
The amdgpu driver provides a sysfs API for adjusting the clocks and voltages in each power level within a power state. The pp_od_clk_voltage is used for this.
[...]
_When you have edited all of the states as needed, write “c” (commit) to the file to commit your changes
If you want to reset to the default power levels, write “r” (reset) to the file to reset them_

So, as far as I understand it, this is the intended behaviour when playing with the clock speeds and power. If you only play around with the pptable (and stay within the limits?), you don't need to commit the changes to sysfs. This kind of makes sense given what I've seen over the last day, and it may explain why CoreCtrl applied the values: it commits the changes. I had a look at the repo but have no idea where to start to confirm that (not a dev). But it works, I know why, and I can absolutely live with how it is now. If you still think it's a bug, let me know and maybe give me a hint where the best place to report it is?!
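Since the kernel documentation quoted above says that reading pp_table dumps the current table and writing it uploads a new one, a backup/restore pair is easy to sketch. The function names are hypothetical, and the restore path was not tried in this thread, so treat it as an assumption rather than a verified procedure.

```shell
#!/bin/sh
# Save the current PowerPlay table of a device directory to a file.
# $1: device directory, e.g. /sys/class/drm/card0/device   $2: backup file
backup_pp_table() {
  cp "$1/pp_table" "$2"
}

# Write a saved table back; per the kernel docs the driver then attempts
# to re-initialize powerplay with it.
restore_pp_table() {
  cp "$2" "$1/pp_table"
}
```

For example, `backup_pp_table /sys/class/drm/card0/device ~/pp_table.orig` before experimenting, and the matching `restore_pp_table` call (as root) to go back.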

Finally, for my solution, I've done the following: I created a script somewhere in my home partition, made an exception for the script and upp in the sudoers file and run it at user login with sudo.

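#!/bin/bash

# Write the modified PP table values (clock limits, power limits, max GFX voltage).
upp set --write smc_pptable/FreqTableUclk/3=900 \
smc_pptable/DcModeMaxFreq/2=900 \
power_saving_clock/max/5=900 \
smc_pptable/SocketPowerLimitDc/0=190 \
smc_pptable/MaxVoltageGfx=4700 \
power_saving_clock/max/0=2000 \
smc_pptable/DcModeMaxFreq/0=2000 \
smc_pptable/SocketPowerLimitAc/0=190 \
overdrive_table/cap/9=1

CARD_ID=card0
# Look up the hwmon directory name that belongs to this card.
HWMON_ID=$(ls -1 /sys/class/drm/${CARD_ID}/device/hwmon)
# power1_cap is in microwatts: 190000000 = 190 W.
echo 190000000 > /sys/class/hwmon/${HWMON_ID}/power1_cap
# Commit on the overdrive interface so the raised VRAM clock limit takes effect.
echo c > /sys/class/drm/${CARD_ID}/device/pp_od_clk_voltage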

MCLK goes down to 625 when idle and goes up to 900 under load. There is some flickering here and there with 625MHz MCLK, but for now it's OK and I prefer the 10W lower power consumption at idle. As long as I don't touch power_dpm_force_performance_level, everything works as it should. As soon as I touch it, gamemoderun is not able to set it to a lower state. I think I'm gonna tell Feral... The sudoers solution is probably not best practice, but it's OK for me. Any suggestions are welcome!

Thanks mate, have a good one!

azeam commented 3 years ago

@azeam check the comment above, may be relevant for your project...

Great! I will do some testing as well when I get back home, should be easy to implement and hopefully make the memory clock easier to set.

chraac commented 3 years ago

I have tested this on my 6900 XT and the trick does not work! When I set the memory frequency in FreqTableUclk/3 to 1050, it gets bumped to FreqTableUclk/2. It also shows "permission denied" when I write "c" to pp_od_clk_voltage, even as root. Is there any other way to apply the mem clock? Please help, thanks! OS: Ubuntu 20.04 LTS with the 21.10 driver. HW: AMD Radeon 6900XT.

sibradzic commented 3 years ago

And it shows permission denied when I write "c" to pp_od_clk_voltage, even with root.

You are probably doing something wrong. Please revert to the upstream driver (6900XT should be pretty stable on ppa kernel 5.11 or 5.12), make sure ppfeaturemask is set correctly and report back.

chraac commented 3 years ago

And it shows permission denied when I write "c" to pp_od_clk_voltage, even with root.

You are probably doing something wrong. Please revert to the upstream driver (6900XT should be pretty stable on ppa kernel 5.11 or 5.12), make sure ppfeaturemask is set correctly and report back.

Thanks for replying. The ppfeaturemask was set and applied via the kernel boot parameter. The core clock set by upp works well; only the mem clock looks abnormal. The kernel version is 5.4, and I don't know if that is the root cause of the problem.
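To double-check that the mask the driver actually sees enables overdrive, one option is to test the overdrive bit of the loaded module parameter. The sketch below assumes the bit is 0x4000 (PP_OVERDRIVE_MASK in the kernel headers) and that the value is exposed at /sys/module/amdgpu/parameters/ppfeaturemask; the function name is made up.

```shell
#!/bin/sh
# Succeed if the overdrive bit (assumed 0x4000) is set in the given mask.
# Accepts hex ("0xffffffff") or decimal, since shell arithmetic parses both.
overdrive_enabled() {
  [ $(( $1 & 0x4000 )) -ne 0 ]
}
```

Example: `overdrive_enabled "$(cat /sys/module/amdgpu/parameters/ppfeaturemask)" && echo "overdrive bit set"`.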

chraac commented 3 years ago

And it shows permission denied when I write "c" to pp_od_clk_voltage, even with root.

You are probably doing something wrong. Please revert to the upstream driver (6900XT should be pretty stable on ppa kernel 5.11 or 5.12), make sure ppfeaturemask is set correctly and report back.

It looks like memory overclocking on the RX6800/6900 is only available with kernel 5.12 and above. Have you tried to set the mem clock on Linux 5.11? I upgraded the kernel to HWE 5.11 and it is still not working correctly.

Ref: https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next&id=37a58f691551dfdff4f1035ee119c9ebdb9eb119
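If the 5.12 threshold mentioned above holds, a script could refuse to touch the memory clock on older kernels. A small sketch of such a check follows; it relies on `sort -V` from GNU coreutils, and the function name is hypothetical.

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2, using version sort.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

Example: `version_at_least "$(uname -r)" 5.12 || echo "kernel may be too old for RX6800/6900 memory overclocking"`.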

sibradzic commented 3 years ago

have you tried to set the mem clock in linux 5.11?

Yes I have, on an RX5700, that is. Unless you lend me an RX6800/6900 to do the testing myself, there is nothing I can do for you. This is clearly a driver issue; just switch to the latest stable mainline kernel, or to Ubuntu 21.10, which has kernel 5.13 by default.