sibradzic / amdgpu-clocks

Simple script to control power states of amdgpu driven GPUs
GNU General Public License v2.0

Vega 56 voltages might not apply #5

Closed FKleinebreil closed 4 years ago

FKleinebreil commented 4 years ago

I'm using an Asus Vega 56 Strix with Ubuntu Budgie 19.04. I wanted to use your tool (which is great, btw!) to undervolt the card and overclock the memory. I suspect that while the clocks apply, the voltages don't.

My custom power states (core clocks as they were + custom voltages (1.2V -> 1.0V for P7), overclocked P3 memory):

OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        900mV
2:       1138Mhz        906mV
3:       1269Mhz        912mV
4:       1312Mhz        918mV
5:       1474Mhz        975mV
6:       1538Mhz        987mV
7:       1590Mhz       1000mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        800mV
2:        700Mhz        900mV
3:        940Mhz        975mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1000mV

After running the script, /sys/class/drm/cardX/device/pp_od_clk_voltage is identical to /etc/default/amdgpu-custom-state.card0 except:

VDDC: 800mV 1200mV

1200mV was the default value, too. When I use WattmanGTK to monitor the Vega 56 during Unigine Superposition the reported vddgfx is 1.2V. Also the GPU temperature hits 80°C, which should not happen if 1.0V were actually applied. The memory clock is read correctly at 940MHz.

Do you have any idea what is going on? Many thanks in advance!

sibradzic commented 4 years ago

Hi there @FKleinebreil. Can you please share:

  1. output of uname -a
  2. output of sudo cat /sys/class/drm/card0/device/pp_od_clk_voltage before you apply any changes
  3. output of sudo cat /etc/default/amdgpu-custom-states.card0
  4. output of sudo amdgpu-clocks, i.e. when you apply change
  5. output of sudo cat /sys/class/drm/card0/device/pp_od_clk_voltage after you apply any changes

(please try to wrap these outputs in "Insert code" when adding comments, for clarity)

BTW, you should not worry too much about the OD_RANGE: outputs, such as VDDC: 800mV 1200mV. These only show the allowed clock and voltage ranges, which should not change at all unless you mess with FORCE_POWER_CAP.

FKleinebreil commented 4 years ago

Thank you for the fast reply! I removed the OD_RANGE: constraints from amdgpu-custom-states.card0.

1. Linux XXXXX 5.0.0-31-generic #33-Ubuntu SMP Mon Sep 30 18:51:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

2.

OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        900mV
2:       1138Mhz        950mV
3:       1269Mhz       1000mV
4:       1312Mhz       1050mV
5:       1474Mhz       1100mV
6:       1538Mhz       1150mV
7:       1590Mhz       1200mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        800mV
2:        700Mhz        900mV
3:        800Mhz        950mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1200mV

3.

OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        900mV
2:       1138Mhz        906mV
3:       1269Mhz       912mV
4:       1312Mhz       918mV
5:       1474Mhz       975mV
6:       1538Mhz       987mV
7:       1590Mhz       1000mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        800mV
2:        700Mhz        900mV
3:        940Mhz        975mV

4.

Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
  SCLK state 0: 852Mhz, 800mV
  SCLK state 1: 991Mhz, 900mV
  SCLK state 2: 1138Mhz, 950mV
  SCLK state 3: 1269Mhz, 1000mV
  SCLK state 4: 1312Mhz, 1050mV
  SCLK state 5: 1474Mhz, 1100mV
  SCLK state 6: 1538Mhz, 1150mV
  SCLK state 7: 1590Mhz, 1200mV
  MCLK state 0: 167Mhz, 800mV
  MCLK state 1: 500Mhz, 800mV
  MCLK state 2: 700Mhz, 900mV
  MCLK state 3: 800Mhz, 950mV
  Maximum clocks & voltages:
    SCLK clock 2400MHz
    MCLK clock 1500MHz
    VDDC voltage 1200mV
  Curent power cap: 260W
Verifying user state values at /etc/default/amdgpu-custom-state.card0:
  SCLK state 0: 852Mhz, 800mV
  SCLK state 1: 991Mhz, 900mV
  SCLK state 2: 1138Mhz, 906mV
  SCLK state 3: 1269Mhz, 912mV
  SCLK state 4: 1312Mhz, 918mV
  SCLK state 5: 1474Mhz, 975mV
  SCLK state 6: 1538Mhz, 987mV
  SCLK state 7: 1590Mhz, 1000mV
  MCLK state 0: 167Mhz, 800mV
  MCLK state 1: 500Mhz, 800mV
  MCLK state 2: 700Mhz, 900mV
  MCLK state 3: 940Mhz, 975mV
Committing custom states to /sys/class/drm/card0/device/pp_od_clk_voltage:
  Done

5.

OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        900mV
2:       1138Mhz        906mV
3:       1269Mhz        912mV
4:       1312Mhz        918mV
5:       1474Mhz        975mV
6:       1538Mhz        987mV
7:       1590Mhz       1000mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        800mV
2:        700Mhz        900mV
3:        940Mhz        975mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1200mV

However, WattmanGTK still reports a vddgfx (which I assume is core voltage) of 1.2V and temps reach 80°C under load quickly, as I said. Or to say it differently: temps are identical before and after the change.
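
A rough way to confirm that the table read back from pp_od_clk_voltage really matches the requested states is sketched below. It is not part of amdgpu-clocks; `parse_states` and `diff_states` are hypothetical helper names, and the sample strings are excerpts of the values posted above rather than live reads from sysfs.

```python
import re

# Excerpt of the custom state file posted above (top SCLK states, top MCLK state).
REQUESTED = """\
OD_SCLK:
6:       1538Mhz        987mV
7:       1590Mhz       1000mV
OD_MCLK:
3:        940Mhz        975mV
"""

# Excerpt of pp_od_clk_voltage as read back after applying the states.
READBACK = """\
OD_SCLK:
6:       1538Mhz        987mV
7:       1590Mhz       1000mV
OD_MCLK:
3:        940Mhz        975mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1200mV
"""

# Matches state rows such as "7:       1590Mhz       1000mV".
STATE_LINE = re.compile(r"^\s*(\d+):\s+(\d+)Mhz\s+(\d+)mV\s*$", re.IGNORECASE)


def parse_states(text):
    """Map (section, level) to (clock in MHz, voltage in mV)."""
    states = {}
    section = None
    for line in text.splitlines():
        name = line.strip().rstrip(":")
        if name in ("OD_SCLK", "OD_MCLK", "OD_RANGE"):
            section = name
            continue
        match = STATE_LINE.match(line)
        # OD_RANGE rows have no numeric level, so they never match here.
        if match and section in ("OD_SCLK", "OD_MCLK"):
            level, mhz, mv = (int(g) for g in match.groups())
            states[(section, level)] = (mhz, mv)
    return states


def diff_states(requested, readback):
    """Return the requested states that the readback does not reproduce."""
    return {
        key: (value, readback.get(key))
        for key, value in requested.items()
        if readback.get(key) != value
    }


mismatches = diff_states(parse_states(REQUESTED), parse_states(READBACK))
print(mismatches or "table matches the requested states")
```

An empty result only shows that the driver accepted the table. It says nothing about the voltage the hardware actually runs at, which is why a separate measured reading is needed.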

sibradzic commented 4 years ago

However, WattmanGTK still reports a vddgfx (which I assume is core voltage) of 1.2V and temps reach 80°C under load quickly, as I said. Or to say it differently: temps are identical before and after the change.

I have no clue about WattmanGTK. Just to be sure it is sane, compare it against the output of (as root) watch -n1 "cat /sys/kernel/debug/dri/0/amdgpu_pm_info | tail -n16". Keep the terminal "always on top" and compare the SCLK, MCLK, voltage, wattage and temperature outputs.

FKleinebreil commented 4 years ago

Sorry, this is my first time using the GitHub forum. There were proper newlines before the "Insert code" blocks and I don't know how to add newlines in that environment. I'll google it and edit it.

I cross checked with the command you gave me and it too reports a VDDGFX of 1200mV, while the changed memory clock (940 MHz) seems to be applied correctly.

Typical output:

GFX Clocks and Power:
        940 MHz (MCLK)
        1591 MHz (SCLK)
        1269 MHz (PSTATE_SCLK)
        700 MHz (PSTATE_MCLK)
        1200 mV (VDDGFX)
        169.0 W (average GPU)

GPU Temperature: 60 C
GPU Load: 99 %
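
To compare such readings against the requested undervolt without eyeballing them, the output can be parsed with a few lines of Python. This is only a sketch: `parse_pm_info` and `undervolt_applied` are hypothetical names, the 25 mV tolerance is an arbitrary assumption, and the sample text is the output pasted above rather than a live read of amdgpu_pm_info.

```python
import re

PM_INFO_SAMPLE = """\
GFX Clocks and Power:
        940 MHz (MCLK)
        1591 MHz (SCLK)
        1269 MHz (PSTATE_SCLK)
        700 MHz (PSTATE_MCLK)
        1200 mV (VDDGFX)
        169.0 W (average GPU)

GPU Temperature: 60 C
GPU Load: 99 %
"""

# Matches rows such as "1200 mV (VDDGFX)" or "169.0 W (average GPU)".
READING = re.compile(r"^\s*([\d.]+)\s+(MHz|mV|W)\s+\((.+)\)\s*$")


def parse_pm_info(text):
    """Map each labelled reading (e.g. 'VDDGFX') to its numeric value."""
    readings = {}
    for line in text.splitlines():
        match = READING.match(line)
        if match:
            readings[match.group(3)] = float(match.group(1))
    return readings


def undervolt_applied(readings, requested_mv, tolerance_mv=25):
    """True if the reported VDDGFX is at or below the requested voltage.

    tolerance_mv is an assumed margin, not a value taken from the driver.
    """
    return readings["VDDGFX"] <= requested_mv + tolerance_mv


readings = parse_pm_info(PM_INFO_SAMPLE)
print(readings["MCLK"], readings["VDDGFX"])
print("undervolt applied:", undervolt_applied(readings, requested_mv=1000))
```

With the sample above, the memory clock matches the requested 940 MHz while the VDDGFX check fails for a 1000 mV target, which is the symptom described in this thread.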

sibradzic commented 4 years ago

Thanks for the output fix! According to the output, there is nothing wrong with the script itself; the clocks and voltages are being set correctly. As for the VDDGFX report, it is indeed strange. It could be related to the requirement to set the clocks and voltages in order (see https://github.com/RadeonOpenCompute/ROCm/issues/463 for details). Perhaps you can test applying the following /etc/default/amdgpu-custom-states.card0:

OD_SCLK:
7:       1590Mhz       1170mV
OD_MCLK:
3:        800Mhz        930mV
FORCE_SCLK: 7
FORCE_MCLK: 1
FORCE_PERF_LEVEL: manual

Reboot before you try, then check the output of cat /sys/kernel/debug/dri/0/amdgpu_pm_info | tail -n16 before and after you apply the custom states (no need to generate GPU load, since both SCLK and MCLK states are being forced here), and let me know...
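
For reference, the kernel's amdgpu documentation describes editing pp_od_clk_voltage on this generation of cards by writing strings of the form "s level clock voltage" or "m level clock voltage", followed by "c" to commit. The sketch below turns the OD_SCLK and OD_MCLK entries of the state file above into such strings. It is an illustration under that assumption, not the script's actual code; `od_commands` is a hypothetical name, and the FORCE_* lines are simply skipped here.

```python
import re

CUSTOM_STATE = """\
OD_SCLK:
7:       1590Mhz       1170mV
OD_MCLK:
3:        800Mhz        930mV
FORCE_SCLK: 7
FORCE_MCLK: 1
FORCE_PERF_LEVEL: manual
"""

# Section header mapped to the command letter used by pp_od_clk_voltage.
PREFIX = {"OD_SCLK": "s", "OD_MCLK": "m"}
STATE_LINE = re.compile(r"^\s*(\d+):\s+(\d+)Mhz\s+(\d+)mV\s*$", re.IGNORECASE)


def od_commands(text):
    """Build the strings that would be written to pp_od_clk_voltage."""
    commands = []
    section = None
    for line in text.splitlines():
        name = line.strip().rstrip(":")
        if name in PREFIX:
            section = name
            continue
        match = STATE_LINE.match(line)
        if match and section:
            level, mhz, mv = match.groups()
            commands.append(f"{PREFIX[section]} {level} {mhz} {mv}")
    if commands:
        commands.append("c")  # commit the edited table
    return commands


for command in od_commands(CUSTOM_STATE):
    print(command)
```

Each string would be written to the sysfs file as root, one write per string. The sketch only prints them, so it is safe to run on any machine.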

Also, I don't have a Vega to test this on, but I know that Polaris, for example, has some strange limitations in how core state and memory state voltages relate (the higher of the current SCLK and MCLK state voltages will "prevail" as VDDGFX). Vega 56/64 may have similar quirks to consider, so try simpler custom states first (changing just one state at a time) and see what result you end up with.
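
The Polaris rule mentioned above can be written down as a one-line calculation. The sketch below assumes the same rule holds on Vega, which is unconfirmed, and `expected_vddgfx` is a hypothetical helper. The state tables use the suggested test values, with the stock 800 mV for MCLK state 1.

```python
def expected_vddgfx(sclk_states, mclk_states, sclk_level, mclk_level):
    """Return the higher of the active SCLK and MCLK state voltages, in mV.

    Each table maps a level to (clock in MHz, voltage in mV). This models the
    Polaris behaviour described above; Vega may differ.
    """
    sclk_mv = sclk_states[sclk_level][1]
    mclk_mv = mclk_states[mclk_level][1]
    return max(sclk_mv, mclk_mv)


# Suggested test state: SCLK 7 at 1170 mV, MCLK forced to state 1 (stock 800 mV).
sclk = {7: (1590, 1170)}
mclk = {1: (500, 800), 3: (800, 930)}
print(expected_vddgfx(sclk, mclk, sclk_level=7, mclk_level=1))

# A case where the memory voltage would prevail: low core state, top memory state.
print(expected_vddgfx({0: (852, 800)}, {3: (940, 975)}, sclk_level=0, mclk_level=3))
```

Under this rule the suggested test state should report about 1170 mV, so a reading well above that would point to something other than the SCLK/MCLK voltage interaction.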

FKleinebreil commented 4 years ago

I edited /etc/default/amdgpu-custom-states.card0 as you suggested, rebooted and applied the changes. This is amdgpu_pm_info:

GFX Clocks and Power:
    800 MHz (MCLK)
    1633 MHz (SCLK)
    1269 MHz (PSTATE_SCLK)
    700 MHz (PSTATE_MCLK)
    1200 mV (VDDGFX)
    37.0 W (average GPU)

GPU Temperature: 48 C
GPU Load: 0 %

SMC Feature Mask: 0x000000001ba1fb4f
UVD: Disabled

VCE: Disabled

So the issue remains. However, I get that the issue is not with your script but somewhere else. So there is probably not much you can do about it. I will see if I can find someone with similar issues.

Thank you very much for your time!

sibradzic commented 4 years ago

If you really applied that custom state I suggested correctly, that output of amdgpu_pm_info is super strange indeed. Are you sure you don't have some other over/under/clocking/volting software active at the same time? Do you perhaps have multiple graphic cards in your system?

FKleinebreil commented 4 years ago

I don't have other software active and only a single Vega 56. No iGPU either, CPU is a Ryzen 5 3600.

However, I consider the issue closed as it's most likely not due to your script.

sibradzic commented 4 years ago

This seems to be the bug affecting you, and apparently the fix was just applied to 5.4 RC: https://bugs.freedesktop.org/show_bug.cgi?id=109887 https://bugzilla.kernel.org/show_bug.cgi?id=205277

FKleinebreil commented 4 years ago

Thank you very much for the Update!

xcom169 commented 2 years ago

I'm on kernel 5.15.28, but I think I have the same issue with a Vega 56.