sibradzic / amdgpu-clocks

Simple script to control power states of amdgpu driven GPUs
GNU General Public License v2.0

5700 XT states reverting to default after application #37

Closed Halornek closed 2 years ago

Halornek commented 2 years ago

Thanks for making this software.

I'm attempting to apply some undervolt settings to my 5700 XT running in a headless install of Ubuntu Server 20.04.3 LTS.

I'm currently running the base kernel, 5.4.0-89-generic with AMDGPU open driver version 20.45.

I set up the amdgpu.ppfeaturemask=0xffffffff under GRUB, and set up both the amdgpu-clocks script and the amdgpu-custom-states file under /etc/default.
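One way to confirm that the GRUB parameter actually reached the running kernel is to look for it on the kernel command line. The following is a minimal sketch, not part of amdgpu-clocks; the function name `ppfeaturemask_from_cmdline` is made up for illustration, and it only inspects a string you pass in (for example the contents of /proc/cmdline).

```python
import re


def ppfeaturemask_from_cmdline(cmdline: str):
    """Return the amdgpu.ppfeaturemask value found in a kernel command line.

    Hypothetical helper for illustration. Returns None when the parameter
    is absent from the given string.
    """
    match = re.search(r"(?:^|\s)amdgpu\.ppfeaturemask=(\S+)", cmdline)
    if match is None:
        return None
    # int(..., 0) accepts both the 0x... hexadecimal and plain decimal forms.
    return int(match.group(1), 0)
```

On a live system the input would typically come from `open("/proc/cmdline").read()`; a result of `None` suggests GRUB was not updated or the machine has not been rebooted since the change.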

# For Navi (and Radeon7) we can only set highest SCLK & MCLK, "state 1":
OD_SCLK:
0: 800Mhz
1: 1300MHz

# Set Mem Clock
OD_MCLK:
1: 875MHz

# Set Voltage Offset (Navi2 with AMDGPU 21.**)
# OD_VDDGFX_OFFSET:
# -300mV

# More fine-grain control of clocks and voltages are done with VDDC curve:
OD_VDDC_CURVE:
0: 800MHz 750mV
1: 1000MHz 775mV
2: 1300MHz 800mV

# Force power limit (in micro watts):
FORCE_POWER_CAP: 165000000

# Force performance level:
FORCE_PERF_LEVEL: manual

(OD_VDDGFX_OFFSET is commented out as it was a reference from the original file I used for my 6900 XT on my desktop)
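To make the layout of the custom states file above easier to reason about, here is a rough sketch of how it could be read into a dictionary. This is not the parser amdgpu-clocks itself uses (that is a shell script); the function names `parse_custom_states` and `microwatts_to_watts` are assumptions for illustration only.

```python
def parse_custom_states(text: str) -> dict:
    """Parse an amdgpu-custom-states style file into a dictionary.

    Illustrative sketch: section headers such as "OD_SCLK:" map to
    {index: value} dictionaries, FORCE_* lines map to plain strings,
    and anything after a '#' is treated as a comment.
    """
    states = {}
    section = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key.startswith("FORCE_"):
            states[key] = value
            section = None
        elif value == "":
            # A bare "NAME:" line opens a new section.
            section = key
            states[section] = {}
        elif section is not None and key.isdigit():
            states[section][int(key)] = value
    return states


def microwatts_to_watts(microwatts: int) -> float:
    """Convert a FORCE_POWER_CAP value (microwatts) to watts."""
    return microwatts / 1_000_000
```

With the file above, `microwatts_to_watts(int(states["FORCE_POWER_CAP"]))` gives 165.0, which matches the 165W cap the script reports.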

Before running the script, my pp_od_clk_voltage file outputs this when I run cat:

OD_SCLK:
0: 800Mhz
1: 2079Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 800MHz 706mV
1: 1439MHz 812mV
2: 2079MHz 1157mV
OD_RANGE:
SCLK:     800Mhz       2150Mhz
MCLK:     625Mhz        950Mhz
VDDC_CURVE_SCLK[0]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[0]:     750mV        1200mV
VDDC_CURVE_SCLK[1]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[1]:     750mV        1200mV
VDDC_CURVE_SCLK[2]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[2]:     750mV        1200mV
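The OD_RANGE block at the end of that output lists the limits the driver reports for each value. As a sketch only (the names `parse_od_range` and `within_range` are made up here, and this is not how amdgpu-clocks does its own verification), the limits could be extracted and checked like this:

```python
import re

# Matches lines such as "VDDC_CURVE_VOLT[0]:     750mV        1200mV".
RANGE_LINE = re.compile(
    r"^(\S+):\s+(\d+)(?:mhz|mv)\s+(\d+)(?:mhz|mv)$", re.IGNORECASE
)


def parse_od_range(text: str) -> dict:
    """Return {name: (low, high)} from the OD_RANGE part of pp_od_clk_voltage."""
    ranges = {}
    seen_header = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "OD_RANGE:":
            seen_header = True
            continue
        if not seen_header:
            continue
        match = RANGE_LINE.match(line)
        if match:
            ranges[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return ranges


def within_range(ranges: dict, name: str, value: int) -> bool:
    """Check a requested value against the reported limits for `name`."""
    low, high = ranges[name]
    return low <= value <= high
```

Applied to the output above, the requested 750mV for curve point 0 is inside the reported 750mV to 1200mV window, while the default 706mV shown for that point is below the reported lower bound.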

When I attempt to run the script, the terminal will hang for a few seconds then output the following.

Writen initial backup states to /tmp/amdgpu-custom-states.card0.initial
Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
  SCLK state 0: 800Mhz
  SCLK state 1: 2079Mhz
  MCLK state 1: 875MHz
  VDDC Curve state 0: 800MHz 706mV
  VDDC Curve state 1: 1439MHz 812mV
  VDDC Curve state 2: 2079MHz 1155mV
  Maximum clocks & voltages:
    SCLK clock 2150Mhz
    MCLK clock 950Mhz
  Curent power cap: 190W
Verifying user state values at /etc/default/amdgpu-custom-states.card0:
  SCLK state 0: 800Mhz
  SCLK state 1: 1300MHz
  MCLK state 1: 875MHz
  VDDC Curve state 0: 800MHz 750mV
  VDDC Curve state 1: 1000MHz 775mV
  VDDC Curve state 2: 1300MHz 800mV
  Force power cap to 165W
  Force performance level to manual
Committing custom states to /sys/class/drm/card0/device/pp_od_clk_voltage:
  Done

If I run cat immediately on pp_od_clk_voltage, everything appears normal.

OD_SCLK:
0: 800Mhz
1: 1300Mhz
OD_MCLK:
1: 900MHz
OD_VDDC_CURVE:
0: 800MHz 750mV
1: 1000MHz 775mV
2: 1300MHz 800mV
OD_RANGE:
SCLK:     800Mhz       2150Mhz
MCLK:     625Mhz        950Mhz
VDDC_CURVE_SCLK[0]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[0]:     750mV        1200mV
VDDC_CURVE_SCLK[1]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[1]:     750mV        1200mV
VDDC_CURVE_SCLK[2]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[2]:     750mV        1200mV

However, if I then wait a few seconds and run cat again, some of the settings appear to have been reverted.

OD_SCLK:
0: 800Mhz
1: 2100Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 800MHz 750mV
1: 1450MHz 806mV
2: 2100MHz 793mV
OD_RANGE:
SCLK:     800Mhz       2150Mhz
MCLK:     625Mhz        950Mhz
VDDC_CURVE_SCLK[0]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[0]:     750mV        1200mV
VDDC_CURVE_SCLK[1]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[1]:     750mV        1200mV
VDDC_CURVE_SCLK[2]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[2]:     750mV        1200mV
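Spotting which entries reverted by eye is error-prone, so a small comparison can help. The sketch below is an illustration under my own assumptions (the names `parse_states` and `diff_states` are hypothetical, and only the numeric values are compared so that "Mhz" and "MHz" are treated alike); it is not part of amdgpu-clocks.

```python
import re

ENTRY = re.compile(r"^(\d+):\s+(.+)$")


def parse_states(text: str) -> dict:
    """Parse OD_* sections into {section: {index: [numbers]}}.

    Illustrative only: OD_RANGE is skipped, and lines that are not
    "index: value" entries inside a section are ignored.
    """
    states, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(":") and " " not in line:
            section = None if line == "OD_RANGE:" else line[:-1]
            if section:
                states[section] = {}
            continue
        match = ENTRY.match(line)
        if match and section:
            numbers = re.findall(r"\d+", match.group(2))
            states[section][int(match.group(1))] = [int(n) for n in numbers]
    return states


def diff_states(wanted: dict, actual: dict) -> list:
    """List (section, index, wanted, actual) for every entry that differs."""
    diffs = []
    for section, entries in wanted.items():
        for index, values in entries.items():
            got = actual.get(section, {}).get(index)
            if got != values:
                diffs.append((section, index, values, got))
    return diffs
```

Feeding it the custom states file as `wanted` and the output above as `actual` flags SCLK state 1 and VDDC curve points 1 and 2, while SCLK state 0, MCLK state 1 and curve point 0 still match.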

Once this has happened, I'm unable to run many applications that use OpenCL or Mesa. From some research, this seems to be tied to unstable voltage (but that's a separate issue).

At one point in time, I had everything working with the exact process mentioned above. Overclocks/Undervolts would apply, and the system would run stable. I restarted the system to apply a patch and started experiencing this issue again. After performing a clean install, I'm not able to replicate a successful application.

Please let me know if I missed something somewhere along the line, such as in the custom states file, or if I can provide additional information.

sibradzic commented 2 years ago

> Ubuntu Server 20.04.3 LTS. I'm currently running the base kernel, 5.4.0-89-generic with AMDGPU open driver version 20.45.

That's pretty outdated. Try moving to a recent kernel (anything from 5.1x will do) and uninstalling the AMDGPU driver. The AMDGPU open & pro drivers often cause power management issues, especially when built against older kernels. Mainline drivers are much better in that regard; google "kernel ppa" and check the available options.

> When I attempt to run the script, the terminal will hang for a few seconds then output the following

There should be no hang. It is possible that the driver is crashing and resetting the GPU; dmesg should tell you the details.

> However, if I then wait a few seconds and run cat again, some of the settings appear to have been reverted.

There must be something in the kernel dmesg log when the "revert" happens. Please share the continuous output from before applying amdgpu-clocks for the first time until shortly after the revert happens. Are you absolutely sure you aren't running some other over/under clock/volt tool that is messing with your power settings?

Oh, by the way, your issue probably has nothing to do with amdgpu-clocks itself, and there is not much I can do to fix an actual driver provided by AMD. So switch to a recent mainline kernel, remove the non-mainline "driver", and check everything again... That distro is also pretty old; try booting something like 21.10 for the test. Everything should just work out of the box, with no need to install any "driver" or anything.

Halornek commented 2 years ago

Thanks for the quick response. I figured this was a driver issue somewhere along the line, though I still greatly appreciate the assistance and guidance. I was hoping to get it working with the official drivers for easy OpenCL support, but I can always work with ROCm.

I've attached a dmesg log from just before running the clocks script (At least I think this is the correct file).

dmesglog202110312128.txt

It looks as if you were correct about the driver resetting the GPU.

I shouldn't have anything else affecting clocks/voltage, as this was a fresh install as of about 8 hours ago.

I uninstalled AMDGPU 20.45, rebooted, and attempted to run the script. This actually appeared to function, but left me without OpenCL support due to some issues installing ROCm on kernel 5.4. I updated to kernel 5.11, and even without the AMDGPU open drivers I once again experienced the same behavior as before (the script hangs for a few seconds, reports success, and then pp_od_clk_voltage shows that some of the items have reverted).

I tested 5.11 both with and without AMDGPU 21.30 drivers and both experienced the issue.

At this point, this seems to be something driver-related that I do not believe you would be able to fix. If you could point me in a direction that I could research on my own, I would appreciate it.

sibradzic commented 2 years ago

> I've attached a dmesg log from just before running the clocks script (At least I think this is the correct file).

Yes, but what's in there after you apply the script and the revert happens?

> point me in the direction that I could research on my own I would appreciate it

To double check if you are really running into the driver issue, clean install Ubuntu 21.10 without any AMDGPU stuff, set amdgpu.ppfeaturemask=0xffffffff, and try the script.

Halornek commented 2 years ago

> Yes, but what's in there after you apply the script and the revert happens?

Sorry, I should have been more specific. That attachment was just after startup and running the script. Line 6 [ 112.390963] is when I ran the script. Anything after that happened during or after the script.
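For anyone going through a similar log, selecting only the lines at or after the moment the script was started narrows things down. This is a small sketch that assumes the default dmesg layout of a bracketed seconds-since-boot timestamp followed by the message; the function name `lines_since` is made up for illustration.

```python
import re

# Default dmesg lines look like "[  112.390963] message text".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def lines_since(dmesg_text: str, start_seconds: float) -> list:
    """Return (timestamp, message) pairs at or after start_seconds.

    Lines without a leading bracketed timestamp are skipped.
    """
    selected = []
    for line in dmesg_text.splitlines():
        match = STAMP.match(line)
        if match and float(match.group(1)) >= start_seconds:
            selected.append((float(match.group(1)), match.group(2)))
    return selected
```

Calling `lines_since(log_text, 112.390963)` on the attached log would keep everything from the point where the script was run onward.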

> To double check if you are really running into the driver issue, clean install Ubuntu 21.10 without any AMDGPU stuff, set amdgpu.ppfeaturemask=0xffffffff, and try the script.

I was actually already working on that with a clean install of 21.10.

This functioned perfectly fine. The script ran with no issues and I was able to verify the states with cat on pp_od_clk_voltage.

Script output:

Writen initial backup states to /tmp/amdgpu-custom-states.card0.initial
Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
  SCLK state 0: 800Mhz
  SCLK state 1: 2079Mhz
  MCLK state 1: 875MHz
  VDDC Curve state 0: 800MHz 705mV
  VDDC Curve state 1: 1439MHz 812mV
  VDDC Curve state 2: 2079MHz 1162mV
  Maximum clocks & voltages:
    SCLK clock 2150Mhz
    MCLK clock 950Mhz
  Curent power cap: 190W
Verifying user state values at /etc/default/amdgpu-custom-states.card0:
  SCLK state 0: 800Mhz
  SCLK state 1: 1300MHz
  MCLK state 1: 875MHz
  VDDC Curve state 0: 800MHz 750mV
  VDDC Curve state 1: 1000MHz 775mV
  VDDC Curve state 2: 1300MHz 800mV
  Force power cap to 165W
  Force performance level to manual
Committing custom states to /sys/class/drm/card0/device/pp_od_clk_voltage:
  Done

cat output:

OD_SCLK:
0: 800Mhz
1: 1300Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 800MHz 800mV
1: 1000MHz 800mV
2: 1300MHz 800mV
OD_RANGE:
SCLK:     800Mhz       2150Mhz
MCLK:     625Mhz        950Mhz
VDDC_CURVE_SCLK[0]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[0]:     750mV        1200mV
VDDC_CURVE_SCLK[1]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[1]:     750mV        1200mV
VDDC_CURVE_SCLK[2]:     800Mhz       2150Mhz
VDDC_CURVE_VOLT[2]:     750mV        1200mV

At this point, pretty confident we can say it's just a driver issue. I at least have a starting point to figure out OpenCL support.

Thank you once again for your help. This is a great tool and you have been very quick in your responses.

Halornek commented 2 years ago

Thank you again for your help. I thought I would post an update in case anyone else gets stuck in the same scenario: running Ubuntu with a Navi1 (5000 series) GPU, wanting OpenCL support through the AMDGPU drivers ("Open" or Pro), and needing some form of clock/voltage control.

I managed to get everything working with the following process (Done on Ubuntu Server, but should work on Ubuntu Desktop):

1. Install Ubuntu 20.04.3 LTS
2. Compile Linux Kernel 5.11 from source and apply packages
3. Reboot into Linux Kernel 5.11 and purge original 5.4 headers
4. Add the amdgpu.ppfeaturemask=0xffffffff to GRUB
5. Update Grub and reboot
6. Verify custom states are loading and sticking
7. Install AMDGPU version 21.30 with --opencl=rocr and reboot
8. Verify custom states are still loading and sticking

This functioned fine for me after a clean install.
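Since the process above depends on actually booting into the newer kernel, a quick check of the running release can save some confusion. The sketch below is only an illustration: the helper names are made up, and the (5, 11) threshold simply mirrors the kernel version used in this process rather than any documented minimum.

```python
import re


def kernel_version(release: str) -> tuple:
    """Extract (major, minor) from a release string like '5.4.0-89-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release: str, minimum: tuple = (5, 11)) -> bool:
    """True when the release is at or above `minimum`.

    The default threshold is an assumption taken from the steps above.
    """
    return kernel_version(release) >= minimum
```

On a live system the release string can be obtained with `platform.release()` from the Python standard library, which returns the same value as `uname -r`.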