sibradzic / amdgpu-clocks

Simple script to control power states of amdgpu driven GPUs
GNU General Public License v2.0

Radeon 6000 #32

Closed erickgruis closed 3 years ago

erickgruis commented 3 years ago

I can't get the script to work with an AMD 6900 XT. The pp_od_clk_voltage file has slightly different output than it did on my Radeon VII:

OD_SCLK:
0: 500Mhz
1: 1200Mhz
OD_MCLK:
0: 97Mhz
1: 1075MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       3000Mhz
MCLK:     674Mhz       1075Mhz

I suspect that the OD_VDDGFX_OFFSET line is causing the issue, but my coding skills aren't that strong and I couldn't quite break down the script to confirm that.

Here is my custom states file:

OD_SCLK:
1: 1200Mhz
OD_MCLK:
1: 1075Mhz

Here is the script output:

Won't write initial state to /tmp/amdgpu-custom-states.card0.initial, it already exists.
Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
  SCLK state 0: 500Mhz
  SCLK state 1: 1200Mhz
  MCLK state 0: 97Mhz
  MCLK state 1: 1075MHz
  Unexpected value in /sys/class/drm/card0/device/pp_od_clk_voltage:7

sibradzic commented 3 years ago

Thanks for reporting. What are your kernel and driver versions? Can you please properly format the console outputs of your 6900 XT's /sys/class/drm/card0/device/pp_od_clk_voltage and amdgpu-custom-states.card0 in your previous message, using "Insert code" or similar?

erickgruis commented 3 years ago

Hopefully that formatting is OK.

Kernel version 5.12, mhwd-amdgpu version 19.1.0-1, xf86-video-amdgpu version 19.1.0-2.

I also have the OpenCL driver from the AMD Pro package installed: opencl-amd version 20.50.1234664-5.

I had the same issue on kernels 5.11 and 5.10.

sibradzic commented 3 years ago

Hopefully that formatting is OK.

No, it is slightly worse than before :) All newlines are missing. I can only help you if you provide output identical to what your console is giving you.

Kernel version 5.12

That hasn't been released yet. Are you on some RC? If yes, which one?

mhwd-amdgpu version 19.1.0-1

Sorry, what is that mhwd thing?

So, is your amdgpu driver from upstream 5.12 rcX or from amdgpu-pro 20.50? Can you paste the output of this one ("code" formatted please): modinfo amdgpu | grep "file\|vers\|magic\|desc" ?

erickgruis commented 3 years ago

I figured out the formatting finally.

modinfo amdgpu | grep "file\|vers\|magic\|desc"
description:    AMD GPU
vermagic:       5.12.0-1-MANJARO SMP preempt mod_unload

mhwd-amdgpu is on all of my Manjaro systems. The description says:

"MHWD module-ids for amdgpu"

The kernel is 5.12 rc3 on Manjaro. I have 3 systems with identical Linux installs. The other 2 run Radeon VIIs (one of them has 2 cards), and they work fine for mining.

I'm running the amdgpu driver from the kernel with the OpenCL driver pulled from amdgpu-pro 20.50. The AUR has a package that extracts the OpenCL driver from the pro package and installs it alongside the amdgpu kernel driver.

Your help is greatly appreciated!

erickgruis commented 3 years ago

I have just confirmed that things work correctly in Windows 10. After adding the OpenCL registry entries, ethminer loads up both GPUs, producing the expected hashrate (63.00).

So, I'm relieved it's probably not a hardware issue, but I suspect something isn't right with the amdgpu kernel driver and/or the more recent amdgpu-pro OpenCL driver.

sibradzic commented 3 years ago

I figured out the formatting finally.

Not quite :) I've re-formatted your first comment, but I have no clue if it really matches the original console output. Check https://docs.github.com/en/github/writing-on-github/basic-writing-and-formatting-syntax#quoting-code

Anyway, it seems that there is a new addition to the kernel, one that adds OD_VDDGFX_OFFSET to the PowerPlay interface, where it is now directly controlled through pp_od_clk_voltage. This OD_VDDGFX_OFFSET thing is not yet supported by amdgpu-clocks. I could try adding some untested support for it... Since I don't have an RX6x00 card, I can't test the change myself, but you could.

Can I count on you to help test it out?
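
For illustration only, here is a minimal Python sketch (not the script's actual bash code) of a parser that accepts the extra OD_VDDGFX_OFFSET section shown in the first comment. The function name `parse_od_table` and the returned dictionary layout are assumptions made for this example; only the section names and line formats come from the output pasted above.

```python
import re

# Line formats taken from the pp_od_clk_voltage output pasted in this thread.
STATE_RE = re.compile(r"(\d+):\s*(\d+)mhz", re.IGNORECASE)
OFFSET_RE = re.compile(r"(-?\d+)mv", re.IGNORECASE)
RANGE_RE = re.compile(r"(\w+):\s+(\d+)mhz\s+(\d+)mhz", re.IGNORECASE)

SAMPLE = """\
OD_SCLK:
0: 500Mhz
1: 1200Mhz
OD_MCLK:
0: 97Mhz
1: 1075MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       3000Mhz
MCLK:     674Mhz       1075Mhz
"""


def parse_od_table(text):
    """Parse pp_od_clk_voltage-style text (hypothetical helper)."""
    table = {"OD_SCLK": {}, "OD_MCLK": {}, "OD_VDDGFX_OFFSET": None, "OD_RANGE": {}}
    section = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        # A known section header switches the parsing mode.
        if line.endswith(":") and line[:-1] in table:
            section = line[:-1]
            continue
        if section in ("OD_SCLK", "OD_MCLK") and (m := STATE_RE.fullmatch(line)):
            table[section][int(m.group(1))] = int(m.group(2))
        elif section == "OD_VDDGFX_OFFSET" and (m := OFFSET_RE.fullmatch(line)):
            # Single value, placed on the line below its header.
            table[section] = int(m.group(1))
        elif section == "OD_RANGE" and (m := RANGE_RE.fullmatch(line)):
            table[section][m.group(1)] = (int(m.group(2)), int(m.group(3)))
        else:
            # Anything else is reported with its line number.
            raise ValueError(f"Unexpected value at line {lineno}: {line!r}")
    return table


if __name__ == "__main__":
    print(parse_od_table(SAMPLE))
```

A parser that lacks the OD_VDDGFX_OFFSET branch would fall through to the error case at line 7 of this sample, which is consistent with the "Unexpected value in ...:7" message reported above.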

lilwebsite commented 3 years ago

Hi, I have a 6000 series card and can help test this out. I am manually downclocking at the moment and would like to see this setting added, so I'll share my experience attempting to undervolt this card, since as of right now I haven't been able to fully understand the OD_VDDGFX_OFFSET setting. Maybe you will have better insight than me when it comes to implementing this.

uname -a output for reference: Linux --- 5.12.0-1-mainline #1 SMP PREEMPT Tue, 27 Apr 2021 21:11:46 +0000 x86_64 GNU/Linux

Currently, if I try using the OD_VDDC_CURVE option, it gets ignored: I get a write error and nothing shows up in dmesg. The OD_VDDGFX_OFFSET option works, but doesn't seem to be limited like the other settings. I can enter any number into OD_VDDGFX_OFFSET without error, whereas an out-of-range core clock, for example, gives a write error or a warning in dmesg. Using the default voltage setting from Radeon Software for Windows, which is 1050mV, as a reference point, I set OD_VDDGFX_OFFSET to -169, which (assuming the voltage defaults to 1050mV) should give me 881mV, the minimum you can set in Windows. I got insane artifacts immediately, so it's probably not doing what I'm expecting, unless Mesa has issues running at those voltages. Running a game with intense graphics settings does show voltages up to around 1080mV in /sys/class/hwmon/hwmon1/in0_input with the card at default settings, so it is a bit confusing not being able to lock the voltage to a specific target, but I am probably misunderstanding how to use this setting. At idle, 800mV flat is reported back, sometimes a bit lower.

Thank you for your time on this project and for reading this, feel free to ask if you want me to test some things out or if you want some other info.

sibradzic commented 3 years ago

Thanks for reaching out @lilwebsite. Ok, I've just pushed the testing OD_VDDGFX_OFFSET branch, please check if it works for you. I'm curious if it can set negative OD_VDDGFX_OFFSET values, let me know how it goes for you.

Off-topic: have you perhaps tried https://github.com/sibradzic/upp to control voltages and clocks directly?

lilwebsite commented 3 years ago

Nice, that was quick. I will test it out tonight. I have not seen the upp project, so I will test it out as well. PowerPlay tables seem more effective for voltage control, but when I read from the table using plain cat I wasn't sure what to do with it since I got binary output, so I stuck to setting things with pp_od_clk_voltage since I was already used to that.

lilwebsite commented 3 years ago

Evening @sibradzic, I have tested it and the setting seems to work, but I found a problem with the script separate from the one in this thread. Let me report back on the OD_VDDGFX_OFFSET setting first.

First, the setting you added works, and I was able to confirm that by giving my GPU a -100mV offset with it. To correct my original post: I think the GPU artifacting I got came from me misunderstanding the output of pp_od_clk_voltage and entering the wrong values, which is related to that other issue. As for the offset and its behavior, a -100mV offset doesn't push the voltage below the graphics card's minimum (which seems to be 800mV). Instead, the voltage stays clamped at the absolute minimum until the requested voltage rises above the minimum plus the size of the offset. So in my case, since I set -100mV, the voltage would only increase from 800mV when the load demanded above 900mV. I also double-checked my results by giving the card a higher 2400MHz clock and setting the offset to -120mV, -100mV, and -50mV. All of them behaved as expected, reducing the voltage from the 1080mV ceiling to 960mV, 980mV, and 1030mV respectively. I am not sure about the behavior of any voltages above the ceiling. I can test this later, but I would want to make sure I am not raising them too high, since I'm unsure how these cards behave past 1100mV.
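
The observations above fit a simple "add the offset, then clamp at the floor" model. The Python sketch below is only that: a model inferred from the numbers reported in this comment, not something taken from the driver. The function name `effective_voltage_mv` and the 800mV default floor are assumptions for illustration.

```python
def effective_voltage_mv(requested_mv, offset_mv, floor_mv=800):
    """Model of the observed behavior: apply the offset, never go below the floor.

    floor_mv=800 is the minimum reported in this thread for one card;
    other cards may differ.
    """
    return max(floor_mv, requested_mv + offset_mv)


if __name__ == "__main__":
    # Offsets tested in the thread against the 1080mV ceiling.
    for offset in (-120, -100, -50):
        print(offset, effective_voltage_mv(1080, offset))
    # With a -100mV offset, the voltage leaves the floor only once
    # the load demands more than 900mV.
    for demand in (850, 900, 950):
        print(demand, effective_voltage_mv(demand, -100))
```

Under this model the three tested offsets give 960mV, 980mV, and 1030mV from a 1080mV demand, matching the reported readings.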

I do have a suggestion: when I went to edit the config file under /etc/default/amdgpu-custom-state-overclock.card0, I had to go read the script file to figure out where to put the OD_VDDGFX_OFFSET value. Here's what my config file looks like right now:

OD_SCLK:
0: 500MHz
1: 2400MHz
OD_MCLK:
1: 1005MHz
OD_VDDGFX_OFFSET:
-120mV
FORCE_PERF_LEVEL: manual

Since OD_VDDGFX_OFFSET is a single value like FORCE_PERF_LEVEL and FORCE_POWER_CAP I expected to put the value on the same line, but it only worked below it. Not sure if this was intentional or not.

As for the other issue I mentioned earlier, this is what the output of my pp_od_clk_voltage file is:

OD_SCLK:
0: 500Mhz
1: 2400Mhz
OD_MCLK:
0: 97Mhz
1: 1005MHz
OD_VDDGFX_OFFSET:
-120mV
OD_RANGE:
SCLK:     500Mhz       2800Mhz
MCLK:     674Mhz       1075Mhz

As you can see, state 0 for OD_MCLK is 97MHz, which is pretty strange since I can't set it below the first MCLK value under OD_RANGE. This is what happens when I run amdgpu-clocks with no OD_MCLK values in my /etc/default/amdgpu-custom-state-overclock.card0 file:

[lilwebsite@fester amdgpu-clocks]$ sudo ./amdgpu-clocks
Won't write initial state to /tmp/amdgpu-custom-state-overclock.card0.initial, it already exists.
Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
  SCLK state 0: 500Mhz
  SCLK state 1: 2400Mhz
  MCLK state 0: 97Mhz
  MCLK state 1: 1005MHz
  VDD GFX Offset: -120mV
  Maximum clocks & voltages:
    SCLK clock 2800Mhz
    MCLK clock 1075Mhz
  Curent power cap: 272W
Verifying user state values at /etc/default/amdgpu-custom-state-overclock.card0:
  SCLK state 0: 500MHz
  SCLK state 1: 2400MHz
  VDD GFX Offset: -120mV
  Force performance level to manual
Committing custom states to /sys/class/drm/card0/device/pp_od_clk_voltage:
setting s 0 500
setting s 1 2400
setting m 0 97
./amdgpu-clocks: line 156: echo: write error: Invalid argument
setting m 1 1005
  Done

[lilwebsite@fester ~]$ sudo dmesg | grep amdgpu
...
[ 1850.185025] amdgpu 0000:2b:00.0: amdgpu: OD setting (6, 97) is less than the minimum allowed (674)

I modified your script to output each setting before it is written, so it is a bit easier to see what was causing the issue, but I included the dmesg output anyway. It seems like the script is reading from OD_MCLK to get the initial state, when it needs to read from OD_RANGE. I am not so sure about this though. If I set my OD_MCLK to 674MHz I get insane artifacts, in fact, pretty much every value I have entered in so far seems to cause it. I set it to a generous 1000MHz and that seems to cause artifacting as well, so I am not quite sure what's going on here. This is likely another issue entirely, and the values reported above might be specific to my card, so when I get time I will probably look into this myself, but I thought it was something you should be aware of.

For now though, things are looking good. If you want me to create a separate issue for the memory clock I can do that and work more on this tomorrow.

sibradzic commented 3 years ago

Since OD_VDDGFX_OFFSET is a single value like FORCE_PERF_LEVEL and FORCE_POWER_CAP I expected to put the value on the same line, but it only worked below it. Not sure if this was intentional or not.

It's intentional, because it is the very same "format" enforced by /sys/class/drm/card0/device/pp_od_clk_voltage. IMHO it is ugly, because as you said, it is just a single-value param, but since it is what the kernel devs thought made sense, don't blame me :)
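
To make the layout concrete, here is a hedged Python sketch that turns a custom states file in this format into the strings that would be written to pp_od_clk_voltage. The `s <state> <MHz>` and `m <state> <MHz>` forms match the "setting ..." lines in the debug output above. The `vo <offset>` and `c` (commit) commands are my understanding of the kernel's amdgpu documentation and should be treated as assumptions here, as should the helper name `build_commands`. FORCE_* lines are skipped because they are not part of pp_od_clk_voltage.

```python
import re

# Section header -> command prefix, as seen in the "setting s 0 500" output.
PREFIX = {"OD_SCLK": "s", "OD_MCLK": "m"}

CONFIG = """\
OD_SCLK:
0: 500MHz
1: 2400MHz
OD_MCLK:
1: 1005MHz
OD_VDDGFX_OFFSET:
-120mV
FORCE_PERF_LEVEL: manual
"""


def build_commands(config_text):
    """Translate a custom states file into pp_od_clk_voltage commands (sketch)."""
    commands = []
    section = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("FORCE_"):
            continue  # FORCE_* settings are handled elsewhere
        if line.endswith(":"):
            section = line[:-1]
            continue
        if section in PREFIX:
            m = re.fullmatch(r"(\d+):\s*(\d+)mhz", line, re.IGNORECASE)
            if not m:
                raise ValueError(f"Bad clock state: {line!r}")
            commands.append(f"{PREFIX[section]} {m.group(1)} {m.group(2)}")
        elif section == "OD_VDDGFX_OFFSET":
            # The single value sits on its own line, below the header.
            m = re.fullmatch(r"(-?\d+)mv", line, re.IGNORECASE)
            if not m:
                raise ValueError(f"Bad voltage offset: {line!r}")
            commands.append(f"vo {m.group(1)}")  # assumed command name
        else:
            raise ValueError(f"Unexpected line: {line!r}")
    commands.append("c")  # assumed commit command
    return commands


if __name__ == "__main__":
    for command in build_commands(CONFIG):
        print(command)
```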

It seems like the script is reading from OD_MCLK to get the initial state, when it needs to read from OD_RANGE. I am not so sure about this though. If I set my OD_MCLK to 674MHz I get insane artifacts, in fact, pretty much every value I have entered in so far seems to cause it. I set it to a generous 1000MHz and that seems to cause artifacting as well, so I am not quite sure what's going on here.

Yes, I've seen this before. You are doing nothing wrong, and the script is doing the totally expected thing as well: setting the same value that the original state 0 indicates. What likely happens here is that these new cards' drivers still depend on somewhat weird firmware blobs, where such inconsistencies exist. The default lowest state MEM clock is set to 97MHz, but the lowest limit enforced in vBIOS/firmware seems to be 674Mhz, which is an obvious inconsistency. I had something similar on my RX5700, and I could "fix" the thing by setting a lower limit with upp.

So, when the script "fails" to set state 0, nothing bad really happens: the state will not change from the default. Perhaps the best course of action is just to ignore this thing altogether and hope AMD will fix their firmware eventually. But if this continues to happen on new and future cards, I will consider validating user changes against limits and ignoring these inconsistent state 0 changes altogether in the script...
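
As a rough idea of what such validation could look like, here is a Python sketch (the script itself is bash, and this is not its code). It splits the states into those inside the OD_RANGE limits and those outside, so the out-of-range ones can be skipped instead of triggering a write error. The helper name `split_by_limits` is hypothetical; the numbers come from the output earlier in this thread.

```python
def split_by_limits(states, limits):
    """Separate clock states (MHz) into accepted and skipped by OD_RANGE limits.

    states: dict of state index -> clock in MHz
    limits: (min_mhz, max_mhz) tuple taken from the OD_RANGE section
    """
    low, high = limits
    accepted, skipped = {}, {}
    for index, mhz in states.items():
        if low <= mhz <= high:
            accepted[index] = mhz
        else:
            skipped[index] = mhz
    return accepted, skipped


if __name__ == "__main__":
    # MCLK values from the thread: state 0 defaults to 97MHz,
    # but OD_RANGE only allows 674..1075MHz.
    accepted, skipped = split_by_limits({0: 97, 1: 1005}, (674, 1075))
    print("write:", accepted)
    print("skip: ", skipped)
```

With the values above, MCLK state 0 (97MHz) would be skipped and state 1 (1005MHz) written, which avoids the "less than the minimum allowed (674)" message in dmesg.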

lilwebsite commented 3 years ago

It's intentional, because it is the very same "format" enforced by /sys/class/drm/card0/device/pp_od_clk_voltage. IMHO it is ugly, because as you said, it is just a single-value param, but since it is what the kernel devs thought made sense, don't blame me :)

Sorry, I did not mean to blame you but just thought it looked strange. I see now that the output from pp_od_clk_voltage matches the configuration and that fact just went over my head, whoops!

What likely happens here is that these new cards' drivers still depend on somewhat weird firmware blobs, where such inconsistencies exist. The default lowest state MEM clock is set to 97MHz, but the lowest limit enforced in vBIOS/firmware seems to be 674Mhz, which is an obvious inconsistency.

This is what I was thinking as well. For some reason I am not surprised that there is some firmware weirdness going on. Honestly, despite the error, it is a bit nice that it is not actually setting state 0 for memory, since I would have to reboot my PC if it did. Maybe ignoring state 0 for MCLK and SCLK unless the user specifies otherwise in the config is the best way to go with this, but that is not up to me; it works either way.

I had something similar on my RX5700, and I could "fix" the thing by setting a lower limit with upp.

I took a peek at upp last night, but I didn't have a lot of time to figure out which values I should be setting. I think I can see what I need to edit (I get a lot of values when I dump the table) but I'm probably going to have to do so on a weekend when I have more time to play around with things. Seems like a good idea though, if there's a way to fix state 0 for my card it would be nice so everything is consistent and so I don't get errors.

sibradzic commented 3 years ago

I took a peek at upp last night, but I didn't have a lot of time to figure out which values I should be setting.

Yes, the number of values and the lack of descriptions are a bit overwhelming. The limits are set by /overdrive_table/{min,max}/n. For example, I can set these on my RX5700:

sudo -E env "PATH=$PATH" upp set --write \
  overdrive_table/min/1=700   \
  overdrive_table/max/0=1650  \
  overdrive_table/min/8=650   \
  overdrive_table/max/8=800

to get this in /sys/class/drm/card0/device/pp_od_clk_voltage:

OD_RANGE:
SCLK:     700Mhz       1650Mhz
MCLK:     650Mhz        800Mhz

You may need to experiment a bit with setting min/max values; check for matching clock values in the whole min/max table. IIRC I also had to change /overdrive_table/min/{3,5,7} at some point in order for pp_od_clk_voltage to accept smaller lowest-state values (dmesg will warn you if you can't go beyond a limit). The voltages are represented by some weird 4x millivolt values in the pp table, so for example, if you see a value like MinVoltageGfx: 3300, it actually represents 3300/4 = 825mV :roll_eyes:
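
The quarter-millivolt conversion is easy to get backwards, so here is a tiny Python sketch of it. The helper names are made up for this example, and it assumes the factor of 4 applies to whichever voltage field you are reading; the thread only confirms it for MinVoltageGfx.

```python
def pp_raw_to_mv(raw):
    """Convert a raw pp table voltage value (units of 1/4 mV) to millivolts."""
    return raw / 4


def mv_to_pp_raw(millivolts):
    """Convert millivolts to the raw pp table representation (units of 1/4 mV)."""
    return round(millivolts * 4)


if __name__ == "__main__":
    print(pp_raw_to_mv(3300))   # the MinVoltageGfx example from the thread
    print(mv_to_pp_raw(825))
```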

sibradzic commented 3 years ago

So, I was gonna merge the OD_VDDGFX_OFFSET branch as is into master. Are you OK with that?

lilwebsite commented 3 years ago

So, I was gonna merge the OD_VDDGFX_OFFSET branch as is into master. Are you OK with that?

Yes, sounds good. I'll report back on the overdrive table stuff at a later date, since OD_VDDGFX_OFFSET works and I haven't had much more time to mess around beyond just getting it working.

loudan-arc commented 1 year ago

It appears that with the RX 6000 series, only offsetting the possible max voltage is the way to go, even now? I do not see an equivalent way to use the VDDC curve, unlike with the RX 5700 XT.