sibradzic / amdgpu-clocks

Simple script to control power states of amdgpu driven GPUs
GNU General Public License v2.0

/usr/local/bin/amdgpu-clocks: line 154: echo: write error: Invalid argument #40

Closed GreatBigWhiteWorld closed 2 years ago

GreatBigWhiteWorld commented 2 years ago

Hi, any idea what's causing this? I have a very simple profile which only forces the SCLK state and the performance level.

sudo USER_STATES_PATH=custom-states amdgpu-clocks
Writen initial backup states to /tmp/custom-states.card0.initial
Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
  SCLK state 0: 700Mhz
  SCLK state 1: 2649Mhz
  MCLK state 0: 97Mhz
  MCLK state 1: 1000MHz
  VDD GFX Offset: 0mV
  Maximum clocks & voltages:
    SCLK clock 3150Mhz
    MCLK clock 1200Mhz
  Curent power cap: 135W
Verifying user state values at custom-states.card0:
  Force SCLK state to 1 2 3 4
  Force performance level to manual
Committing custom states to /sys/class/drm/card0/device/pp_od_clk_voltage:
/usr/local/bin/amdgpu-clocks: line 154: echo: write error: Invalid argument
  Done

Two more unrelated questions (let me know if I should start a new issue for these)

  1. Is setting the performance level to 'manual' mandatory when using this script? I have a poor understanding of what 'manual' really means...
  2. Can you elaborate on using the 'restore' parameter with this tool? I tried 'amdgpu-clocks restore' and got errors about custom states not being set up in /etc/default/custome-states.card0, since I store my profiles elsewhere.

itspngu commented 2 years ago

I'm having the exact same problem. Can you share your system info? The script is trying to write a value that the driver rejects, in my case 97MHz for MCLK state 0; the second value of 1000MHz for state 1 gets a pass. I'm testing with an empty /etc/default/amdgpu-custom-state.card0.

# cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2519Mhz
OD_MCLK:
0: 97Mhz                      <-- it's trying to write this back as a custom state
1: 1000MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       3000Mhz
MCLK:     674Mhz       1075Mhz

# cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 96Mhz                     <-- 96 instead of 97, but the driver also doesn't accept it
1: 456Mhz
2: 673Mhz
3: 1000Mhz *

Replacing (I just jammed it into the script to test) m 0 97 with m 0 674 works. My PC froze when I initially replaced it with m 0 1000 (alongside m 1 1000) so I'm hitting send on this comment before trying 673 and 456. :')

Edit: It rejects 673MHz, as well as - seemingly - any value below 674MHz. It seems what's listed above in the OD_RANGE lines applies, but the script assumes the valid range for custom/overdrive pstates can be read from the list of current states. Now I need to figure out why my current states are 97 and 1000, if 97 clearly isn't allowed by the driver - that'd explain why it doesn't seem to reduce memory clock in idle (it's stuck at 1000, which is a separate problem unrelated to this tool).
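
The point above (the valid overdrive limits come from the OD_RANGE lines, not from the list of current states) can be sketched in shell. This is only an illustration under assumptions: `od_range` and `clamp_clock` are hypothetical helper names, not functions from amdgpu-clocks, and the parsing assumes the pp_od_clk_voltage layout pasted in this comment.

```shell
# Hypothetical helpers, not part of amdgpu-clocks.

# Print "<min> <max>" (in MHz) for the given clock name (SCLK or MCLK),
# taken from the OD_RANGE section of pp_od_clk_voltage text read on stdin.
od_range() {
  awk -v clk="$1:" '
    /^OD_RANGE:/ { in_range = 1; next }
    in_range && $1 == clk {
      gsub(/[Mm][Hh][Zz]/, "", $2)
      gsub(/[Mm][Hh][Zz]/, "", $3)
      print $2, $3
      exit
    }
  '
}

# Clamp a requested clock to the allowed range.
# $1 = requested MHz, $2 = minimum MHz, $3 = maximum MHz
clamp_clock() {
  if [ "$1" -lt "$2" ]; then
    echo "$2"
  elif [ "$1" -gt "$3" ]; then
    echo "$3"
  else
    echo "$1"
  fi
}

# Example on real hardware (path as used in this thread):
#   set -- $(od_range MCLK < /sys/class/drm/card0/device/pp_od_clk_voltage)
#   clamp_clock 97 "$1" "$2"
```

With the OD_RANGE shown above (MCLK 674Mhz to 1075Mhz), a requested 97 would be raised to 674, which is the value that was accepted when written by hand.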

Edit 2: The system journal is pretty direct about it, I should have checked this first...

kernel: amdgpu 0000:0b:00.0: amdgpu: OD setting (6, 456) is less than the minimum allowed (674)
System info:

```
# inxi -b
System:
  Host: desk.local Kernel: 5.16.18-200.fc35.x86_64 arch: x86_64 bits: 64
  Desktop: sway v: 1.6.1 Distro: Fedora release 35 (Thirty Five)
Machine:
  Type: Desktop System: Gigabyte product: X570 AORUS PRO v: -CF serial: N/A
  Mobo: Gigabyte model: X570 AORUS PRO serial: N/A
  UEFI: American Megatrends LLC. v: F35 date: 01/04/2022
CPU:
  Info: 16-core AMD Ryzen 9 3950X [MT MCP] speed (MHz): avg: 2240 min/max: 2200/4761
Graphics:
  Device-1: AMD Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] driver: amdgpu v: kernel
  Display: server: X.org v: 1.20.14 driver: gpu: amdgpu note: X driver n/a
  resolution: 1: 1920x1080 2: 1920x1080
  Message: GL data unavailable for root.
Network:
  Device-1: Intel I211 Gigabit Network driver: igb
  Device-2: Intel Wi-Fi 6 AX200 driver: iwlwifi
Drives:
  Local Storage: total: 953.87 GiB used: 893.77 GiB (93.7%)
Info:
  Processes: 468 Uptime: 12m Memory: 31.32 GiB used: 2.07 GiB (6.6%) Shell: Bash inxi: 3.3.14

$ glxinfo -B
name of display: :0
display: :0 screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 6900 XT (SIENNA_CICHLID, DRM 3.44.0, 5.16.18-200.fc35.x86_64, LLVM 13.0.0) (0x73bf)
    Version: 21.3.8
    Accelerated: yes
    Video memory: 16384MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 15936 MB, largest block: 15936 MB
    VBO free aux. memory - total: 16298 MB, largest block: 16298 MB
    Texture free memory - total: 15936 MB, largest block: 15936 MB
    Texture free aux. memory - total: 16298 MB, largest block: 16298 MB
    Renderbuffer free memory - total: 15936 MB, largest block: 15936 MB
    Renderbuffer free aux. memory - total: 16298 MB, largest block: 16298 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 16384 MB
    Total available memory: 32752 MB
    Currently available dedicated video memory: 15936 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon RX 6900 XT (SIENNA_CICHLID, DRM 3.44.0, 5.16.18-200.fc35.x86_64, LLVM 13.0.0)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 21.3.8
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL version string: 4.6 (Compatibility Profile) Mesa 21.3.8
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 21.3.8
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
```

GreatBigWhiteWorld commented 2 years ago

I'm a noob, but I guess it's a bug in the script: when the custom-states profile doesn't specify a value, the script tries to apply one that is too low for the driver to accept.

I haven't tested it, but if anything up to 673 is rejected, my understanding is that the driver refuses any MCLK value at or below the second-highest state (673), perhaps because a lower value would make the machine unstable? So the way the script reads its default MCLK range might be problematic. The workaround for now is to give an MCLK value in the custom-states file, e.g.:

OD_MCLK:
1: 1000MHz

(the default highest value, if you don't care about it at all)


I just tested my solution, but setting MCLK to 950 or 1000 doesn't work for me either. It's the same line 154 error.

OD_MCLK:
1: 1000MHz

This machine has an RX 6600 XT card, but it uses the amdgpu driver from the official AMD repo rather than the one from the OS distro kernel/firmware, which did not work for some reason.

sibradzic commented 2 years ago

@GreatBigWhiteWorld

Hi, any idea what's causing this?

See https://github.com/sibradzic/amdgpu-clocks/issues/32

Two more unrelated questions

  1. See https://www.kernel.org/doc/html/latest/gpu/amdgpu/thermal.html#power-dpm-force-performance-level. The manual level is required to make certain other settings possible at all; details are in the link.
  2. When you run the script, you get something like Written initial backup states to /tmp/custom-states.card0.initial early in the output. The script attempts to store the "initial" state in a file, which the restore command should be able to restore. If you have an issue with the restore feature, please open another issue, with details on how you got there in the first place.
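
For reference, a minimal sketch of what switching to the manual level looks like at the sysfs level, using the power_dpm_force_performance_level file described in the kernel documentation linked above. `force_manual_level` is a hypothetical helper name, not something amdgpu-clocks provides, and the card0 path is just the one used in this thread.

```shell
# Hypothetical helper, not part of amdgpu-clocks.
# Switch a card to the "manual" performance level and print the value read back.
# $1 = device directory, e.g. /sys/class/drm/card0/device (needs root on real hardware)
force_manual_level() {
  level_file="$1/power_dpm_force_performance_level"
  if [ ! -w "$level_file" ]; then
    echo "cannot write $level_file" >&2
    return 1
  fi
  echo manual > "$level_file" || return 1
  cat "$level_file"
}

# Example on real hardware:
#   force_manual_level /sys/class/drm/card0/device
```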
sibradzic commented 2 years ago

@GreatBigWhiteWorld @itspngu

Since this is the 3rd+ time people have complained about this (essentially a GPU ROM/firmware issue), I've decided it would be better to add a simple workaround to the script to prevent this from happening in the first place. Gimme a few minutes...

sibradzic commented 2 years ago

@GreatBigWhiteWorld @itspngu

does the change I just pushed help?

itspngu commented 2 years ago

@GreatBigWhiteWorld @itspngu

does the change I just pushed help?

Yes sir! Thank you. Wasn't aware that this was a known issue.

# amdgpu-clocks
Writen initial backup states to /tmp/amdgpu-custom-state.card0.initial
Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
  SCLK state 0: 500Mhz
  SCLK state 1: 2514Mhz
  MCLK state 0: 97Mhz
  MCLK state 1: 1000MHz
  VDD GFX Offset: 0mV
  Maximum clocks & voltages:
    SCLK clock 3000Mhz
    MCLK clock 1075Mhz
  Curent power cap: 255W
Verifying user state values at /etc/default/amdgpu-custom-state.card0:
  SCLK state 0: 500Mhz
  SCLK state 1: 2519Mhz
  MCLK state 0: 674MHz
  MCLK state 1: 1000MHz
Committing custom states to /sys/class/drm/card0/device/pp_od_clk_voltage:
  Done
GreatBigWhiteWorld commented 2 years ago
  1. When you run the script, you get something like Written initial backup states to /tmp/custom-states.card0.initial early in the output. The script will attempt to store the "initial" state to a file, that restore command should be able to restore. If you have an issue with restore feature, please open another issue, with details on how do you get there in the first place.

I can confirm the new script works fine. Thanks!

Dumb question: as for the restore feature, can you tell me how to use it? I tried sudo amdgpu-clocks --restore and sudo amdgpu-clocks restore, but neither works. Did you mean I need to manually cd to the /tmp folder and run the script again using the stored file?

sibradzic commented 2 years ago

Just restore, as the one and only argument. Can't help you much unless you provide full output when you run the script, restore included, preferably in a separate issue...

lextra2 commented 2 years ago

674 MHz is the lowest value for the lowest memory power state as defined in the GPU BIOS. It makes sense that you cannot set it any lower. (And really, there shouldn't be any reason to do so.)

Cycatz commented 1 year ago

I also encountered this issue, and I fixed it by setting the performance level to 'manual' manually before running the script:

1. Should performance level set to 'manual' mandatory if I use this script? I have a poor understanding of what manually really means...

@sibradzic is this intentional?

Cycatz commented 1 year ago

I found that the root cause is related to https://github.com/sibradzic/amdgpu-clocks/issues/45. Even though the service did set the level to manual and adjust the frequency, the states seem to have been reverted afterwards.

Is there any way to start the service after the amdgpu driver has finished setting up?
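
One possible approach to the ordering question, offered only as a sketch: poll until the driver's sysfs file exists before running the script. `wait_for_path` is a hypothetical helper name, the 60-second timeout is arbitrary, and nothing in this thread confirms that this resolves the behaviour described in issue 45; the file appearing does not guarantee the driver will not reset the states later.

```shell
# Hypothetical helper, not part of amdgpu-clocks.
# Wait until a path exists, polling once per second.
# $1 = path to wait for, $2 = timeout in seconds (default 30)
# Returns 0 once the path exists, 1 if the timeout expires first.
wait_for_path() {
  remaining="${2:-30}"
  while [ ! -e "$1" ]; do
    [ "$remaining" -le 0 ] && return 1
    sleep 1
    remaining=$((remaining - 1))
  done
  return 0
}

# Example on real hardware:
#   wait_for_path /sys/class/drm/card0/device/pp_od_clk_voltage 60 && amdgpu-clocks
```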