ilya-zlobintsev / LACT

Linux AMDGPU Configuration Tool
MIT License

investigate regarding drm/amd issue 3131 #329

Closed andrew-ld closed 2 months ago

andrew-ld commented 4 months ago

Hi, I am the author of the issue https://gitlab.freedesktop.org/drm/amd/-/issues/3131. I think the LACT developers should be aware of it, especially the last comments.

https://gitlab.freedesktop.org/drm/amd/-/issues/3131#note_2415553

ilya-zlobintsev commented 4 months ago

Interesting - there have already been issues with the order in which settings are applied, but LACT should handle what's described in the issue fine. The current order for applying settings is:

* Power cap
* Clocks table
* Performance level
* Fan curve

Code that handles this: https://github.com/ilya-zlobintsev/LACT/blob/master/lact-daemon/src/server/gpu_controller/mod.rs#L719

Did you manage to hit the issue when applying the settings in LACT, or are you just informing about its existence?

andrew-ld commented 4 months ago

I opened this issue to keep track of the status of things. However, I actually can't even change the fan speed on my Sapphire 7900 XTX.

For example, I've tried setting the fans to full speed, both on the curve and with a static speed, and nothing seems to happen.

ilya-zlobintsev commented 4 months ago

Is this the case only when you set the fan speed using lact, or when manually writing to the sysfs (like the examples in the linked issue) as well?

andrew-ld commented 4 months ago

lact

zenofile commented 3 months ago

> Interesting - there have already been issues with the order in which settings are applied, but lact should handle what's described in the issue fine. The current order for apply settings is:
>
> * Power cap
> * Clocks table
> * Performance level
> * Fan curve
>
> Code that handles this: https://github.com/ilya-zlobintsev/LACT/blob/master/lact-daemon/src/server/gpu_controller/mod.rs#L719 Did you manage to hit the issue when applying the settings in lact, or are you just informing about its existence?

When I write anything to /sys/class/drm/card?/device/gpu_od/fan_ctrl/{acoustic_limit_rpm_threshold,acoustic_target_rpm_threshold,fan_minimum_pwm,fan_target_temperature}, everything set via pp_od_clk_voltage gets ignored by the GPU, no matter in which order they are set. So it is not possible to alter, for example, fan_minimum_pwm when also setting clock speeds. fan_curve seems to be the exception in my limited testing when set before altering pp_od_clk_voltage.

Doing things manually, it is possible to set clock speeds, voltage offset, and fan curve, but I am unable to do so in LACT without it getting ignored by the GPU since LACT seems to always restore the serialized values for those aforementioned settings even if they are default values.

https://github.com/ilya-zlobintsev/LACT/blob/0d675c5b3a09be4f5fdcbc441b618cea7158d79f/lact-daemon/src/server/gpu_controller/mod.rs#L817-L841

Though I am aware this is primarily a driver issue, it would be nice to have a way to not write to all fan_ctrl/* sysfs files when applying other settings/launching lactd.

Sapphire NITRO+ RX 7900 XTX Vapor-X
Kernel 6.10.0-0.rc3.20240612git2ef5971ff345.33

ilya-zlobintsev commented 3 months ago

Makes sense, we can at least avoid writing to the files if the value is unchanged.
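The idea of skipping unchanged values can be sketched in shell as follows. This is only an illustration, not the check that LACT implements: `write_if_changed` is a hypothetical helper, and it assumes the file reads back as the bare value, which the real `fan_ctrl` files may not do (they can report more than the value itself, so a real check would have to parse that output or compare against a cached value).

```shell
# write_if_changed FILE VALUE (hypothetical helper)
# Writes VALUE to FILE only when FILE does not already read back as VALUE.
# Assumption: FILE contains just the value; real sysfs files may differ.
write_if_changed() {
    local file="$1"
    local value="$2"
    local current
    # A missing or unreadable file is treated as "no current value".
    current=$(cat "$file" 2>/dev/null) || current=""
    if [ "$current" = "$value" ]; then
        echo "skip $file"
        return 0
    fi
    printf '%s\n' "$value" > "$file"
    echo "write $file"
}
```

Calling `write_if_changed "$fan/fan_minimum_pwm" 25` twice in a row would then touch the file only the first time, under the assumption stated above.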

ilya-zlobintsev commented 3 months ago

@zenofile I've added checks for this in https://github.com/ilya-zlobintsev/LACT/commit/ca3e54015a39f7cc0c840643def5e642ef8ef101, could you test if it helps?

zenofile commented 3 months ago

Thanks for looking into this. When the Automatic fan mode is enabled with default values, it seems to work as intended. However, when Curve is active, even with default values, it doesn't seem to work.

* Thermals → Automatic, default values
* OC → Basic → Clocks + Voltage offset altered → Apply

⇒ OC Values are applied and working, however fan_curve is still written to (reset?).

inotify event list (each inotify event report is from a single application of said values):

# inotifywait -r -m -e modify .
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve

* Thermals → Curve, default values
* OC → Basic → Clocks + Voltage offset altered → Apply

⇒ OC values are ignored by the GPU, fan_curve is written to last.

inotify event list:

# inotifywait -r -m -e modify .
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve

But this is better than it was before; now when lactd is restarted at least clockspeed and voltage values are respected in unaltered automatic fan mode (default).
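Long inotify logs like the two above are easier to compare once tallied per file. The following is a small sketch for that purpose; `tally_events` is a hypothetical helper name, and it assumes the default `inotifywait` line layout shown above (directory, event name, file name).

```shell
# tally_events (hypothetical helper): read inotifywait lines
# ("DIR EVENT FILE") on stdin and print "COUNT PATH" per modified file,
# sorted by path. Lines that are not MODIFY events are ignored.
tally_events() {
    awk '$2 == "MODIFY" { count[$1 $3]++ }
         END { for (f in count) print count[f], f }' |
        LC_ALL=C sort -k2
}
```

It could be fed directly from the watcher, for example `inotifywait -r -m -e modify . | tally_events`, with the counts printed once the watcher is interrupted and its output stream ends.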

zenofile commented 3 months ago

I tried experimenting with the order a little: if any values are written into pp_od_clk_voltage after the fan values are committed, the OC settings get ignored by the GPU. The actual committing can be done in any order though. So as long as everything is committed only at the end, after all values are written, it works fine. Maybe this was clear from the beginning, but I did not find any documentation mentioning this.

Also, resets can be issued on fan_curve, acoustic_limit_rpm_threshold and acoustic_target_rpm_threshold. But after any reset on fan_minimum_pwm or fan_target_temperature once pp_od_clk_voltage was committed, the OC settings get ignored again 🤷🏻.

For example, this works fine:

gpu=card1
device=/sys/class/drm/${gpu}/device
fan=${device}/gpu_od/fan_ctrl

# Reset the fan-related values first.
echo 'r' > "$fan/fan_target_temperature"
echo 'r' > "$fan/acoustic_target_rpm_threshold"
echo 'r' > "$fan/acoustic_limit_rpm_threshold"
echo 'r' > "$fan/fan_minimum_pwm"

sleep 0.25s

echo 'auto' > "$device/power_dpm_force_performance_level"

# Write the fan values and the OC values without committing yet.
echo '25' > "$fan/fan_minimum_pwm"
echo '75' > "$fan/fan_target_temperature"

echo 's 1 2525' > "$device/pp_od_clk_voltage"
echo 'vo -100' > "$device/pp_od_clk_voltage"

# Commit only at the end, after everything is written.
echo 'c' > "$fan/fan_minimum_pwm"
echo 'c' > "$fan/fan_target_temperature"
echo 'c' > "$device/pp_od_clk_voltage"

ilya-zlobintsev commented 3 months ago

Interesting. Currently the values are committed right away; I'll see if I can defer the commit until everything is written.
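As a rough illustration of what deferring could look like, here is a shell sketch that writes all values first and then commits each touched file once. It is not LACT's implementation: `apply_deferred` is a hypothetical helper, and it assumes that the paths contain no whitespace or glob characters.

```shell
# apply_deferred FILE VALUE [FILE VALUE ...] (hypothetical helper)
# Writes every VALUE to its FILE first, then writes 'c' (commit) once to
# each distinct FILE, in the order the files were first seen.
# Assumption: paths contain no whitespace or glob characters.
apply_deferred() {
    local files=""
    local f
    while [ "$#" -ge 2 ]; do
        printf '%s\n' "$2" > "$1"
        echo "stage $1 <- $2"
        # Remember each file only once for the commit phase.
        case " $files " in
            *" $1 "*) ;;
            *) files="$files $1" ;;
        esac
        shift 2
    done
    for f in $files; do
        printf 'c\n' > "$f"
        echo "commit $f"
    done
}
```

With the variables from the manual script above, a call could look like `apply_deferred "$fan/fan_minimum_pwm" 25 "$device/pp_od_clk_voltage" 's 1 2525' "$device/pp_od_clk_voltage" 'vo -100'`.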

ilya-zlobintsev commented 3 months ago

@zenofile I've pushed the new logic, where everything is committed at once, to the deferred-commit branch. Could you test if it works?

zenofile commented 3 months ago

Unfortunately the OD values get ignored.

Some data from launching the LACT daemon; all relevant GPU settings were reset manually beforehand (though it makes no difference if they are not):

{
  "initramfs_type": "Dracut",
  "system_info": {
    "amdgpu_overdrive_enabled": true,
    "commit": "8638d24",
    "kernel_version": "6.10.0-0.rc3.20240612git2ef5971ff345.36.local.fc40.x86_64",
    "profile": "release",
    "version": "0.5.5"
  }
}

daemon:
  log_level: debug
  admin_groups:
  - wheel
  - sudo
  disable_clocks_cleanup: false
apply_settings_timer: 5
gpus:
  xxx-0000:03:00.0:
    fan_control_enabled: false
    fan_control_settings:
      mode: curve
      static_speed: 0.5
      temperature_key: edge
      interval_ms: 500
      curve:
        40: 0.15
        50: 0.29999998
        60: 0.45
        70: 0.65
        80: 0.9
      spindown_delay_ms: 0
      change_threshold: 0
    pmfw_options:
      acoustic_limit: 3200
      acoustic_target: 1450
      minimum_pwm: 25
      target_temperature: 75
    performance_level: auto
    max_core_clock: 2525
    voltage_offset: -100
    power_states: {}

DEBUG lact_daemon: current system uptime: 3162.4s
 INFO lact_daemon::socket: listening on "/var/run/lactd.sock"
DEBUG lact_daemon::server::handler: initialized GPU controller xxx-0000:03:00.0 for path "/sys/class/drm/card1/device"
DEBUG lact_daemon::server::handler: found intialized drm entry for device "/sys/bus/pci/devices/0000:03:00.0"
 INFO lact_daemon::server::handler: initialized 1 GPUs
DEBUG lact_daemon::server::gpu_controller: writing clocks commands: [
    "s 1 2525",
    "vo -100",
]

./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm

When altering fan and clock settings in the GUI and applying, the values are ignored as well and the inotify event list is quite extensive.

inotify events:

```
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
```

It would help to see what is actually written to sysfs by the daemon. Is there a logging setting I can enable? The debug level seems to only print clock speed settings.
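Independently of any daemon logging, the writes can be observed from outside by tracing the daemon's write syscalls, for example with `strace -f -y -e trace=write -p "$(pidof lactd)"`, where `-y` makes strace print the path behind each file descriptor. The sketch below condenses such lines into a short "file <- value" list. `summarize_strace_writes` is a hypothetical helper, and it assumes that each write is traced on a single line and that the payload contains no quotes or escapes other than a trailing `\n`.

```shell
# summarize_strace_writes (hypothetical helper): read `strace -y` write
# lines on stdin and print "<file name> <- <payload>" for each of them.
# Assumption: one write per line, payload has no quotes or escapes apart
# from an optional trailing "\n"; other lines are dropped.
summarize_strace_writes() {
    sed -n 's|^.*write([0-9]*<.*/\([^/>]*\)>, "\([^"\\]*\)\(\\n\)\{0,1\}", [0-9]*) = [0-9]*$|\1 <- \2|p'
}
```

Piping a saved trace through it, for example `summarize_strace_writes < trace.log` (the file name is only a placeholder), gives one short line per write, which makes the order easy to replay by hand.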

zenofile commented 3 months ago

I straced the writes and replayed them manually in that order. The culprit is the reset on fan_curve (marked with ** below); somehow, in this example, it causes issues. When it is left out, moved after the writes to fan_target_temperature and fan_minimum_pwm, or moved before the writes to pp_od_clk_voltage, it seems to work fine. What a mess.

write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "r\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/power_dpm_force_performance_level>, "auto", 4) = 4
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "s 1 2525\n", 9) = 9
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "vo -100\n", 8) = 8
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/power_dpm_force_performance_level>, "auto", 4) = 4
** write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_curve>, "r\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_target_temperature>, "76\n", 3) = 3
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_minimum_pwm>, "26\n", 3) = 3
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "c\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_target_temperature>, "c\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_minimum_pwm>, "c\n", 2) = 2

ilya-zlobintsev commented 3 months ago

I've pushed a commit to reset the fan curve after writing other pmfw values, please tell me if it helps. And thanks for the detailed debug - it's unfortunate that this is so fragile.
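For reference, one reading of that reordering, replayed manually, could look like the sketch below. It reuses the values from the strace log above and is not taken from the LACT commit. `sysfs_write` and `replay_reordered` are hypothetical helpers, and placing the fan_curve reset before the commits (rather than after them) is an assumption, since the thread does not spell that out.

```shell
# sysfs_write FILE VALUE (hypothetical helper): write VALUE and log the step.
sysfs_write() {
    printf '%s\n' "$2" > "$1"
    echo "${1##*/} <- $2"
}

# replay_reordered DEVICE (hypothetical helper): the traced write sequence
# with the fan_curve reset moved after the fan_target_temperature and
# fan_minimum_pwm writes. DEVICE is a directory laid out like
# /sys/class/drm/card1/device.
# Assumption: the reset goes before the commits; the thread does not say.
replay_reordered() {
    local device="$1"
    local fan="$device/gpu_od/fan_ctrl"
    sysfs_write "$device/pp_od_clk_voltage" 'r'
    sysfs_write "$device/power_dpm_force_performance_level" 'auto'
    sysfs_write "$device/pp_od_clk_voltage" 's 1 2525'
    sysfs_write "$device/pp_od_clk_voltage" 'vo -100'
    sysfs_write "$device/power_dpm_force_performance_level" 'auto'
    sysfs_write "$fan/fan_target_temperature" '76'
    sysfs_write "$fan/fan_minimum_pwm" '26'
    sysfs_write "$fan/fan_curve" 'r'   # moved: was before the two writes above
    sysfs_write "$device/pp_od_clk_voltage" 'c'
    sysfs_write "$fan/fan_target_temperature" 'c'
    sysfs_write "$fan/fan_minimum_pwm" 'c'
}
```

Running `replay_reordered /sys/class/drm/card1/device` as root would replay the sequence on the card used in this thread; the printed log shows the order that was actually written.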

zenofile commented 3 months ago

It works, both when restarting the daemon and when altering and applying settings via the GUI without a daemon restart.

ilya-zlobintsev commented 3 months ago

Good to know, I will merge these changes then.

ilya-zlobintsev commented 3 months ago

@andrew-ld could you check if this also solves the problem for you?

ilya-zlobintsev commented 2 months ago

Closing as this has been implemented and released.