ilya-zlobintsev / LACT

Linux AMDGPU Configuration Tool
MIT License

PwrCap - does not stay static #194

Open saki2fifty opened 1 year ago

saki2fifty commented 1 year ago

[screenshot: LACT power cap settings]

When I set all devices to a max power cap of any value (100 in this screenshot), the setting is ignored over time and the power cap adjusts on its own.

I loop through each device; the per-device commands are:

Exact PowerShell:

$gpuList = ((echo '{"command": "list_devices"}' | ncat -U /run/lactd.sock | ConvertFrom-Json).data) | Where-Object { $_.name -notmatch 'HD Graphics' }

$powerCap = "100"
$performanceLevel = "manual"
$max_memory_clock = "1900"
$max_core_clock = "1100"

$gpuStats = @()

foreach ($gpu in $gpuList) {

    Write-Host "Setting gpu: $($gpu.id) - Memory: $($max_memory_clock) - Core: $($max_core_clock)" -ForegroundColor Green

    Invoke-Expression -Command "echo '{""command"": ""set_performance_level"", ""args"": {""id"": ""$($gpu.id)"", ""performance_level"": ""$($performanceLevel)""}}' | ncat -U /run/lactd.sock"
    Invoke-Expression -Command "echo '{""command"": ""confirm_pending_config"", ""args"": {""command"": ""confirm""}}' | ncat -U /run/lactd.sock"
    Invoke-Expression -Command "echo '{""command"": ""set_clocks_value"", ""args"": {""id"": ""$($gpu.id)"", ""command"": {""type"": ""max_memory_clock"", ""value"": $($max_memory_clock)}}}' | ncat -U /run/lactd.sock"
    Invoke-Expression -Command "echo '{""command"": ""confirm_pending_config"", ""args"": {""command"": ""confirm""}}' | ncat -U /run/lactd.sock"
    Invoke-Expression -Command "echo '{""command"": ""set_clocks_value"", ""args"": {""id"": ""$($gpu.id)"", ""command"": {""type"": ""max_core_clock"", ""value"": $($max_core_clock)}}}' | ncat -U /run/lactd.sock"
    Invoke-Expression -Command "echo '{""command"": ""confirm_pending_config"", ""args"": {""command"": ""confirm""}}' | ncat -U /run/lactd.sock"
    Invoke-Expression -Command "echo '{""command"": ""set_power_cap"", ""args"": {""id"": ""$($gpu.id)"", ""cap"": $($powerCap)}}' | ncat -U /run/lactd.sock"
    Invoke-Expression -Command "echo '{""command"": ""confirm_pending_config"", ""args"": {""command"": ""confirm""}}' | ncat -U /run/lactd.sock"

}

foreach ($gpu in $gpuList) {

    # Query stats for the current GPU in the loop
    $deviceStats = (echo "{""command"": ""device_stats"", ""args"": {""id"": ""$($gpu.id)""}}" | ncat -U /run/lactd.sock | ConvertFrom-Json).data

    # Add the GPU statistics to the array
    $gpuStats += $deviceStats

    # Add custom properties to the last element of the array
    $gpuStats[-1] | Add-Member -MemberType NoteProperty -Name "GPU_ID" -Value $gpu.id
    $gpuStats[-1] | Add-Member -MemberType NoteProperty -Name "GPU_Type" -Value $gpu.name

    Start-Sleep -Seconds 1
}
saki2fifty commented 1 year ago

I have many rigs and this happens across all of them. RX 580s in the screenshot.

ilya-zlobintsev commented 1 year ago

Can you provide more info on when the power limit resets? How long does it take for it to reset? Does anything else (clocks, performance level) reset? Is anything printed in system logs (dmesg)?

This could be solved by performing a check every few minutes that compares the actual settings of the gpu to the previously applied ones, though this is not an ideal solution.
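A minimal sketch of such a watchdog, as a shell script rather than LACT's own code. The command names (`device_stats`, `set_power_cap`, `confirm_pending_config`) come from the PowerShell above; the `.data.power_cap` response field, the example GPU ID, and the five-minute interval are assumptions:

```shell
#!/bin/sh
# Watchdog sketch: re-apply the power cap when the daemon reports a
# different value than the one we last set. Assumes jq and ncat are
# installed and lactd is listening on /run/lactd.sock.
SOCK=/run/lactd.sock

# Build a set_power_cap request for a given GPU id and cap (watts).
req_power_cap() {
    printf '{"command": "set_power_cap", "args": {"id": "%s", "cap": %s}}' "$1" "$2"
}

# Re-apply the cap only when the reported value has drifted.
# NOTE: ".data.power_cap" is a guess at the device_stats response shape.
reapply_if_drifted() {
    gpu_id=$1; want=$2
    have=$(printf '{"command": "device_stats", "args": {"id": "%s"}}' "$gpu_id" \
        | ncat -U "$SOCK" | jq -r '.data.power_cap')
    if [ "$have" != "$want" ]; then
        req_power_cap "$gpu_id" "$want" | ncat -U "$SOCK"
        printf '{"command": "confirm_pending_config", "args": {"command": "confirm"}}' \
            | ncat -U "$SOCK"
    fi
}

# Example usage (GPU id is illustrative):
# while true; do reapply_if_drifted "1002:67DF-1682:C580-0000:0b:00.0" 100; sleep 300; done
```

As noted, this only papers over the reset rather than explaining it, but it mirrors the re-apply loop already in use on the rigs.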

Also, I have to say that I did not expect LACT to be used for mining rigs. Good to know that the API is useful though.

saki2fifty commented 1 year ago

Currently on a work call... but yeah, LACT is 100% useful. I use it instead of rocm-smi and it's my default oc/stat'er.

I have a check every few minutes to see if it has changed, and force it again.

I'd say they start resetting slowly after about 15 minutes, one by one, and within an hour all have reset.

But I'll check those logs and let you know.

pinbuck commented 8 months ago

I have this issue too: no matter which power cap I set, the GPU starts "power throttling" even before it reaches its rated 145 W maximum. I didn't get this on Linux Mint, but I do on Fedora 38/39 and Nobara 38, both with and without amdgpu.ppfeaturemask=0xffffffff or amdgpu.ppfeaturemask=0xfffd7fff. It may be worth testing older kernels where this is possibly a non-issue, to find the kernel version where it becomes a problem.
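To gather the data points relevant here (running kernel version, whether the ppfeaturemask override is actually active, and any amdgpu power messages), a small diagnostic sketch; the grep patterns are guesses at relevant amdgpu log lines, not known kernel output:

```shell
#!/bin/sh
# Diagnostic helpers for narrowing down a possible kernel-side regression.

# Extract an amdgpu.ppfeaturemask override from a kernel command line,
# if one is present (prints nothing when there is no override).
ppfeaturemask_from_cmdline() {
    printf '%s\n' "$1" | grep -o 'amdgpu\.ppfeaturemask=0x[0-9a-fA-F]*'
}

# Show power/thermal-related amdgpu messages from the kernel log.
# The pattern is a guess; adjust to whatever dmesg actually prints.
amdgpu_power_logs() {
    dmesg | grep -iE 'amdgpu.*(power|throttl|thermal|smu)'
}

# Example usage:
# uname -r
# ppfeaturemask_from_cmdline "$(cat /proc/cmdline)"
# amdgpu_power_logs
```

Comparing these outputs across the distros (and across older kernels) would show whether the override is applied at all and which kernel introduced the throttling.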

ilya-zlobintsev commented 8 months ago

@pinbuck this sounds like a kernel-side issue, you should report it on https://gitlab.freedesktop.org/drm/amd.