sibradzic / upp

A tool for parsing, dumping and modifying data in Radeon PowerPlay tables
GNU General Public License v3.0

UPP with AMDGPU drivers using ROCM - AMDGPU > 21.50.2: OSError: [Errno 62] Timer expired #50

Open blackmennewstyle opened 3 weeks ago

blackmennewstyle commented 3 weeks ago

Hello beautiful dev(s),

I'm a long-time user of your nice piece of software, but I noticed that since the latest AMDGPU drivers, which use ROCM, upp no longer seems able to change anything related to memory.

I'm using Ubuntu 20.04.6 LTS (GNU/Linux 5.13.0-52-generic x86_64)

pip3 show upp
Name: upp
Version: 0.2.0
Summary: Uplift Power Play
Home-page: https://github.com/sibradzic/upp
Author: Samir Ibradžić
Author-email: None
License: None
Location: /usr/local/lib/python3.8/dist-packages
Requires: setuptools, click
Required-by:

The following command always fails with the traceback shown after its output:

sudo upp -p /sys/class/drm/card0/device/pp_table set --write smc_pptable/MinVoltageGfx=2400 smc_pptable/MaxVoltageGfx=2948 smc_pptable/MinVoltageSoc=2400 smc_pptable/MaxVoltageSoc=3980 smc_pptable/FreqTableSocclk/0=96 smc_pptable/FreqTableSocclk/1=980 smc_pptable/Mp0DpmVoltage/0=2800 smc_pptable/Mp0DpmVoltage/1=3160 smc_pptable/MemVddciVoltage/0=2700 smc_pptable/MemVddciVoltage/1=3360 smc_pptable/MemVddciVoltage/2=3360 smc_pptable/MemVddciVoltage/2=3360 smc_pptable/MemVddciVoltage/3=3360 smc_pptable/MemMvddVoltage/0=5000 smc_pptable/MemMvddVoltage/1=5360 smc_pptable/MemMvddVoltage/2=5360 smc_pptable/MemMvddVoltage/3=5360
Changing smc_pptable.MinVoltageGfx from 3094 to 2400 at 0x3a6
Changing smc_pptable.MaxVoltageGfx from 4600 to 2948 at 0x3aa
Changing smc_pptable.MinVoltageSoc from 3050 to 2400 at 0x3a8
Changing smc_pptable.MaxVoltageSoc from 4200 to 3980 at 0x3ac
Changing smc_pptable.FreqTableSocclk.0 from 418 to 96 at 0x56e
Changing smc_pptable.FreqTableSocclk.1 from 1280 to 980 at 0x570
Changing smc_pptable.Mp0DpmVoltage.0 from 2800 to 2800 at 0x666
Changing smc_pptable.Mp0DpmVoltage.1 from 3200 to 3160 at 0x668
Changing smc_pptable.MemVddciVoltage.0 from 2700 to 2700 at 0x66a
Changing smc_pptable.MemVddciVoltage.1 from 3400 to 3360 at 0x66c
Changing smc_pptable.MemVddciVoltage.2 from 3400 to 3360 at 0x66e
Changing smc_pptable.MemVddciVoltage.2 from 3400 to 3360 at 0x66e
Changing smc_pptable.MemVddciVoltage.3 from 3400 to 3360 at 0x670
Changing smc_pptable.MemMvddVoltage.0 from 5000 to 5000 at 0x672
Changing smc_pptable.MemMvddVoltage.1 from 5400 to 5360 at 0x674
Changing smc_pptable.MemMvddVoltage.2 from 5400 to 5360 at 0x676
Changing smc_pptable.MemMvddVoltage.3 from 5400 to 5360 at 0x678
Committing changes to '/sys/class/drm/card0/device/pp_table'.
Traceback (most recent call last):
  File "/usr/local/bin/upp", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/upp/upp.py", line 469, in main
    cli(obj={})()
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/upp/upp.py", line 449, in set
    decode._write_binary_file(pp_file, pp_bytes)
  File "/usr/local/lib/python3.8/dist-packages/upp/decode.py", line 56, in _write_binary_file
    f.close()
OSError: [Errno 62] Timer expired

The GPU also dies instantly, and every sensor reading becomes N/A:

amdgpu-pci-0300
Adapter: PCI adapter
vddgfx:           N/A
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
edge:             N/A  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:         N/A  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:              N/A  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
PPT:              N/A  (cap =   0.00 W)

If I downgrade to AMDGPU drivers <= 21.50.2, everything works fine. Is this a known issue? Did something change in the latest drivers?

sibradzic commented 2 weeks ago

This is definitely a driver issue. The driver most likely crashed when the table changes were committed (check dmesg), and upp is just stuck waiting on it. The pp_table ABI is pretty unstable, especially if you run a fairly recent AMDGPU driver on a fairly old kernel.

You may have better luck with a more recent kernel and the open upstream driver it includes. I did, however, experience driver crashes on 6.10 with any change to pp_table on my RX6600, but it just works on 6.11.
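
For anyone hitting the same traceback: errno 62 on Linux is ETIME, and here it surfaces while the modified table is being written to pp_table. A minimal Python sketch of catching that case and pointing at dmesg, rather than dumping a traceback, might look like the following. This is not upp's actual code; the function name write_pp_table and the returned status strings are made up for illustration.

```python
import errno
import sys


def write_pp_table(pp_file, pp_bytes):
    """Write pp_bytes to pp_file; return "ok", or "timeout" on ETIME.

    Hypothetical helper for illustration, not part of upp.
    """
    try:
        # The buffered data is flushed when the file is closed, so the
        # error can appear on close rather than on write. Leaving the
        # "with" block inside the try covers both cases.
        with open(pp_file, "wb") as f:
            f.write(pp_bytes)
    except OSError as err:
        if err.errno == errno.ETIME:
            print(
                f"Writing {pp_file} timed out (ETIME). The driver may have "
                "crashed while applying the table; check dmesg.",
                file=sys.stderr,
            )
            return "timeout"
        raise  # any other OS error is unexpected, so let it propagate
    return "ok"
```

Called as write_pp_table("/sys/class/drm/card0/device/pp_table", pp_bytes) with root privileges, this only changes how the failure is reported. It does not make the write succeed.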

blackmennewstyle commented 2 weeks ago

Thanks for your reply,

It actually worked, BUT ONLY IF I execute the following first: echo "profile_peak" | sudo tee -a /sys/class/drm/card0/device/power_dpm_force_performance_level 🤯

I never heard of that profile before lol

I have since moved to Ubuntu 24.04 and see the same behavior: I must set that profile before I can modify the pp_table.

I also noticed that echo "50000000" | sudo tee -a /sys/class/drm/card0/device/hwmon/hwmon*/power1_cap no longer works with the ROCM drivers.
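
The profile_peak workaround described above can be sketched in Python as a two-step write: set the performance level first, then commit the table. This is only an illustration under the assumptions in this thread (card0, root privileges, a driver that accepts profile_peak); the helper name write_with_profile is hypothetical and not part of upp.

```python
from pathlib import Path

# Device directory used in this thread; adjust for other cards.
DEVICE = Path("/sys/class/drm/card0/device")


def write_with_profile(device, pp_bytes, level="profile_peak"):
    """Force a performance level, then write the modified pp_table.

    Returns the previous performance level so the caller can decide
    whether to restore it afterwards. Hypothetical helper, not upp code.
    """
    level_file = device / "power_dpm_force_performance_level"
    previous = level_file.read_text().strip()

    # Step 1: switch to the requested profile before touching the table.
    level_file.write_text(level + "\n")

    # Step 2: commit the modified table bytes.
    (device / "pp_table").write_bytes(pp_bytes)
    return previous


# Example (needs root and real hardware, so it is left commented out):
# previous = write_with_profile(DEVICE, modified_table_bytes)
```

The previous level is returned rather than restored automatically, because the thread does not say whether switching back after the write is safe.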