Open markhagemann opened 3 months ago
I would like to see what happens before it segfaults - try running with verbose mode turned on and with sudo (verbose mode does not output to systemd journal).
Seems to hang indefinitely with verbose mode.
➜ sudo python3 nvml-undervolt.py --core-offset 100 --target-clock 1725 --transition-clock 1500 --power-limit 285 --temperature-limit 72
Detected NVIDIA GeForce RTX 3080 Ti (GPU-a2cb5b35-c9ba-34eb-f3ae-1f4687448ffa)
Warning: Persistence mode is already enabled - make sure no other script is controlling clocks
Running main loop (sleep = 0.5)...
[1] 18719 segmentation fault sudo python3 nvml-undervolt.py --core-offset 100 --target-clock 1725 1500
➜ sudo python3 nvml-undervolt.py --verbose --core-offset 100 --target-clock 1725 --transition-clock 1500 --power-limit 285 --temperature-limit 72
Namespace(env=None, index=0, uuid='', core_offset=100, memory_offset=0, target_clock=1725, transition_clock=1500, curve=False, curve_increment=0.0, clock_step=0.0, power_limit=285, temperature_limit=72, pstates=0, sleep=0.5, verbose=True, test=False)
Detected NVIDIA GeForce RTX 3080 Ti (GPU-a2cb5b35-c9ba-34eb-f3ae-1f4687448ffa)
Supported core clocks: [2100, 2085, 2070, 2055, 2040, 2025, 2010, 1995, 1980, 1965, 1950, 1935, 1920, 1905, 1890, 1875, 1860, 1845, 1830, 1815, 1800, 1785, 1770, 1755, 1740, 1725, 1710, 1695, 1680, 1665, 1650, 1635, 1620, 1605, 1590, 1575, 1560, 1545, 1530, 1515, 1500, 1485, 1470, 1455, 1440, 1425, 1410, 1395, 1380, 1365, 1350, 1335, 1320, 1305, 1290, 1275, 1260, 1245, 1230, 1215, 1200, 1185, 1170, 1155, 1140, 1125, 1110, 1095, 1080, 1065, 1050, 1035, 1020, 1005, 990, 975, 960, 945, 930, 915, 900, 885, 870, 855, 840, 825, 810, 795, 780, 765, 750, 735, 720, 705, 690, 675, 660, 645, 630, 615, 600, 585, 570, 555, 540, 525, 510, 495, 480, 465, 450, 435, 420, 405, 390, 375, 360, 345, 330, 315, 300, 285, 270, 255, 240, 225, 210]
Clock step is 15.0 MHz
Warning: Persistence mode is already enabled - make sure no other script is controlling clocks
Setting power limit to 285 W
Setting temperature limit to 72 C
Running main loop (sleep = 0.5)...
Disabling undervolt settings at P5 1500
Setting core offset to 0
Locking core clocks at 0 - 1500
Enabling undervolt settings at P0 1500
Locking core clocks at 1500 - 1725
Setting core offset to 100
Disabling undervolt settings at P5 1500
Setting core offset to 0
Locking core clocks at 0 - 1500
> Disabling undervolt settings at P5 1500
> Setting core offset to 0
> Locking core clocks at 0 - 1500
Is the card under load at this point? Perhaps your card just isn't stable at 1500 MHz with a +100 offset?
Possibly, but I've experimented with other values, including ones that are stable in Windows with Afterburner (will attach a screenshot), such as 1410, and it makes no difference in verbose mode. It all still seems to work, though, so I am content for the moment, but it will be interesting to see if anyone else encounters it.
That segfault might also just be a bug in NVIDIA's library; that wouldn't be a surprise. My script is very simple, using the API as documented, so really nothing else comes to mind right now...
Perhaps there is a difference in the API behavior between Windows and Linux - I tested the script on Windows only.
Edit: I tested it on Linux myself and it does indeed seem to happen. This has to be a platform-specific bug, because I was able to keep the script running for over an hour while playing a game on Windows and it did its job fine.
I imported faulthandler and got some info:
The problem seems to be with setting the P-state clocks.
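For anyone who wants to reproduce the trace, here is a minimal sketch of how faulthandler can be switched on near the top of a script. The helper name `enable_crash_tracebacks` is made up for illustration and is not part of nvml-undervolt.py; only the standard-library `faulthandler` calls are real. On a fatal signal such as SIGSEGV it dumps the Python traceback of every thread, which should show which Python-level call was active when the native library crashed.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=sys.stderr):
    """Hypothetical helper: ask Python to dump tracebacks on fatal signals.

    `stream` must be a real file object with a file descriptor, and it has
    to stay open for as long as faulthandler is enabled.
    """
    if not faulthandler.is_enabled():
        # all_threads=True dumps every thread, not only the crashing one.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # The script's normal main loop would start here (not shown).
```

The same effect should be available without touching the code by starting the interpreter with `python3 -X faulthandler nvml-undervolt.py ...`. Note that the traceback only points at the last Python frame; the crash itself happens inside native code.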