jacklul / nvml-scripts

Scripts to control NVIDIA GPUs using NVML API
MIT License

SEGV fault from PState #1

Open markhagemann opened 3 months ago

markhagemann commented 3 months ago

I imported faulthandler and got some info:

Jul 20 22:41:46 drache nvml-undervolt[86317]: Warning: Persistence mode is already enabled - make sure no oth>
Jul 20 22:43:35 drache nvml-undervolt[86317]: Fatal Python error: Segmentation fault
Jul 20 22:43:35 drache nvml-undervolt[86317]: Current thread 0x000076ad66104740 (most recent call first):
Jul 20 22:43:35 drache nvml-undervolt[86317]:   File "/home/drache/.pyenv/versions/3.12.4/lib/python3.12/site>
Jul 20 22:43:35 drache nvml-undervolt[86317]:   File "/usr/local/sbin/nvml-undervolt", line 154 in set_pstate>
Jul 20 22:43:35 drache nvml-undervolt[86317]:   File "/usr/local/sbin/nvml-undervolt", line 377 in main
Jul 20 22:43:35 drache nvml-undervolt[86317]:   File "/usr/local/sbin/nvml-undervolt", line 441 in <module>
Jul 20 22:43:35 drache systemd[1]: nvml-undervolt.service: Main process exited, code=dumped, status=11/SEGV
Jul 20 22:43:35 drache systemd[1]: nvml-undervolt.service: Failed with result 'core-dump'.
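
For anyone trying to reproduce this, here is a minimal sketch of how a traceback like the one above can be captured with the standard library's faulthandler module. Where exactly the call was placed in nvml-undervolt is an assumption, since the thread does not show it; putting it near the top of the script, before any NVML call, is the usual choice.

```python
import faulthandler

# Dump the Python traceback to stderr if the process receives a fatal
# signal such as SIGSEGV. Under systemd, stderr normally ends up in the
# journal, which is where the lines above were read from.
faulthandler.enable()

# Sanity check that the handler is installed before the risky calls run.
print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect should be available without editing the script by starting it with `python3 -X faulthandler` or by setting `PYTHONFAULTHANDLER=1` in the service environment.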

The problem seems to be with setting the P-state clocks:

    def set_pstate_clocks(handle, clock_type, clock_offset, target_pstates):
        for pstate in range(0, target_pstates + 1):
            struct = c_nvmlClockOffset_t()
            struct.version = nvmlClockOffset_v1
            struct.type = clock_type
            struct.pstate = pstate
            struct.clockOffsetMHz = clock_offset
            return nvmlDeviceSetClockOffsets(handle, struct)

jacklul commented 3 months ago

I would like to see what happens before it segfaults - try running with verbose mode turned on and with sudo (verbose mode does not output to systemd journal).

markhagemann commented 3 months ago

Seems to hang indefinitely with verbose mode.

➜ sudo python3 nvml-undervolt.py --core-offset 100 --target-clock 1725 --transition-clock 1500 --power-limit 285 --temperature-limit 72
Detected NVIDIA GeForce RTX 3080 Ti (GPU-a2cb5b35-c9ba-34eb-f3ae-1f4687448ffa)
Warning: Persistence mode is already enabled - make sure no other script is controlling clocks
Running main loop (sleep = 0.5)...
[1]    18719 segmentation fault  sudo python3 nvml-undervolt.py --core-offset 100 --target-clock 1725  1500
➜ sudo python3 nvml-undervolt.py --verbose --core-offset 100 --target-clock 1725 --transition-clock 1500 --power-limit 285 --temperature-limit 72
Namespace(env=None, index=0, uuid='', core_offset=100, memory_offset=0, target_clock=1725, transition_clock=1500, curve=False, curve_increment=0.0, clock_step=0.0, power_limit=285, temperature_limit=72, pstates=0, sleep=0.5, verbose=True, test=False)
Detected NVIDIA GeForce RTX 3080 Ti (GPU-a2cb5b35-c9ba-34eb-f3ae-1f4687448ffa)
Supported core clocks: [2100, 2085, 2070, 2055, 2040, 2025, 2010, 1995, 1980, 1965, 1950, 1935, 1920, 1905, 1890, 1875, 1860, 1845, 1830, 1815, 1800, 1785, 1770, 1755, 1740, 1725, 1710, 1695, 1680, 1665, 1650, 1635, 1620, 1605, 1590, 1575, 1560, 1545, 1530, 1515, 1500, 1485, 1470, 1455, 1440, 1425, 1410, 1395, 1380, 1365, 1350, 1335, 1320, 1305, 1290, 1275, 1260, 1245, 1230, 1215, 1200, 1185, 1170, 1155, 1140, 1125, 1110, 1095, 1080, 1065, 1050, 1035, 1020, 1005, 990, 975, 960, 945, 930, 915, 900, 885, 870, 855, 840, 825, 810, 795, 780, 765, 750, 735, 720, 705, 690, 675, 660, 645, 630, 615, 600, 585, 570, 555, 540, 525, 510, 495, 480, 465, 450, 435, 420, 405, 390, 375, 360, 345, 330, 315, 300, 285, 270, 255, 240, 225, 210]
Clock step is 15.0 MHz
Warning: Persistence mode is already enabled - make sure no other script is controlling clocks
Setting power limit to 285 W
Setting temperature limit to 72 C
Running main loop (sleep = 0.5)...
Disabling undervolt settings at P5 1500
Setting core offset to 0
Locking core clocks at 0 - 1500
Enabling undervolt settings at P0 1500
Locking core clocks at 1500 - 1725
Setting core offset to 100
Disabling undervolt settings at P5 1500
Setting core offset to 0
Locking core clocks at 0 - 1500
jacklul commented 3 months ago

> Disabling undervolt settings at P5 1500
> Setting core offset to 0
> Locking core clocks at 0 - 1500

Is the card under load at this point? Perhaps your card just isn't stable at 1500 MHz with a +100 offset?

markhagemann commented 3 months ago

Possibly, but I've experimented with other values, including ones that are stable on Windows with Afterburner (will attach screenshot), such as 1410, and it doesn't make a difference in verbose mode. It all still seems to work, though, so I am quite content for the moment, but it will be interesting to see if anyone else encounters it.

[Screenshot: undervolt]

jacklul commented 3 months ago

That segfault might also just be a bug in NVIDIA's library, which wouldn't be a surprise. My script is very simple and uses the API as documented, so nothing else comes to mind right now...

Perhaps there is a difference in the API behavior between Windows and Linux - I tested the script on Windows only.

Edit: I tested it on Linux myself and it does indeed seem to happen. This has to be a platform-specific bug, because on Windows I was able to keep the script running for over an hour while playing a game and it did its job fine.
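
One way to narrow this down further on Linux is to run the suspect call in a child process, so the parent survives the crash and can report which step died. The sketch below is not part of nvml-scripts: `run_isolated`, `describe` and `CHILD_CODE` are hypothetical names, and the child only simulates a native crash by sending itself SIGSEGV. A real test would import the NVML bindings in the child and make the single call under suspicion instead. It assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys

# Stand-in for the real work: the child kills itself with SIGSEGV, which
# mimics a crash inside a native library (assumption, no NVML involved).
CHILD_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_isolated(code: str) -> int:
    """Run `code` in a fresh interpreter and return its exit status."""
    result = subprocess.run([sys.executable, "-c", code])
    return result.returncode


def describe(returncode: int) -> str:
    # On POSIX, a return code of -N means the child died from signal N.
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    print(describe(run_isolated(CHILD_CODE)))
```

If the child dies from SIGSEGV while doing nothing but the one library call, that would point at the driver library rather than at the script's own logic.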