nan0s7 / nfancurve

A small and lightweight POSIX script for using a custom fan curve in Linux for those with an Nvidia GPU.
GNU General Public License v3.0

Issue running on Multi GPU system on Ubuntu 16.04 #1

Closed aryonoco closed 7 years ago

aryonoco commented 7 years ago

Hi,

First of all, thanks for the script. It is exactly what I've been looking for.

Running it on a multi-GPU system on Ubuntu Xenial, I get the following output:

###################################
# nan0s7's fan speed curve script #
###################################

A likely supported driver version was detected.
The fan curves match up!
Good job! :D

Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 1.

Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 25.

Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 85.

./temp.sh: line 142: [: : integer expression expected
./temp.sh: line 148: [: : integer expression expected
./temp.sh: line 142: [: : integer expression expected
./temp.sh: line 148: [: : integer expression expected
./temp.sh: line 142: [: : integer expression expected
./temp.sh: line 148: [: : integer expression expected
./temp.sh: line 142: [: : integer expression expected
./temp.sh: line 148: [: : integer expression expected


And that continues until I quit the script.

My bash scripting is a bit rusty, but happy to help in any way I can.

Cheers

nan0s7 commented 7 years ago

Hi, thanks for the comment! I've been waiting for a guinea pig to test things on a multi-GPU setup ;P.

Something that'd be interesting to see is the output of the variables while the script is running. Can you go down to line 156 and un-comment that echo line? Then just run the script again and see what it does.

Line 142 is where it compares the difference in temperature to the curve values. Can you paste the output of the following two commands:

nvidia-settings -q gpus
nvidia-settings -q fans

That should tell us how NVIDIA handles multi-GPU setups, and hence how to fix this script.

Also I require your opinion on how I should handle this. Do you think I should make the script calculate the fan speed for each GPU individually, and allow each GPU to run its fans at a different speed? Or do you think I should take the average of the read-outs and just use the one fan speed for all GPUs? I may eventually just let the user choose which method the script follows (sort of like a toggle switch or something), but I think for now we should focus on one of them so you and others running multi-GPU setups can at least make use of this script as soon as possible.

Thank you for your help and patience!

aryonoco commented 7 years ago

Thanks for the quick response @nan0s7. This is much easier than me dusting off my old O'Reilly Bash handbook and trying to debug :-)

As for your question about the method of calculating fan speed, I definitely think it's best to calculate it individually. I actually have 6 GPUs in this machine (it's a mining rig, just a scratch that needed to be itched) and I can see that there are considerable differences between the GPU temperatures. Depending on the location of the GPU and the airflow and the quality of the silicon, I can see differences of up to 7-8 degrees between the GPUs. So I think it would be best to calculate and set the fan speed individually.

I uncommented the echo on line 156 and this is the output

###################################
# nan0s7's fan speed curve script #
###################################

A likely supported driver version was detected.
The fan curves match up!
Good job! :D

  Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 1.

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 25.

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 70.

t=55 ot=55 sp=70 tdif=55 slp=3
t=55 ot=55 sp=70 tdif=0 slp=7
t=54 ot=55 sp=70 tdif=1 slp=7
t=53 ot=55 sp=70 tdif=2 slp=5
t=52 ot=55 sp=70 tdif=3 slp=5
t=51 ot=55 sp=70 tdif=4 slp=5
t=51 ot=55 sp=70 tdif=4 slp=5

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 55.

t=50 ot=50 sp=55 tdif=0 slp=7
t=50 ot=50 sp=55 tdif=0 slp=7
t=50 ot=50 sp=55 tdif=0 slp=7
t=51 ot=50 sp=55 tdif=1 slp=7
t=51 ot=50 sp=55 tdif=1 slp=7
t=51 ot=50 sp=55 tdif=1 slp=7
t=51 ot=50 sp=55 tdif=1 slp=7
t=51 ot=50 sp=55 tdif=1 slp=7
t=51 ot=50 sp=55 tdif=1 slp=7
t=51 ot=50 sp=55 tdif=1 slp=7
t=52 ot=50 sp=55 tdif=2 slp=5
t=52 ot=50 sp=55 tdif=2 slp=5
t=52 ot=50 sp=55 tdif=2 slp=5
t=52 ot=50 sp=55 tdif=2 slp=5
t=52 ot=50 sp=55 tdif=2 slp=5
t=52 ot=50 sp=55 tdif=2 slp=5
t=52 ot=50 sp=55 tdif=2 slp=5
t=52 ot=50 sp=55 tdif=2 slp=5

So interestingly, this time it seemed to work, for GPU0. I didn't get the error for line 142 again.

nvidia-settings -q gpus
6 GPUs on madan:0

    [0] madan:0[gpu:0] (GeForce GTX 1070)

      Has the following names:
        GPU-0
        GPU-a2abac8b-4e4c-9973-72b7-6de414e3e6fa

    [1] madan:0[gpu:1] (GeForce GTX 1070)

      Has the following names:
        GPU-1
        GPU-909827fb-4c14-fd05-db00-ceb2f584c7e3

    [2] madan:0[gpu:2] (GeForce GTX 1070)

      Has the following names:
        GPU-2
        GPU-6cb7b538-db2d-2e8e-3fd2-621668366d09

    [3] madan:0[gpu:3] (GeForce GTX 1070)

      Has the following names:
        GPU-3
        GPU-63d02891-f60f-5f04-cefc-b9f7c7702dde

    [4] madan:0[gpu:4] (GeForce GTX 1070)

      Has the following names:
        GPU-4
        GPU-58acc987-4765-7920-a95e-b1c56a455b1a

    [5] madan:0[gpu:5] (GeForce GTX 1070)

      Has the following names:
        GPU-5
        GPU-9d639787-f681-1247-8c71-adcf62e5ab17

nvidia-settings -q fans
6 Fans on madan:0

    [0] madan:0[fan:0] (Fan 0)

      Has the following name:
        FAN-0

    [1] madan:0[fan:1] (Fan 1)

      Has the following name:
        FAN-1

    [2] madan:0[fan:2] (Fan 2)

      Has the following name:
        FAN-2

    [3] madan:0[fan:3] (Fan 3)

      Has the following name:
        FAN-3

    [4] madan:0[fan:4] (Fan 4)

      Has the following name:
        FAN-4

    [5] madan:0[fan:5] (Fan 5)

      Has the following name:
        FAN-5

The only other thing I would add, if you want the script to be as widely applicable as possible, is that some GPUs (some Asus GTX 1070 and 1080 cards, for example) have two fans per GPU. Mine only have 1 fan per GPU, so my machine is not going to be very useful in testing for that, but it's just something I thought you might want to have a think about.

Thanks for your work on this again

nan0s7 commented 7 years ago

Haha no problem, I just want this to work for everyone! :D

Thanks for your input! I have an idea on how to put this in so hopefully it shouldn't be too difficult.

Yeah, I'm aware of some GPUs having more than one fan. My current GPU - the GTX 1070 overclocked EVGA edition - has 2 fans on it, but nvidia-settings treats them as one. However, I will add a function to count the number of fans, so hopefully it's more future-proof if NVIDIA decide to change how they count fans and such. Plus I'm still hoping to add nouveau support when they get up to speed, and I'm assuming they count fans differently again, so that'll help whenever that occurs.

No problem, I'm glad I'm not the only one getting some use out of it! Plus this is something I've been wanting to add to my script for a while, so I'm excited to finally get this working! :D

P.S. If you need this working in a hurry, here's a dodgy solution: make 6 copies of temp.sh, and in each copy modify the gpu variable to point at one of your GPUs, and on line 122 change the [fan:0] part to match the fan for whatever GPU you wanna control. Then just run all 6 versions of the script at the same time (make sure they have different names though, like temp0.sh, temp1.sh, etc.), which should work how you want it to. Just something you can try if you want ;P
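
A rough sketch of that stop-gap, for anyone who would rather script the copying than do it by hand. It assumes temp.sh sets the GPU index on a line of the form gpu=0 (the exact form of that line is not shown in this thread) and addresses the fan as [fan:0]; make_copies is a hypothetical helper name, not something from the script.

```shell
# Create tempN.sh copies of a single-GPU script, one per GPU index.
# ASSUMPTIONS: the source script contains a line "gpu=<number>" and uses
# the literal text "[fan:0]" for the fan target. Adjust the sed patterns
# if the real script differs.
make_copies() {
    src=$1      # path to the original script, e.g. temp.sh
    count=$2    # number of GPUs (6 in this thread)
    i=0
    while [ "$i" -lt "$count" ]; do
        sed -e "s/^gpu=.*/gpu=$i/" \
            -e "s/\[fan:0\]/[fan:$i]/g" \
            "$src" > "temp$i.sh"
        chmod +x "temp$i.sh"
        i=$((i + 1))
    done
}
```

Calling make_copies temp.sh 6 would then leave temp0.sh through temp5.sh in the current directory, each of which can be started in the background.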

aryonoco commented 7 years ago

Thanks. I'm not in a hurry but I might try running the 6 versions together.

In the meantime, I'm now again getting those error messages:

###################################
# nan0s7's fan speed curve script #
###################################

A likely supported driver version was detected.
The fan curves match up!
Good job! :D

  Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 1.

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 25.

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 85.

t=56 ot=56 sp=85 tdif=56 slp=3
./temp0.sh: line 142: [: : integer expression expected
./temp0.sh: line 148: [: : integer expression expected
t=56 ot=56 sp=85 tdif=0 slp=7
./temp0.sh: line 142: [: : integer expression expected
./temp0.sh: line 148: [: : integer expression expected
t=55 ot=56 sp=85 tdif=1 slp=7
./temp0.sh: line 142: [: : integer expression expected
./temp0.sh: line 148: [: : integer expression expected
t=52 ot=56 sp=85 tdif=4 slp=7

It's actually not an error as it's still working.

nan0s7 commented 7 years ago

Yeah, that error really just tells you that it's not changing the fan speed based on the temperature of the GPU. I'm curious: this makes me think that with more than one GPU, the output of some of the other nvidia-settings commands may be slightly different.

If you're not busy, paste the outputs of the following commands:

nvidia-settings -q=[gpu:0]/GPUCoreTemp -t
nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=50"

With the last one, you don't have to put 50 as the new fan speed, it can be whatever you like. I just picked that at random. And try with different GPU and fan numbers if you like (you don't have to test all of the GPUs/fans though).

If those commands return anything about the other fans/GPUs then I can take that into account in the new version, but if they don't then we've found a hidden bug in my script... xP
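
For checking several GPUs in one go, the temperature query above can be wrapped in a loop. This is only a sketch: gpu_temp, print_temps and the NVS_CMD override are hypothetical names added for illustration, and the query string is copied from the command quoted in this thread.

```shell
# NVS_CMD is a hypothetical override so the sketch can be tried without
# an NVIDIA card; by default it runs the real nvidia-settings.
NVS_CMD=${NVS_CMD:-nvidia-settings}

# Print the core temperature of GPU number $1.
gpu_temp() {
    "$NVS_CMD" "-q=[gpu:$1]/GPUCoreTemp" -t
}

# Print one "gpu=N t=TEMP" line for GPUs 0 .. ($1 - 1).
print_temps() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "gpu=$i t=$(gpu_temp "$i")"
        i=$((i + 1))
    done
}
```

On the six-card machine discussed here, print_temps 6 would print one line per GPU.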

nan0s7 commented 7 years ago

Oh, and another question: would you prefer an independent fan curve for each GPU? Or just keep the single curve for every GPU? Of course each GPU can be at a different fan speed at any given time, but they'll all follow the same rules / do the same calculations.

aryonoco commented 7 years ago

One fancurve for all GPUs is fine. Doesn't need independent fancurves, in fact I think that makes it overly complicated without much benefit.

I'm running multiple versions of the script with the same fancurve as per your suggestion and it's been working fine.

One problem I have however noticed is that on exit of the script, it sets gpu0 back to auto mode, not the gpu that I declared on line 12.

Here are the outputs you requested:

$ nvidia-settings -q=[gpu:0]/GPUCoreTemp -t
62
$ nvidia-settings -a "[gpu:0]/GPUFanControlState=1"

  Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 1.

$ nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=50"

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 50.

$ nvidia-settings -q=[gpu:2]/GPUCoreTemp -t
56
$ nvidia-settings -a "[gpu:2]/GPUFanControlState=1"

  Attribute 'GPUFanControlState' (madan:0[gpu:2]) assigned value 1.

$ nvidia-settings -a "[fan:2]/GPUTargetFanSpeed=50"

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:2]) assigned value 50.

nan0s7 commented 7 years ago

Yeah, that's my bad, I forgot about that. Whoops, not line 133: in the "set_fan_control" function, change the command to:

nvidia-settings -a "[gpu:"$gpu"]/GPUFanControlState="$1

I hope that's what it kinda looked like before... I've already changed that function in the new version. :P
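
For context, a guess at what the corrected function might look like as a whole. Only the nvidia-settings command and the set_fan_control name come from this thread; the surrounding function body, the default for gpu, and the NVS_CMD override are assumptions made for illustration.

```shell
# NVS_CMD is a hypothetical override for dry runs; it defaults to the real tool.
NVS_CMD=${NVS_CMD:-nvidia-settings}

# Index of the GPU this copy of the script controls (assumed default: 0).
gpu=${gpu:-0}

# $1 = 1 enables manual fan control, $1 = 0 hands control back to the driver.
# Using $gpu here (instead of a hard-coded 0) means the exit handler resets
# the same GPU the script was configured for.
set_fan_control() {
    "$NVS_CMD" -a "[gpu:${gpu}]/GPUFanControlState=$1"
}
```

With gpu=2 set near the top of the script, set_fan_control 0 would then target [gpu:2] rather than [gpu:0].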

nan0s7 commented 7 years ago

Okay I think I've got a working version for you to try. Just paste this into a script and run it (I was going to use pastebin but I don't want people taking it from somewhere other than GitHub, so I apologise for the inconvenience). Apparently I can't upload a .sh file so I just did a .txt.

I am expecting an issue with the eles variable though, but we'll see what it says when you test it.

temp.txt temp.txt

EDIT: I've spent some time reducing repeated code and hopefully optimised a few things during the process. I'm hoping this will be close to done providing everything goes well during your tests. :)

aryonoco commented 7 years ago

Thanks for the update.

Here is the output of the new script:

###################################
# nan0s7's fan speed curve script #
###################################

A likely supported driver version was detected.
The fan curves match up!
Good job! :D

  Attribute 'GPUFanControlState' (madan:0[gpu:1]) assigned value 1.

./temp_multi.sh: line 172: [: : integer expression expected

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:5]) assigned value 1.

Started process for n-GPUs and n-Fans

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 40.

t=38 ot=38 sp=40 tdif=38 slp=3 gpu=0
./temp_multi.sh: line 83: [: : integer expression expected
./temp_multi.sh: line 86: 40 -  : syntax error: operand expected (error token is "-  ")

  Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:1]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:2]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:3]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:4]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:5]) assigned value 0.

Fan control set back to auto mode.

Successfully caught exit & cleared variables!

At this point the script set all the GPUs back to auto mode and exited.

nan0s7 commented 7 years ago

Alright I made a mistake when initialising every GPU and stuff, so hopefully I've fixed that and that also fixes some of the other errors. Thanks for testing! :D

temp.txt

EDIT: fixed a minor mistake

temp.txt

aryonoco commented 7 years ago

Here's the output:

###################################
# nan0s7's fan speed curve script #
###################################

A likely supported driver version was detected.
The fan curves match up!
Good job! :D
Number of Fans detected: 6
Number of GPUs detected: 6

  Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 1.

  Attribute 'GPUFanControlState' (madan:0[gpu:5]) assigned value 1.

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 25.

Started process for n-GPUs and n-Fans

  Attribute 'GPUTargetFanSpeed' (madan:0[fan:0]) assigned value 40

t=42 ot=42 sp=40 tdif=42 slp=3 gpu=0
./temp_multi.sh: line 81: [: : integer expression expected
./temp_multi.sh: line 84: 41 -  : syntax error: operand expected (error token is "-  ")

  Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:1]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:2]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:3]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:4]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:5]) assigned value 0.

Fan control set back to auto mode.

Successfully caught exit & cleared variables!

At this point, the script set all GPUs back to auto and exited.

nan0s7 commented 7 years ago

Alright, I haven't tested this to see if this still works on 1 GPU but I don't see why it shouldn't (away from my NVIDIA machine atm).

temp.txt

aryonoco commented 7 years ago

Thanks.

Attached is the sample output.

At the end I had to manually exit the script as I was getting too many errors and wasn't confident that it was doing the right thing.

output.txt

nan0s7 commented 7 years ago

Bugger, I must have uploaded the wrong version. Line 174 says:

seq 0 $2

Change that zero to a one. That should fix most of those errors. It'll at least hopefully show us the real problem.

Again, thanks for testing. Bash is one heck of a language to deal with.

aryonoco commented 7 years ago

Hey,

Attached is the sample run output.

I again manually exited the script at the end.

output.txt

nan0s7 commented 7 years ago

Wow, that's embarrassing for me. That "fix" I made away from my NVIDIA computer was not right at all. I apologise for that brain-fart... hopefully I've fixed that and haven't introduced any more problems now...

temp.txt
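
A side note on the loop bounds discussed above. The nvidia-settings listings earlier in this thread number GPUs and fans from 0 (gpu:0 to gpu:5 for six cards), so a loop over N devices needs the indices 0 to N-1; seq 0 N yields one value too many and seq 1 N skips index 0. Line 174 itself is not shown here, so the following is only an illustration of the indexing, with gpu_indices as a hypothetical helper name.

```shell
# Print the device indices 0 .. N-1, one per line, for N detected devices.
gpu_indices() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "$i"
        i=$((i + 1))
    done
}
```

A loop such as "for g in $(gpu_indices 6)" then visits each of the six GPUs exactly once.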

aryonoco commented 7 years ago

No need to apologise mate. I'm happy to test it until we get there in the end.

Here's the sample run output.

Seemed like the fan speed was stuck on 25% even though the temperatures were going up, so at that point I manually exited the script.

output.txt

nan0s7 commented 7 years ago

Shouldn't be too long now. I've fixed some more of my late-night errors. Hopefully now the fan speed should adjust... :D

EDIT: I just spent some time going through it and I've removed the part where during initialisation, it sets the fan speed to whatever is the first speed in the speed curve. Now it'll calculate the initial fan speed directly from the temperature when the script starts. Which also fixes an issue you had last time too. I also did a lot of minor optimisations.

New version: temp.txt (EDIT 2: Forgot a bit of redundant code, so re-uploaded new version)

Old version: temp.txt
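
To make "calculate the initial fan speed directly from the temperature" concrete, here is a minimal sketch of a curve lookup in POSIX sh. The curve values and the names tcurve, fcurve, max_speed and speed_for_temp are hypothetical; the attached script may do this quite differently.

```shell
# Hypothetical curve: temperature thresholds (deg C) and the fan speed (%)
# to use at or below each threshold. Both lists must have the same length.
tcurve="35 45 55 65 75"
fcurve="25 40 55 70 85"
max_speed=100   # used when the temperature is above every threshold

# Print the fan speed for temperature $1.
speed_for_temp() {
    temp=$1
    # Load the speeds into the positional parameters so that $1 always
    # holds the speed belonging to the threshold being examined.
    set -- $fcurve
    for threshold in $tcurve; do
        if [ "$temp" -le "$threshold" ]; then
            echo "$1"
            return 0
        fi
        shift
    done
    echo "$max_speed"
}
```

Because the lookup depends only on the current temperature, the same function can serve both the initial setting and every later update.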

aryonoco commented 7 years ago

Hmmm, so this is getting interesting!

I ran the "new" script a few times, and each time I got slightly different results.

On the first run, it assigned fan[5] to 85% but didn't change the other fans. Then fan[5] got stuck at 85% even when its temperature came down. It also complained about line 192 and 198.

Second run, it set fan[5] to the correct value, and was changing its fan speed correctly with temperature. However it did not change the other GPUs' fans. No complaints about line 192 and 198 this time.

Third run was also similar, but slightly different values. I thought I'd include it in case it can help you debug.

Each time I manually exited the script.

output1.txt output2.txt output3.txt

nan0s7 commented 7 years ago

Oh damn. That is very interesting.

Well this time I don't actually see something wrong with the code... so I've tried something. In the first version, I've just changed some minor things and I've also added a lot more debugging information so hopefully it'll narrow down where the bug is. In the second version, I've taken the first version and I've changed some things related to one of the variables that wasn't GPU-dependent. Which is something I was concerned about being an issue in a previous version. So I'm looking forward to see what happens now... :P

temp.txt temp2.txt

EDIT: Forgot to change one instance of the old variable, now fixed.

aryonoco commented 7 years ago

Alright, here is the output.

Note that these runs are all with your default fancurve and I have not changed anything in any of the runs.

===================

Temp1:

Run1: fan5 set to 85% but didn't change fan0 to fan4 even though their temperature was high.

Run2: fan1 and fan2 at 100% and didn't reduce speed even as the temperature came down. fan0 got stuck on 85%, and it didn't seem to change fan3 and fan5 at all.

Run3:
fan0 stuck on 70% even when t came down
fan1 stuck at 55%
fan2 stuck at 85% even as the t came down
fan3 stuck on 32% even as t went up
fan4 stuck on 30% even as t went up
fan5 stuck on 29% even as t went up

Temp2:

Run1: Fan0 to fan4 set to low speeds as the t went up. Fan 5 set to 85% and didn't change. Eventually, it kicked up the speed for fan0 fan1 fan 3 and fan 4. Fan 2 stayed at 38% even as the t went up. Fan5 stayed at 85% throughout.

Run2: Again fan5 is set to 85% while fan0 to fan4 are stuck on thirty something percent. Fan4 seems to be working, goes to 100% when the t hits 75 and then comes down. fan2 is stuck on 38% even as t hits 77.

Run3: fan2 is stuck at 100% to begin with. Eventually it seems like it is working, but the behaviour is erratic and doesn't follow the fancurve. It goes from 38% suddenly to 100%. And then it doesn't bring the speed down when the t comes down.

==============

I really can't figure out what's happening. Every single run is different. I know this is a mess to debug.

temp1run1.txt temp1run2.txt temp1run3.txt temp2run1.txt temp2run2.txt temp2run3.txt

nan0s7 commented 7 years ago

That's all good. Looking through the logs it seems like some of the variables are becoming equal to a null value. I think there's something weird going on with one of the variables.
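
An empty variable is exactly what produces the "[: : integer expression expected" message seen earlier: the test command receives an empty string where it expects a number. One way to guard against that is sketched below; is_integer and temp_exceeds are hypothetical helper names, not functions from the script.

```shell
# Succeed only if $1 is a non-empty string of digits.
is_integer() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# usage: temp_exceeds TEMP THRESHOLD
# Returns 0 if TEMP > THRESHOLD, 1 if not, and 2 (with a warning on stderr)
# if either value is empty or not a number, instead of letting [ ] fail.
temp_exceeds() {
    if ! is_integer "$1" || ! is_integer "$2"; then
        echo "skipping comparison: non-integer value ('$1' vs '$2')" >&2
        return 2
    fi
    [ "$1" -gt "$2" ]
}
```

The warning shows which value went missing, which should make it easier to trace the variable that ends up empty.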

I've changed a couple of minor things in the first attached script, which I suspect may have been causing some sort of chain reaction of errors. The second script is a little more exciting, in my opinion. While I was working through the bugs from last time I thought of a new method of calculating the new fan speed; the new method is not only cleaner to work with, it's also a bit faster. This also means I can get rid of some of the variables that I didn't really enjoy putting in the first version. The new script seems to be working perfectly on my system, so I'm curious to see how it works on yours.

Hopefully this time the results will be a little more consistent... :D

temp.txt temp2.txt

aryonoco commented 7 years ago

Hey,

So the first script doesn't work. Again, the same problem, fans getting stuck in the wrong place etc.

But...

The second one works beautifully!

I first checked it with the default fancurve. Looked promising. Then I customised it with my own fancurve which has a lot more points in it, expecting it to fail somewhere. But no! It works beautifully.

The CPU overhead seems lower as well.

Here's the output of the run for the past few hours for you to enjoy your beautiful handiwork (careful, the file is over 4MB, make sure your text editor can handle it...)

Excellent job and thanks again.

output.txt

nan0s7 commented 7 years ago

Holy crepe!!

I was not expecting to hear such good news!

Going through small sections of the log you sent, it seems like the logic and the variables are all what you'd expect them to be so it doesn't seem like it's just faking it... ;P

Well, enjoy the beta version! I'm going to tweak it some more over the next day or so to make it cleaner and possibly slightly more efficient. So if you're happy for one more test, then I'll post it here for you to try before I push it to master, just to make sure I didn't break anything in my excitement. :D

No, thank you for testing this for me! I wasn't expecting this to happen (the request and such) so I was prepared to wait a while before I upgrade my computer (which'll be in a year or so...), and hence I am quite happy I was able to check this off my wish-list sooner rather than later.

So don't hesitate to message me or create another issue report for a feature request or anything in the future, and good luck with your mining!

aryonoco commented 7 years ago

24 hours later, still working beautifully.

Absolutely would love to test your new/cleaner version :-)

nan0s7 commented 7 years ago

Alright I've stared at the code for long enough now to know I can't think of many more ways to make it quicker/more efficient. So here is the version I'll likely be pushing to master if you don't find anything wrong with it in your testing.

I still have the debugging information on, but of course in the actual release I'll comment it out which should reduce CPU usage by another marginal amount.

At the very least I reduced the size of the script (282 lines right now I believe) and that translates into less potential memory being used.

temp.txt

EDIT: I actually did find a small optimisation I could do, so that is what is changed in this newer version. But it's only really for people with 1 GPU. Still wanna make sure it works for everyone though! temp.txt

Let me know if you notice anything weird! :D

aryonoco commented 7 years ago

I've just done a quick test of the last version that you posted and it seems to be working fine.

I'll now load my own fancurve and run it for a day or so and will report how it goes.

Here is a sample output.

output.txt

Edit: Oh I can now edit the sleep times as well. Nice!

aryonoco commented 7 years ago

OK, maybe a bug here somewhere.

I noticed this morning that one of the fans was set to 0%. Fortunately the temperature still had not gone up too much.

It looked like Xorg was using 100% of the CPU. Maybe something to do with nvidia-settings as I know it uses that.

Setting the fans to auto stopped Xorg from hogging the CPU.

Not sure if the bug is in this script or somewhere else. Will try and investigate more.

Edit:

Ok, running the script again:

###################################
# nan0s7's fan speed curve script #
###################################

A likely supported driver version was detected.
The fan curves match up!
Good job! :D
Number of Fans detected:
5
Number of GPUs detected:
6

  Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 1.

  Attribute 'GPUFanControlState' (madan:0[gpu:1]) assigned value 1.

  Attribute 'GPUFanControlState' (madan:0[gpu:2]) assigned value 1.

  Attribute 'GPUFanControlState' (madan:0[gpu:3]) assigned value 1.

  Attribute 'GPUFanControlState' (madan:0[gpu:4]) assigned value 1.

  Attribute 'GPUFanControlState' (madan:0[gpu:5]) assigned value 1.

Submit an issue on my GitHub page... happy to fix this :D

  Attribute 'GPUFanControlState' (madan:0[gpu:0]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:1]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:2]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:3]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:4]) assigned value 0.

  Attribute 'GPUFanControlState' (madan:0[gpu:5]) assigned value 0.

Fan control set back to auto mode.
Successfully caught exit & cleared variables!

Obviously an issue with one of the nvidia tools.

nan0s7 commented 7 years ago

Oh that's really weird.

When you see this happen, try getting the output of nvidia-settings -q fans and nvidia-settings -q gpus. I think the NVIDIA drivers were updated the other day so it could be that the way my script extracts the number from the whole output could be a bit off now.

I'll see if I can find anything online and I'll look through my code again to see if I made an error.

EDIT: Actually don't worry about those commands, I've added them to the end of the script. So next time the number of fans doesn't equal the number of GPUs detected, it'll print the output of those two nvidia-settings commands.
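
A sketch of how such a check could work, based on the "6 GPUs on madan:0" and "6 Fans on madan:0" header lines shown earlier in this thread. The function names and the NVS_CMD override are hypothetical, and the script's real parsing may differ.

```shell
# NVS_CMD is a hypothetical override for dry runs; it defaults to the real tool.
NVS_CMD=${NVS_CMD:-nvidia-settings}

# Read nvidia-settings output on stdin and print the leading number of the
# first line that starts with one, e.g. "6" from "6 GPUs on madan:0".
count_from_output() {
    sed -n 's/^ *\([0-9][0-9]*\) .*/\1/p' | head -n 1
}

# Print the device count if GPUs and fans agree; otherwise dump both raw
# outputs to stderr for debugging and return 1.
check_counts() {
    gpus_out=$("$NVS_CMD" -q gpus)
    fans_out=$("$NVS_CMD" -q fans)
    num_gpus=$(printf '%s\n' "$gpus_out" | count_from_output)
    num_fans=$(printf '%s\n' "$fans_out" | count_from_output)
    # String comparison on purpose: an empty count must not trigger an
    # "integer expression expected" error here.
    if [ "$num_gpus" != "$num_fans" ]; then
        echo "Mismatch: '$num_gpus' GPUs vs '$num_fans' fans" >&2
        printf '%s\n%s\n' "$gpus_out" "$fans_out" >&2
        return 1
    fi
    echo "$num_gpus"
}
```

In the failing run above (5 fans, 6 GPUs) this would print both listings, which is the information needed to see what the driver actually reported.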

I've done some moving around and editing, mainly to make sure that there isn't any cross-contamination in the variable values and such, and I've also made it so you can't have two scripts running at the same time (don't really need that anymore since multi-GPU support is technically working for the most part).

Here's the latest version: temp.txt

EDIT2: I haven't added this version, but I've added the ability for the latest version of the script to kill the previous/already running version of the script. So now instead of exiting it'll kill the old process. Just thought I'd add that :P
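
The thread does not show how the script finds and kills an earlier instance. One common approach is a PID file, sketched below as an assumption rather than a description of the actual implementation; PIDFILE, its path, and kill_previous are hypothetical.

```shell
# Hypothetical location for the PID file.
PIDFILE=${PIDFILE:-/tmp/nfancurve-example.pid}

# Stop a previously started instance (if it is still alive), then record
# this shell's PID so the next start can do the same.
kill_previous() {
    if [ -f "$PIDFILE" ]; then
        old_pid=$(cat "$PIDFILE")
        # kill -0 sends no signal; it only tests whether the process exists.
        if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
            kill "$old_pid"
        fi
    fi
    echo "$$" > "$PIDFILE"
}
```

A PID file only tracks one earlier process, which fits the limitation mentioned later in the thread that only a single running copy can be killed. Note that a stale file could in principle name an unrelated process that reused the PID, so a more careful version would also check the process name.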

aryonoco commented 7 years ago

Testing it now. Will let you know how it goes.

Killing an existing running script is an excellent idea. I've had multiple versions running at the same time by accident, and had to kill them manually. This makes sure that doesn't happen again!

aryonoco commented 7 years ago

Have not been able to replicate the issue again. It's running 24/7 and working very well for the time being.

Feel free to post the latest version for me to test if you're feeling good about it.

nan0s7 commented 7 years ago

I dunno if that's a good or bad thing that you can't recreate the bug again... :P

Hopefully it's gone but let me know if it comes back or whatever.

Here's the version I'm planning on pushing to master now. It can only kill one already running version of the script right now, but I'm planning on fixing that later on with a separate script. I'll also add this functionality (killing all running processes of the script) to the update script when I've got it working well. But I think that can wait for a future version to give myself some time to make it better.

temp.txt

aryonoco commented 7 years ago

Testing now. Will report how it goes.

aryonoco commented 7 years ago

So it's been running for a couple of days 24/7 now, and I haven't had any issues.

civyshk commented 7 years ago

Hey @aryonoco, I'm developing a similar (python) script and I'd also like to add multi-gpu support. It would be very helpful if you post the output of both of these commands:

nvidia-settings -q fans
nvidia-settings -q gpus

Edit: Oh, I skipped miserably the first posts where this is already asked and answered. Thanks.

nan0s7 commented 7 years ago

@aryonoco awesome I'm going to push the update to master then, again thanks for your help! :D

@civyshk he's already done this for me, it may be helpful to you if you read the whole of this thread :)