vmatare / thinkfan

The minimalist fan control program
GNU General Public License v3.0
541 stars 62 forks source link

final work on libsensors support #163

Closed vmatare closed 2 years ago

koutheir commented 2 years ago

I've been using thinkfan with LM sensors support over a few days, and I noticed that it fails to continue operation after waking up from sleep. This is an excerpt of the systemd journal:

... systemd[1]: Starting Notify thinkfan of imminent sleep...
... thinkfan[3459640]: Going to sleep: Will allow sensor read errors for the next 4 loops.
... systemd[1]: thinkfan-sleep.service: Deactivated successfully.
... systemd[1]: Finished Notify thinkfan of imminent sleep.
... systemd[1]: Reached target Sleep.
... systemd[1]: Starting System Suspend...
... systemd-sleep[3461338]: Entering sleep state 'suspend'...
... kernel: PM: suspend entry (deep)
...
... kernel: Restarting tasks ... 
... kernel: pci_bus 0000:05: Allocating resources
... kernel: done.
... systemd-resolved[196853]: Clock change detected. Flushing caches.
...
... thinkfan[3459640]: ERROR: A sensor has vanished! Exiting since there's no safe way of handling this.
... systemd[1]: thinkfan.service: Main process exited, code=exited, status=1/FAILURE
... systemd[1]: thinkfan.service: Failed with result 'exit-code'.
... kernel: thermal thermal_zone9: failed to read out thermal zone (-61)
...
... kernel: PM: suspend exit
...
... systemd-sleep[3461338]: System returned from sleep state.
... systemd[1]: systemd-suspend.service: Deactivated successfully.
... systemd[1]: Finished System Suspend.
... systemd[1]: Stopped target Sleep.
... systemd[1]: Reached target Suspend.
... systemd[1]: Starting Reload thinkfan after waking up from suspend...
...
... systemd[1]: Stopped target Suspend.
...
... systemd[1]: thinkfan-wakeup.service: Main process exited, code=exited, status=1/FAILURE
... systemd[1]: thinkfan-wakeup.service: Failed with result 'exit-code'.
... systemd[1]: Failed to start Reload thinkfan after waking up from suspend.

Also, the service status is:

$ sudo systemctl status thinkfan
x thinkfan.service - Thinkfan, the minimalist fan control program
     Loaded: loaded (/usr/local/lib/systemd/system/thinkfan.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/thinkfan.service.d
             `-default.conf, override.conf
     Active: failed (Result: exit-code) since Fri 2021-10-29 10:30:11 EDT; 4min 31s ago
    Process: 3459639 ExecStart=/usr/local/sbin/thinkfan $THINKFAN_ARGS (code=exited, status=0/SUCCESS)
   Main PID: 3459640 (code=exited, status=1/FAILURE)
        CPU: 156ms

... thinkfan[3459640]: Temperatures(bias): 57(0), 76(0), 65(0), 64(0), 66(0), 75(0), 63(0), 65(0), 65(0), 63(0), 36(0), 35(0), 45(0), 38(0), 69(0), 69(0), 57(0), 70(0), 0(0), 58(0), 63(0), 67(0) -> Fans: level 6
... thinkfan[3459640]: Temperatures(bias): 57(0), 66(0), 66(0), 63(0), 62(0), 63(0), 60(0), 62(0), 61(0), 60(0), 37(0), 35(0), 45(0), 38(0), 68(0), 68(0), 57(0), 71(0), 0(0), 59(0), 64(0), 66(0) -> Fans: level 5
... thinkfan[3459640]: Temperatures(bias): 57(0), 80(0), 80(0), 67(0), 71(0), 74(0), 64(0), 68(0), 71(0), 64(0), 37(0), 35(0), 45(0), 38(0), 70(0), 70(0), 58(0), 72(0), 0(0), 60(0), 65(0), 68(0) -> Fans: level 7
... thinkfan[3459640]: Temperatures(bias): 57(0), 67(0), 67(0), 64(0), 65(0), 64(0), 63(0), 64(0), 64(0), 63(0), 37(0), 35(0), 45(0), 38(0), 69(0), 69(0), 58(0), 73(0), 0(0), 60(0), 64(0), 67(0) -> Fans: level 5
... thinkfan[3459640]: Temperatures(bias): 59(0), 76(0), 76(0), 67(0), 69(0), 70(0), 64(0), 68(0), 68(0), 65(0), 39(0), 35(0), 45(0), 35(0), 66(0), 66(0), 57(0), 72(0), 0(0), 61(0), 65(0), 70(0) -> Fans: level 6
... thinkfan[3459640]: Temperatures(bias): 58(0), 66(0), 66(0), 63(0), 65(0), 64(0), 62(0), 63(0), 63(0), 62(0), 39(0), 35(0), 45(0), 35(0), 67(0), 67(0), 58(0), 72(0), 0(0), 61(0), 66(0), 69(0) -> Fans: level 5
... thinkfan[3459640]: Going to sleep: Will allow sensor read errors for the next 4 loops.
... thinkfan[3459640]: ERROR: A sensor has vanished! Exiting since there's no safe way of handling this.
... systemd[1]: thinkfan.service: Main process exited, code=exited, status=1/FAILURE
... systemd[1]: thinkfan.service: Failed with result 'exit-code'.

Restarting the service restores normal operation:

$ sudo systemctl restart thinkfan

…/data/oracle-repo/P111900-llvm on  microdoc [!?⇡] via ☕ v11.0.13 
$ sudo systemctl status thinkfan
* thinkfan.service - Thinkfan, the minimalist fan control program
     Loaded: loaded (/usr/local/lib/systemd/system/thinkfan.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/thinkfan.service.d
             `-default.conf, override.conf
     Active: active (running) since Fri 2021-10-29 10:35:11 EDT; 1s ago
    Process: 3461927 ExecStart=/usr/local/sbin/thinkfan $THINKFAN_ARGS (code=exited, status=0/SUCCESS)
   Main PID: 3461928 (thinkfan)
      Tasks: 1 (limit: 154168)
     Memory: 644.0K
        CPU: 11ms
     CGroup: /system.slice/thinkfan.service
             `-3461928 /usr/local/sbin/thinkfan -b0

... systemd[1]: Starting Thinkfan, the minimalist fan control program...
... thinkfan[3461927]: Daemon PID: 3461928
... systemd[1]: Started Thinkfan, the minimalist fan control program.
... thinkfan[3461928]: Temperatures(bias): 61(0), 73(0), 72(0), 68(0), 73(0), 68(0), 66(0), 68(0), 66(0), 65(0), 39(0), 37(0), 45(0), 44(0), 69(0), 69(0), 57(0), 74(0), 1(0), 62(0), 66(0), 70(0) -> Fans: level 5

Do you have an idea what could be causing this?

vmatare commented 2 years ago

@koutheir I'm currently reviewing the runtime behavior of your code. My general impression is very good, apart from some minor details and the fact that I'd like to generalize the whole thing to support libsensors fans. But I'd be more interested to know how easy it was for you to read into the architecture and come up with a design idea for this feature. You obviously did the right thing (and very quickly btw), but was there anything in the code where you went e.g. "wtf, WHY", or any moment you didn't know where to look for something?

koutheir commented 2 years ago

... and the fact that I'd like to generalize the whole thing to support libsensors fans.

I think it's a good idea.

But I'd be more interested to know how easy it was for you to read into the architecture and come up with a design idea for this feature.

It took me a few days of playing with libsensors API (with some browsing of its source code), but given that I only worked on this in my spare time (I work during the day), it spanned one or two weeks.

but was there anything in the code where you went e.g. "wtf, WHY", or any moment you didn't know where to look for something?

This happened to me when I was implementing the YAML configuration parsers. It wasn't obvious to me how the YAML library worked. The existing parsers and my IDE helped, though.

I didn't get the chance to read all thinkfan's source code, but from what I read, I didn't see anything alarming or requiring expensive redesign.

vmatare commented 2 years ago

I'm now also running libsensors on my main system, and it all looks good so far. I've not noticed any issues after waking up from suspend. Do you always get ERROR: A sensor has vanished! ... after waking up or does it happen only sporadically? I can tell you that this message means that somehow one of the Sensor::read_temps() methods must have returned less temperatures than it should have, but without throwing an exception...

koutheir commented 2 years ago

Do you always get ERROR: A sensor has vanished! ... after waking up or does it happen only sporadically?

I never saw this behavior sporadically, but I noticed it consistently after wake up from sleep.

koutheir commented 2 years ago

I added a mechanism to retry reading LM sensors for, at most, 30 times during 30 seconds, before reporting failure to the main thinkfan loop.

On my computer, this needs between 5 and 6 seconds (and retries) to successfully read LM temperature sensors after waking up from sleep.

I don't know if this approach is okay with thinkfan. If you think of another approach we could fix this, then please tell me.

vmatare commented 2 years ago

I added a mechanism to retry reading LM sensors for, at most, 30 times during 30 seconds, before reporting failure to the main thinkfan loop.

On my computer, this needs between 5 and 6 seconds (and retries) to successfully read LM temperature sensors after waking up from sleep.

I don't know if this approach is okay with thinkfan. If you think of another approach we could fix this, then please tell me.

Well now that I think about it, the best idea might be to generalize the behaviour because it could address some wakeup issues we're seeing in other scenarios as well...

koutheir commented 2 years ago

Well now that I think about it, the best idea might be to generalize the behaviour because it could address some wakeup issues we're seeing in other scenarios as well...

I didn't try to generalize the implementation because I don't have a good enough understanding of thinkfan to figure out the best place where this behavior should be implemented.

vmatare commented 2 years ago

I didn't try to generalize the implementation because I don't have a good enough understanding of thinkfan to figure out the best place where this behavior should be implemented.

I'm having to think about it, too, because we have the optional and tolerate_errors on wakeup logic to interfere with it^^

koutheir commented 2 years ago

I'm having to think about it, too, because we have the optional and tolerate_errors on wakeup logic to interfere with it

If I understand optional and tolerate_errors well, then the approach I implemented has different semantics.

Unlike optional, the sensor value must arrive at some point before the maximum time runs out, otherwise the driver reports an error.

Unlike tolerate_errors, (1) failure to read a sensor value causes a pause before the next attempt to read the value again, and (2) the multiple attempts are always enabled, irrespective of signals delivered to thinkfan.

vmatare commented 2 years ago

Do you think it's important to retry one sensor in a loop with 1 Hz? Because normally I'd go and put this logic in the regular sleep cycle. Then it would be retried on every loop (5s or whatever) up to a maximum # of attempts. Might take a few seconds longer for a slow sensor to get picked up though...

koutheir commented 2 years ago

Do you think it's important to retry one sensor in a loop with 1 Hz? Because normally I'd go and put this logic in the regular sleep cycle. Then it would be retried on every loop (5s or whatever) up to a maximum # of attempts. Might take a few seconds longer for a slow sensor to get picked up though...

It depends on what is happening while the sensor value cannot be read:

The reason I selected 1 Hz was just out of an abandon of caution. I only tested my implementation on one hardware sample (my laptop), so I had to be paranoid. If you have experience with more diverse hardware, then you probably can select a better frequency.

For the case (a), it really doesn't matter if thinkfan picks up the value even 10 seconds later. It's less convenient, but it ultimately doesn't matter.

For the case (b), what matters is that (1) the delay must not be too long, and (2) the delay must be time-bounded. I realize that (2) is by definition real-time behavior, and I know Linux isn't a real-time kernel, but getting close enough should be okay.

koutheir commented 2 years ago

@vmatare, are you waiting for a review from me before merging this PR?

vmatare commented 2 years ago

@vmatare, are you waiting for a review from me before merging this PR?

Sorry, no, I was busy with other stuff, so my whole work on the generalized retry logic got dragged out a bit. But I'm finishing it up as we speak. I'm about to give it a few tests and some review before pushing it, but once it's up I can definitely use some review and help with testing.

vmatare commented 2 years ago

So... after playing around with #177 for a bit, I think I'd like to merge this PR, but without the libsensors-specific retry logic you put in. I'm in the process of implementing it "completely" in #177, but it's still not finished. As you can tell if you take a look at https://github.com/vmatare/thinkfan/pull/177/commits/0bfb2b903b3945b92fad5a38953a04a73aacf941, to really make it work nicely I had to redesign the way temperature reading and sensor initialization works. What's still missing is some refactoring in the way the config is read, because that also relies on all fans/sensors being ready at startup.

The reason I'm going this long route is because I wanna do it properly or not at all. The thing is, if a user marks a sensor (or a fan in fact) as optional (or sets some max-errors > 0), they expect thinkfan to function 100% without it, and thinkfan must be able pick it up transparently at any time, even if it's not present at startup.

So if you're fine with that, I'd cut the retry logic out of the lm-sensors branch and force push it. You could then formally approve it and we'd finally have it merged. Hopefully I'll be able to finish #177 soon ;-)

koutheir commented 2 years ago

I think I'd like to merge this PR, but without the libsensors-specific retry logic you put in.

The retry logic is necessary to make thinkfan useful on my hardware, in the first place.

The reason I'm going this long route is because I wanna do it properly or not at all. The thing is, if a user marks a sensor (or a fan in fact) as optional (or sets some max-errors > 0), they expect thinkfan to function 100% without it, and thinkfan must be able pick it up transparently at any time, even if it's not present at startup.

I agree, but this kind of fault tolerance requires meticulous logic.

So if you're fine with that, I'd cut the retry logic out of the lm-sensors branch and force push it.

As long as no other retry logic is implemented, I disagree with this removal. I suggest delaying this merging, until another form of retry logic is implemented.

koutheir commented 2 years ago

@vmatare, did you get a chance to implement a retry logic?

vmatare commented 2 years ago

Oh yeah, I just merged it into the master. The basic idea is in https://github.com/vmatare/thinkfan/commit/0bfb2b903b3945b92fad5a38953a04a73aacf941 and the following commits mostly round it out. Do give it a try if you find the time. You'll have to set either optional = true or max_errors = X on the sensors you want to be retried. Documentation is still missing, but I hope to get around to it this weekend.