openv / vcontrold

:fire: vcontrold Daemon for control and logging of Viessmann® type heating devices
https://github.com/openv/openv/wiki
GNU General Public License v3.0
103 stars 58 forks source link

vcontrold seems to crash if it receives a signal at the same time a request is processed #141

Open l3u opened 4 months ago

l3u commented 4 months ago

Hi all,

I'm not completely sure what causes this yet. I changed two things recently: I switched the machine vcontrold runs on from Devuan to Alpine, which is built upon musl and not glibc. However, I'm pretty sure this does not cause the issue, because:

The other thing I changed was: I noticed the vcontrold log to grow big, as it ran for years and I never cared about the log file. So I set up a logrotate script. I thought it was possibly a good idea to restart the daemon via a postrotate script.

What happened was: As of today, for the second time now, vcontrold crashed. I run it since 2015 and I never-ever experienced a single crash. First time, I thought this could be due to Alpine/musl, restarted the daemon and moved on. But today, I has a closer look. The crash happened at the very time the logrotate script ran, after the script would stop the daemon. The log contained "Received SIGTERM" as the last entry. Then, the init system (OpenRC) would not restart the daemon anymore due to it being crashed.

And the time this happened could be the very point in time where my other automated scripts polled the daemon for data. Which possibly works in most of cases, because it's not the very second the request is processed, but those two times it crashed, it may have been really simultaneously.

I thus concluded that maybe, the daemon crashes if some request is processed, and whilst this happens, the daemon receives a signal. Could this be the case? I have write access here and back then, I revised the build system – but I'm not a C pro and I'm not really into such deep stuff like daemon programming. So I fear I can't say much about the code … It would be nice if someone with more insight of this could have a look at the sources and check if we possibly have a bug here.

Apart from that, I now try to simply copytruncate the log and not restart the daemon via logrotate to see if this works – after all, there seems to be no real reason to restart it in the first place. But If we actually have a problem with concurrency, I think we should fix it. And I know that things that happen at the same time through different means can cause a lot of trouble ;-)

Thanks a lot!

Cheers, Tobias

l3u commented 4 months ago

Additional info:

The log when the daemon crashed looks like this:

[32153] Wed Jul 10 02:00:04 2024 : Client connected 127.0.0.1:45842 (FD:1)
[32153] Wed Jul 10 02:00:04 2024 : Received SIGTERM
[1072] Wed Jul 10 02:00:04 2024 : Received SIGTERM

Then I started it manually again, and it continues with:

[1404] Wed Jul 10 07:11:02 2024 : TCP socket 0.0.0.0:3002 opened
[1404] Wed Jul 10 07:15:03 2024 : Client connected 127.0.0.1:48592 (FD:1)
[1404] Wed Jul 10 07:20:03 2024 : Client connected 127.0.0.1:39668 (FD:1)
...

So apparently, it got a connection, then SIGTERM (by the logrotate script), and then, it logged another SIGTERM with already another PID, which should be the actual crash. All in the very same second.

The other logs with no crash look like this:

[30854] Tue Jul  9 02:00:01 2024 : Received SIGTERM
[32153] Tue Jul  9 02:00:04 2024 : TCP socket 0.0.0.0:3002 opened
[32153] Tue Jul  9 02:05:03 2024 : Client connected 127.0.0.1:50344 (FD:1)
[32153] Tue Jul  9 02:10:03 2024 : Client connected 127.0.0.1:58136 (FD:1)
...

and a day before:

[29536] Mon Jul  8 02:00:02 2024 : Received SIGTERM
[30854] Mon Jul  8 02:00:05 2024 : TCP socket 0.0.0.0:3002 opened
[30854] Mon Jul  8 02:05:03 2024 : Client connected 127.0.0.1:35112 (FD:1)
[30854] Mon Jul  8 02:10:03 2024 : Client connected 127.0.0.1:40382 (FD:1)
...

So they start with the "Received SIGTERM" – and not with a client connection.

I hope this helps …