Closed steffen-poulsen closed 3 years ago
Are you running those tests with the latest Thruk release? There might have been some changes related to this. Waiting longer for LMD to respond should be pointless, or at least it simply covers up the real issue. To determine LMD's health status, Thruk sends a small request to the sites table. This table does not block, nor does it involve any backend locks, so it really should answer immediately, within milliseconds, no matter what. If that does not work, Thruk sends a USR1 signal to make LMD write a thread dump into its log, to see whether there might be race conditions or anything similar. In your case I only see the SIGINT but not the SIGUSR1, which is strange.
It looks like this is not a blazing fast restart, but just another LMD starting while the first one is still running. So there are simply two processes writing into the same logfile. On the other hand, it is not possible to listen twice on the same socket, so it might be a kill -9 followed immediately by the start of the next process.
Anyway, I think the real question is why the health check fails when it seems like everything is fine.
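To reproduce the health check by hand, something like the following could be run next to Thruk. It is only a rough sketch, not Thruk's actual implementation: it assumes LMD listens on a Unix socket and answers a plain Livestatus `GET sites` query with JSON. The function names, the socket path and the timeout are all made up for illustration.

```python
import json
import os
import signal
import socket


def build_query(table="sites"):
    """Build a minimal Livestatus query asking for JSON output.

    The blank line at the end terminates the request.
    """
    return f"GET {table}\nOutputFormat: json\n\n".encode()


def check_lmd_health(socket_path, timeout=2.0):
    """Return (ok, detail).

    ok is True only if something answered on socket_path with valid JSON
    before the timeout. socket_path is an assumption: use whatever path
    your LMD is configured to listen on.
    """
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(socket_path)
            sock.sendall(build_query())
            chunks = []
            while True:
                chunk = sock.recv(65536)
                if not chunk:  # peer closed the connection: response complete
                    break
                chunks.append(chunk)
        rows = json.loads(b"".join(chunks).decode())
        return True, f"{len(rows)} rows"
    except (OSError, ValueError) as err:
        # OSError covers connect errors and timeouts,
        # ValueError covers undecodable or invalid JSON.
        return False, str(err)


def request_thread_dump(pid):
    """Send SIGUSR1 to the given pid (the signal described above)."""
    os.kill(pid, signal.SIGUSR1)
```

Calling `check_lmd_health("/path/to/lmd.sock")` in a loop while the problem occurs would show whether the sites query really stalls or whether the failure happens somewhere else.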
Thanks for the input, it makes sense.
And, yeah, this is Thruk 2.32-3, so not the newest version. We will upgrade asap and test again.
So far there appears to be a correlation between the load coming in from Thruk and the frequency of these restarts: at nighttime, and now that it is the weekend, everything appears to be stable.
Maybe I can generate some load on our test env and see if I can provoke this error there and try to understand it a bit better.
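For the load generation, a small harness along these lines might be enough. It is a sketch under assumptions: `probe` stands for any callable that performs one request against the test environment and returns True on success, and the worker and request counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(probe, workers=8, requests=100):
    """Call probe() `requests` times, spread across `workers` threads.

    probe is a hypothetical callable returning True on success and
    False on failure. An exception raised by probe counts as a failure.
    Returns (failures, latencies_in_seconds).
    """
    def timed_call(_):
        start = time.monotonic()
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        return ok, time.monotonic() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed_call, range(requests)))

    failures = sum(1 for ok, _ in results if not ok)
    latencies = [elapsed for _, elapsed in results]
    return failures, latencies
```

Raising `workers` step by step while watching the LMD log should show whether the restarts really track the request rate.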
Having the #108 issue explained (thanks! :-)), next up for us is another similar issue, where the LMD process appears to just vanish without a trace.
It is almost like the LMD process is being hard killed and then restarted by Thruk - except, no trace of this happening is left in the thruk.log file.
The pattern in the log file goes like this:
The last message being the ordinary LMD startup message.
The startup message appears just a split second after the latest entry of ordinary activity, so the LMD process appears to be restarted blazingly fast.
Any ideas what might be the cause of this pattern?
Looking at this from the Thruk side of things, it doesn't look like Thruk is the cause of this.
The log doesn't mention that Thruk is killing the LMD process, or even restarting it.
The Thruk log in more detail.
From losing connection to LMD, to the LMD having been restarted, to all backends up again.