sni / lmd

Livestatus Multitool Daemon - Create livestatus federation from multiple sources
https://labs.consol.de/omd/packages/lmd/
GNU General Public License v3.0

Incremental corruption #111

Closed alexstaz closed 3 years ago

alexstaz commented 3 years ago

Hello,

We use LMD to query 1 Nagios and 6 Icinga 2 systems. It seems that the incremental status update corrupts the internal cache: some services have their status and all other information (plugin_output, status, etc.) overwritten with data from another service. For example: [host_name,display_name,plugin_output] ["xxx-yyy-bo02","HYCU Last Backup","Updates: 3 critical, 2 optional"]

If we wait some time (until the next full update) or restart, we get the correct information: ["xxx-yyy-bo02","HYCU Last Backup","Status=OK, Compliancy=GREEN, Date=2021-02-25T07:04:05.355000"]
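One way to spot this kind of mismatch is to compare the rows served from the cache with rows fetched fresh from the backend. The following is only a minimal sketch under stated assumptions: the helper names `build_query` and `find_mismatches` are hypothetical, the rows use the same three columns as the example above, and services are keyed on (host_name, display_name) as in this report (Livestatus usually identifies a service by host_name and description). Sending the query to LMD and to the backend is left out.

```python
def build_query(columns):
    """Build a Livestatus query on the services table with JSON output."""
    return "GET services\nColumns: " + " ".join(columns) + "\nOutputFormat: json\n\n"


def find_mismatches(cached_rows, fresh_rows):
    """Return services whose plugin_output differs between two result sets.

    Each row is [host_name, display_name, plugin_output].
    Assumption: (host_name, display_name) identifies a service.
    """
    fresh = {(host, name): output for host, name, output in fresh_rows}
    mismatches = []
    for host, name, output in cached_rows:
        expected = fresh.get((host, name))
        # Services missing from the fresh result are skipped, not reported.
        if expected is not None and expected != output:
            mismatches.append((host, name, output, expected))
    return mismatches


# Rows taken from the report: what the cache returned vs. the correct value.
cached = [["xxx-yyy-bo02", "HYCU Last Backup", "Updates: 3 critical, 2 optional"]]
fresh = [["xxx-yyy-bo02", "HYCU Last Backup",
          "Status=OK, Compliancy=GREEN, Date=2021-02-25T07:04:05.355000"]]

for host, name, got, expected in find_mismatches(cached, fresh):
    print(f"{host} / {name}: cache has {got!r}, backend has {expected!r}")
```

Because check results change between two queries, a single difference is not proof of corruption; a plugin_output that belongs to a different check, as above, is the stronger signal.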

It happens more often with systems that are slow to respond (we have around 250 ms of latency between the LMD server and Icinga). We have never seen it on the Nagios system. Also, the system where we see the problem most often is set up for high availability (2 Icinga connections declared in LMD).

I tried to look at the code, but it is quite difficult for me. If you need more information, please tell me.

Thanks a lot.

sni commented 3 years ago

That is a known issue with Icinga 2, but it should have been fixed in the meantime. Are you using the latest version?

sni commented 3 years ago

Sounds like the issue solved in e0e8ea1e2d651d6ff22d4b9ec29426d0a8865575

alexstaz commented 3 years ago

I have tested with LMD 1.9.0, 1.9.5 and HEAD. The Icinga release is 2.12.3. All peers have the Icinga2 flag (flags = ['Icinga2']). Any ideas? I have enabled debug logging, but it doesn't help much.

sni commented 3 years ago

No idea so far. I mean, there are known issues with Icinga 2, but there should be workarounds in place, as already mentioned. I would have to set up a test environment to see if I am able to reproduce that somehow.

alexstaz commented 3 years ago

Hello @sni, we did more testing, and if we deactivate the full update, everything is OK. So I think it's not the incremental update that is broken, but the full one. We changed FullUpdateInterval = 600 to FullUpdateInterval = 0. I hope this helps.

sni commented 3 years ago

Thanks, that surely helps to identify the root cause.

sni commented 3 years ago

please try again, should be better now

alexstaz commented 3 years ago

Thanks a lot. We have been using it for 24 hours and everything seems fine now.