sni / lmd

Livestatus Multitool Daemon - Create livestatus federation from multiple sources
https://labs.consol.de/omd/packages/lmd/
GNU General Public License v3.0

Invalid request 'Filter: is_executing = 1' #118

Closed by adoom42 2 years ago

adoom42 commented 2 years ago

I installed Thruk & LMD a couple years ago and it has been working very well. Recently I upgraded Thruk to use the new API support and that went without a hitch. I encountered some problems when upgrading LMD.

For background, all servers run RHEL 7 or CentOS 7 with recent patches. The Thruk server connects to multiple Nagios servers at remote sites using stunnel (set up following https://www.thruk.org/documentation/install.html#_tls-livestatus). All Nagios instances are 4.4.6 and use check-mk-livestatus-1.4.0p31 (RPMs from the EPEL repo).

Before upgrading Thruk & LMD:

After upgrading Thruk & before upgrading LMD:

I thought the new version of Thruk might require a newer version of LMD, which is part of what led me to attempt upgrading LMD.

After upgrading LMD:

So it seems that upgrading LMD resolved the not implemented op: 7 problem, but introduced a new Filter: is_executing problem. I captured some livestatus queries that Thruk sends to Nagios and they run fine when executing them manually. I'm stumped as to why they all succeed when run by hand but sometimes fail when run by Thruk. The stunnel connections are fine (they use the exact same settings from the Thruk documentation).
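For anyone who wants to repeat the by-hand test, here is a minimal sketch of replaying a captured query against a Livestatus socket. It assumes a local Unix socket; the path in the comment is hypothetical, and `replay_query` is an illustrative helper, not part of Thruk or LMD.

```python
import socket


def replay_query(sock: socket.socket, query: str) -> str:
    """Send one captured Livestatus query on a connected socket and return the raw reply."""
    # Make sure the last header line is terminated by a newline.
    payload = query if query.endswith("\n") else query + "\n"
    sock.sendall(payload.encode("utf-8"))
    # Closing the write side tells the server that the request is complete.
    sock.shutdown(socket.SHUT_WR)
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # server closed the connection: reply is complete
            break
        chunks.append(data)
    return b"".join(chunks).decode("utf-8")


# Example use (hypothetical socket path, adjust to the local setup):
#
#   with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
#       s.connect("/var/spool/nagios/cmd/live")
#       print(replay_query(s, "GET status\nColumns: program_version\n"))
```

A replay like this sends a single request per connection, whereas the captured Thruk/LMD queries use KeepAlive: on, so a query that succeeds by hand is not necessarily byte-for-byte what the daemon sends.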

As an exercise in trial & error, I tried all versions of LMD that shipped with major OMD releases between 2.7.0 and 4.4.0. It's pretty clear that the errors are tied to the versions since they come & go cleanly when the software is upgraded or downgraded.

| Thruk Version | LMD Version | OMD Release | lmd.log | livestatus.log |
|---|---|---|---|---|
| 2.2.0 | 1.3.0 | 2.7.0 | No errors | No errors |
| 2.46 | 1.8.2 | 3.30 | No errors | No errors |
| 2.46 | 1.9.2 | 3.40 | Broken | Broken |
| 2.46 | 2.0.1 | 4.20 | No errors | Invalid request method |
| 2.46 | 2.0.3 | 4.40 | No errors | Invalid request method |

Note that LMD 1.9.2 didn't work at all. The stunnel connections showed peer is down: tls: server selected unsupported protocol version 301 errors. Upgrading or downgrading fixed that problem; only 1.9.2 was affected.
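One possible reading of that error, assuming the number in the message is the hexadecimal TLS wire version (this is not confirmed anywhere in this thread): 301 would be 0x0301, which is TLS 1.0. That would suggest the 1.9.2 build refused the protocol version the remote stunnel selected. The small sketch below only decodes the code; `describe_tls_version` is an illustrative helper.

```python
# Protocol version numbers as they appear on the wire in TLS/SSL handshakes.
TLS_VERSIONS = {
    0x0300: "SSL 3.0",
    0x0301: "TLS 1.0",
    0x0302: "TLS 1.1",
    0x0303: "TLS 1.2",
    0x0304: "TLS 1.3",
}


def describe_tls_version(code: str) -> str:
    """Interpret a version code from an error message, ASSUMING it is printed in hex."""
    value = int(code, 16)
    return TLS_VERSIONS.get(value, f"unknown (0x{value:04x})")


print(describe_tls_version("301"))
```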

My guess is that the "Invalid request method" problem noted in livestatus.log started with LMD 2.x, since that was a major version bump.

I looked into upgrading the check-mk-livestatus Nagios module but it appears that isn't distributed independently anymore (only available as source code or bundled with the main checkmk release). I also tried replacing check-mk-livestatus with naemon-livestatus, but Nagios wouldn't load the module (nagios[117094]: Error: Module '/usr/lib64/naemon/naemon-livestatus/livestatus.so' is using an old or unspecified version of the event broker API. Module will be unloaded.). While I could probably replace the whole Nagios app with Naemon, that would be a lot of work which I'd rather not tackle at this time.

Do you know what's going on or what other steps can be taken to gather more info? I'd like to get LMD 2.0.3 working with my Nagios 4.4.6 instances.

Thanks.

sni commented 2 years ago

Great investigation, thanks for that. A few remarks.

The error you see occurs because a newline somehow slipped into the query when updating services, e.g.:

GET services
ResponseHeader: fixed16
OutputFormat: json
Columns: host_name description...
KeepAlive: on
Filter: last_check >= 1637073927
Filter: last_check < 1637073937
And: 2

Filter: is_executing = 1
Or: 2

I'll see where this comes from.
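To illustrate why the stray newline matters: in the Livestatus protocol an empty line ends a request, so with KeepAlive: on everything after it is read as a second request that starts with a Filter: header instead of a method. The following is a rough sketch of that behaviour, not the actual Livestatus or LMD parser; `split_requests`, `check_request`, the accepted method list, and the error text (modelled on this issue's title) are assumptions for illustration.

```python
from typing import List

# Assumption for this sketch: request lines must start with one of these methods.
VALID_METHODS = ("GET", "COMMAND")

# The query from above, with the column list shortened.
STREAM = """GET services
ResponseHeader: fixed16
OutputFormat: json
Columns: host_name description
KeepAlive: on
Filter: last_check >= 1637073927
Filter: last_check < 1637073937
And: 2

Filter: is_executing = 1
Or: 2

"""


def split_requests(stream: str) -> List[List[str]]:
    """Split a keep-alive stream into requests; a blank line terminates a request."""
    requests, current = [], []
    for line in stream.splitlines():
        if line.strip() == "":
            if current:
                requests.append(current)
                current = []
        else:
            current.append(line)
    if current:
        requests.append(current)
    return requests


def check_request(lines: List[str]) -> str:
    """Return "ok" if the first line starts with a known method, else an error string."""
    first = lines[0]
    method = first.split(" ", 1)[0]
    if method not in VALID_METHODS:
        return f"Invalid request '{first}'"
    return "ok"


for request in split_requests(STREAM):
    print(check_request(request))
```

In this sketch the first chunk is accepted as a GET request, while the second chunk begins with Filter: is_executing = 1 and is rejected, which matches the error in the title of this issue.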

sni commented 2 years ago

Could you try the latest LMD? (It could also be extracted from tomorrow's OMD nightly build.)

adoom42 commented 2 years ago

I grabbed the lmd binary from omd-4.41.20211118-labs-edition-rhel7.x86_64.rpm and it worked without any errors. Thanks for the super-quick fix.