hilbert / hilbert-docker-images

Application with a dynamic choice of docker containers to run
Apache License 2.0

MKLivestatus reports critical error if there is no top app #22

Open elondaits opened 8 years ago

elondaits commented 8 years ago

Currently my MKLivestatus query is failing because of the report I get back when I send this query:

GET services
Columns: host_name state state_type plugin_output
Filter: description = dockapp_top1
OutputFormat: json
ColumnHeaders: on

The response is:

[["host_name","state","state_type","plugin_output"],
["supernova.mfo.de",2,1,"CRIT - CRITICAL - no running TOP app!"],
["vb-hb-test-1",0,1,"OK - TOP: hb_test_a@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":18299,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:20:13.433016885Z\",\"Status\":\"running\"}]"],
["vb-hb-test-2",0,1,"OK - TOP: hb_test_b@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":16964,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:20:05.60906792Z\",\"Status\":\"running\"}]"],
["vb-hb-test-3",0,1,"OK - TOP: hb_test_c@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":2367,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:21:01.460639047Z\",\"Status\":\"running\"}]"],
["vb-hb-test-4",0,1,"OK - TOP: hb_test_b@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":2478,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:21:28.003647053Z\",\"Status\":\"running\"}]"],
["vb-hb-test-5",0,1,"OK - TOP: hb_test_c@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":2865,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:22:43.6794875Z\",\"Status\":\"running\"}]"],
["vb-hb-test-6",0,1,"OK - TOP: hb_test_a@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":2890,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:32:12.044423391Z\",\"Status\":\"running\"}]"],
["vb-hb-test-7",0,1,"OK - TOP: hb_test_c@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":3274,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:22:59.103161861Z\",\"Status\":\"running\"}]"],
["vb-hb-test-8",0,1,"OK - TOP: hb_test_a@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":3593,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:23:09.743623052Z\",\"Status\":\"running\"}]"],
["vb-hb-test-9",0,1,"OK - TOP: hb_test_b@[{\"Dead\":false,\"Error\":\"\",\"ExitCode\":0,\"FinishedAt\":\"0001-01-01T00:00:00Z\",\"OOMKilled\":false,\"Paused\":false,\"Pid\":21852,\"Restarting\":false,\"Running\":true,\"StartedAt\":\"2016-08-26T23:23:28.995734875Z\",\"Status\":\"running\"}]"]]
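
Because the query sets `ColumnHeaders: on`, the first row of the response names the columns and every later row holds the values in the same order. A minimal sketch of turning such a response into objects keyed by column name follows; `rowsToObjects` is a hypothetical helper, not part of MKLivestatus or dockapp, and the sample is shortened to a single row.

```javascript
// Hypothetical helper: convert a Livestatus JSON response requested with
// "ColumnHeaders: on" into an array of objects keyed by column name.
function rowsToObjects(response) {
  if (!Array.isArray(response) || response.length === 0) return [];
  const [header, ...rows] = response;
  return rows.map((row) =>
    Object.fromEntries(header.map((name, i) => [name, row[i]]))
  );
}

// Shortened sample taken from the response above.
const sample = [
  ['host_name', 'state', 'state_type', 'plugin_output'],
  ['supernova.mfo.de', 2, 1, 'CRIT - CRITICAL - no running TOP app!'],
];

const services = rowsToObjects(sample);
console.log(services[0].host_name, services[0].state);
```

Working with named fields instead of array positions keeps the code readable if the `Columns:` line of the query is ever reordered.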

Notice the first row returned:

["supernova.mfo.de",2,1,"CRIT - CRITICAL - no running TOP app!"],

I consider "no top app running" a "normal(ish)" situation, not a critical error. It would be better if the standard response accounted for the case where no app is running, so that I can treat genuine critical errors as such and log them.

PS: I'm parsing the response with a regexp, so we need a well-defined (ideally documented) response format. Otherwise the regexp will break whenever something I've never seen before appears, as in this case.

malex984 commented 8 years ago

Well, I think a missing top GUI application must be a critical error on a station.

On a server, I think we should monitor the OMD server and your Node.js back-end via heartbeat.

How do you get the host status? I will look into adding more Business Intelligence (see https://mathias-kettner.de/checkmk_bi.html) to reflect the framework health better...

elondaits commented 8 years ago

It's not a critical error from the viewpoint of MKLivestatus, so it shouldn't report it as such. MKLivestatus doesn't know about GUI applications, exhibits, etc... it's further down the path, in dockapp, where there's a configuration that states "what things should look like", that the status reported by MKLivestatus can be interpreted as an error. Case in point: Right now it's reporting a critical error that is not a critical error so I had to add a special case to my code saying "this is not a critical error, don't log it or throw an exception, and instead report no app is running"...

... If no top app is running on a station, that's what I should report: "there's no app running here"... not shut down the system because there's a critical error.

elondaits commented 8 years ago

PS: I get host info through

GET hosts
Columns: name state state_type
OutputFormat: json
ColumnHeaders: on
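
Both queries in this thread have the same shape: a `GET` line, optional `Columns:` and `Filter:` headers, then the output options, with a blank line ending the query. A small sketch of a builder for that text is below; `buildQuery` and its option names are hypothetical, and sending the string to the Livestatus socket is left out.

```javascript
// Hypothetical helper: build the text of a Livestatus query.
// A Livestatus query is terminated by an empty line, hence the trailing "\n\n".
function buildQuery(table, { columns = [], filters = [] } = {}) {
  const lines = [`GET ${table}`];
  if (columns.length > 0) lines.push(`Columns: ${columns.join(' ')}`);
  for (const filter of filters) lines.push(`Filter: ${filter}`);
  lines.push('OutputFormat: json', 'ColumnHeaders: on');
  return lines.join('\n') + '\n\n';
}

// The host query shown above.
const hostsQuery = buildQuery('hosts', {
  columns: ['name', 'state', 'state_type'],
});

// The top-app service query from the first comment.
const topAppQuery = buildQuery('services', {
  columns: ['host_name', 'state', 'state_type', 'plugin_output'],
  filters: ['description = dockapp_top1'],
});

console.log(hostsQuery);
```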

elondaits commented 8 years ago

To make it clearer: My current query to MKLivestatus means "Tell me which top apps are running on the stations and their status"... so "no app" is a valid answer that I know how to handle. It's better for me to have an answer I can parse and interpret instead of a "critical error" with a human-readable message that could be anything.

It's as if you were querying a database with valid SQL and instead of getting an empty recordset because there are no matching records you got a critical error. Sometimes an empty result is a valid result.

malex984 commented 8 years ago

The idea is that OMD knows about the dockapp-specific checks and is supposed to derive the host status on its own. The current logic is that there must be a single running GUI application on each station.

Note that supernova is currently commented out in STATIONS/list, so it should not be visible in your UI (at least by default). There may also be other agents monitored by OMD that are not in the list. But all listed stations (excluding the commented-out ones) must be monitored via OMD under the same host name (== station_id from their corresponding station.cfg).

elondaits commented 8 years ago

That assumes:

But AT LEAST I ask:

Give me an error code for "no top app" instead of a generic critical error. I can parse and interpret an error code. Right now I have a line in my code that reads:

    if (station.app_id === 'CRIT - CRITICAL - no running TOP app!') {
      // handle the case
    } 

and that's unacceptable for something new (not legacy) that we're developing ourselves. It makes no sense that I'm parsing an arbitrary human-readable string to detect a condition I can understand, work around, and report to the user. Without it, all I could say is "There's something wrong... either it's on fire or no top app is running".

malex984 commented 6 years ago

I am pretty sure that the critical state of a service is indicated via service check state... But it should also be possible to get the plugin exit codes or check the beginning of plugin output message for the following prefixes:

0 OK
1 WARN
2 CRIT
3 UNKN
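
On the consuming side, the state codes listed above can be mapped to their prefixes with a small lookup. This is only a sketch: `STATE_NAMES` and `stateName` are hypothetical names, and the fallback string for unexpected codes is an assumption.

```javascript
// State codes and prefixes as listed above.
const STATE_NAMES = new Map([
  [0, 'OK'],
  [1, 'WARN'],
  [2, 'CRIT'],
  [3, 'UNKN'],
]);

// Hypothetical helper: translate a numeric service state into its prefix.
// Codes outside 0..3 are passed through in a marked form rather than guessed.
function stateName(code) {
  return STATE_NAMES.get(code) ?? `UNEXPECTED(${code})`;
}

console.log(stateName(2));
```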

dockapp_top1 (or hilbert_top1) can possibly have the following outputs:

OK - TOP: ....
WARNING - \d+ TOPs: ...
CRITICAL - no running TOP app!
CRITICAL - cannot determine TOP app's info
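
With the four possible outputs written down, a parser can match them explicitly and flag anything else instead of silently misreading it. The sketch below assumes the short state prefix (for example `CRIT - `) may precede the long form, as in the response quoted earlier in this thread. `parseTopAppOutput` and the `kind` values are hypothetical names, and the text after `TOP:` or `TOPs:` is kept as an opaque string because its format is not specified here.

```javascript
// Hypothetical parser for the dockapp_top1 plugin output strings listed above.
function parseTopAppOutput(output) {
  // Drop a short state prefix such as "CRIT - " only when the long form follows it.
  const text = output.replace(
    /^(?:OK|WARN|CRIT|UNKN) - (?=(?:OK|WARNING|CRITICAL|UNKNOWN) - )/,
    ''
  );

  let m = /^OK - TOP: ([^@\s]+)@([\s\S]*)$/.exec(text);
  if (m) return { kind: 'running', appId: m[1], detail: m[2] };

  m = /^WARNING - (\d+) TOPs: ([\s\S]*)$/.exec(text);
  if (m) return { kind: 'multiple', count: Number(m[1]), detail: m[2] };

  if (text === 'CRITICAL - no running TOP app!') return { kind: 'none' };
  if (text === "CRITICAL - cannot determine TOP app's info") {
    return { kind: 'undetermined' };
  }

  // Anything not listed above is reported as such, with the raw text preserved.
  return { kind: 'unrecognized', raw: output };
}

console.log(parseTopAppOutput('CRIT - CRITICAL - no running TOP app!').kind);
```

Returning an explicit `unrecognized` result keeps a new or changed message from being mistaken for one of the known cases.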

elondaits commented 6 years ago

Here's a TL;DR for you:

The system starts and stops stations, so at some points in its normal operation stations won't have any top app running. This should not be reported as a critical error, because it's normal.

The state is reported through an ID, a prefix, and a human-readable message. Currently I need to parse the third to know that I must ignore the "critical" error. That's bad.