Open elondaits opened 8 years ago
Well, I think a missing top GUI application must be a critical error on a station.
On a server, I think we should monitor the OMD server and your nodeJS back-end via heartbeat.
How do you get the host status? I will look into adding more Business Intelligence (see https://mathias-kettner.de/checkmk_bi.html) to reflect the framework health better...
It's not a critical error from the viewpoint of MKLivestatus, so it shouldn't report it as such. MKLivestatus doesn't know about GUI applications, exhibits, etc. Only further down the path, in dockapp, where there's a configuration that states "what things should look like", can the status reported by MKLivestatus be interpreted as an error. Case in point: right now it's reporting a critical error that is not a critical error, so I had to add a special case to my code saying "this is not a critical error, don't log it or throw an exception, and instead report no app is running"...
... If no top app is running in a station, that's what I should report: "there's no app running here"... not shut down the system because there's a critical error.
PS: I get host info through
GET hosts
Columns: name state state_type
OutputFormat: json
ColumnHeaders: on
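For reference, here is a minimal sketch of how a query like the one above could be assembled and its JSON reply turned into objects. It assumes that with `ColumnHeaders: on` the first row of the JSON array holds the column names and that a query is terminated by a blank line. The function names `buildHostsQuery` and `rowsToObjects` and the sample station names are hypothetical, and the socket transport is left out.

```javascript
// Build the Livestatus query text: one header per line, then a blank line.
function buildHostsQuery(columns) {
  return [
    'GET hosts',
    'Columns: ' + columns.join(' '),
    'OutputFormat: json',
    'ColumnHeaders: on',
  ].join('\n') + '\n\n';
}

// Convert the JSON reply (array of rows, first row = column names)
// into an array of plain objects keyed by column name.
// An empty reply is treated as a valid, empty result.
function rowsToObjects(jsonText) {
  const rows = JSON.parse(jsonText);
  if (!Array.isArray(rows) || rows.length === 0) return [];
  const [header, ...data] = rows;
  return data.map((row) =>
    Object.fromEntries(header.map((name, i) => [name, row[i]]))
  );
}

// Hypothetical sample reply, for illustration only.
const sample =
  '[["name","state","state_type"],["station_a",0,1],["station_b",1,1]]';
console.log(rowsToObjects(sample));
```

Keeping the row conversion separate from the transport means an empty host list simply yields an empty array rather than an error.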
To make it clearer: My current query to MKLivestatus means "Tell me which top apps are running in the stations and their status"... so "no app" is a valid answer that I know how to handle. It's better for me to have an answer I can parse and interpret instead of a "critical error" with a human-readable message that could be anything.
It's as if you were querying a database with valid SQL and instead of getting an empty recordset because there are no matching records you got a critical error. Sometimes an empty result is a valid result.
The idea is that OMD knows about the dockapp-specific checks and is supposed to derive the host status on its own. Current logic is that there must be a single running GUI application on each station.
Note that supernova is currently commented out in the STATIONS/list - it should not be visible via your UI (at least by default). Also, there may be other monitored agents at OMD (not listed in the list). But all the listed stations (excluding the commented-out ones) must be monitored via OMD under the same host name (== station_id from their corresponding station.cfg).
That assumes:
But AT LEAST I ask:
Give me an error code for "no top app" instead of a generic critical error. I can parse and interpret an error code. Right now I have a line in my code that reads:
if (station.app_id === 'CRIT - CRITICAL - no running TOP app!') {
// handle the case
}
and that's unacceptable for something new (not legacy) we're developing ourselves. It makes no sense that I'm parsing an arbitrary human-readable string to handle an error condition I can understand, work around, and inform the user of, instead of having to say "There's something wrong... either it's on fire or no top app is running".
I am pretty sure that the critical state of a service is indicated via service check state... But it should also be possible to get the plugin exit codes or check the beginning of plugin output message for the following prefixes:
0 OK
1 WARN
2 CRIT
3 UNKN
dockapp_top1 (or hilbert_top1) can possibly have the following outputs:
OK - TOP: ....
WARNING - \d+ TOPs: ...
CRITICAL - no running TOP app!
CRITICAL - cannot determine TOP app's info
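To illustrate how those prefixes and outputs could be mapped to something machine-readable on the client side, here is a rough sketch. The code names (`NO_TOP_APP`, `MULTIPLE_TOPS`, etc.) and the function `parseTopOutput` are hypothetical and not part of OMD or the check plugin. The optional short prefix (`CRIT - `) matches the string seen in `station.app_id` above, and the `UNKNOWN - ` long form is an assumption. Anything that doesn't match falls back to `UNRECOGNIZED` instead of breaking the parser.

```javascript
// Numeric plugin states, as listed above.
const STATES = { OK: 0, WARNING: 1, CRITICAL: 2, UNKNOWN: 3 };

// Parse a dockapp_top1 / hilbert_top1 output line into
// { state, code, detail }. All `code` values are hypothetical names.
function parseTopOutput(output) {
  const text = String(output).trim();
  // Optional short prefix ("CRIT - "), then the long level and the message.
  const m = /^(?:(?:OK|WARN|CRIT|UNKN) - )?(OK|WARNING|CRITICAL|UNKNOWN) - (.*)$/.exec(text);
  if (!m) {
    // Never seen before: report it as unknown rather than throwing.
    return { state: STATES.UNKNOWN, code: 'UNRECOGNIZED', detail: text };
  }
  const level = m[1];
  const detail = m[2];
  let code = 'UNRECOGNIZED';
  if (level === 'OK' && detail.startsWith('TOP:')) {
    code = 'TOP_RUNNING';
  } else if (level === 'WARNING' && /^\d+ TOPs:/.test(detail)) {
    code = 'MULTIPLE_TOPS';
  } else if (level === 'CRITICAL' && detail === 'no running TOP app!') {
    code = 'NO_TOP_APP';
  } else if (level === 'CRITICAL' && detail.startsWith("cannot determine TOP app's info")) {
    code = 'TOP_INFO_UNAVAILABLE';
  }
  return { state: STATES[level], code, detail };
}

console.log(parseTopOutput('CRIT - CRITICAL - no running TOP app!'));
```

With something like this, the caller can branch on `code === 'NO_TOP_APP'` in one place. It is still string matching underneath, so it only works as long as the plugin's messages stay as documented above.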
Here's a TL;DR for you:
The system starts and stops stations, so at some points in its normal operation stations won't have any top app running. This should not be reported as a critical error, because it's normal.
The state is reported through an ID, a prefix, and a human-readable message. Currently I need to parse the third one to know I must ignore the "critical" error. That's bad.
Currently my query of MKLivestatus is failing because I'm getting the following report when I query
response is
Notice the first row returned:
I consider having no top app running a "normal(ish)" situation, not a critical error... it'd be better if the standard return could account for the fact that no app is running, so I can parse critical errors as such and generate an error in the log.
PS: I'm parsing the return with a regexp, so we should have a well-defined (ideally documented) response... otherwise the regexp will break if something I've never seen before appears, as in this case.