mickem / nscp

NSClient++
http://nsclient.org
GNU General Public License v2.0

check_process/check_service - need more information #364

Open nmat opened 7 years ago

nmat commented 7 years ago

It would be nice to have a better way to present summary information for a specific process.

When monitoring a process for performance, it would help to be able to set a filter or output format that reports more about that process, for example:

OK - [process] - CPU: [cpu_used] , mem: [mem_used_workingset] MB, Handles: [handles]

A richer check summary, together with graphs for those values, would make process monitoring much more useful. Is this possible right now? In NSClient++ 0.5.0.64 I can only see values calculated in bytes, unless I use counters.

It would also be nice to have the numbers for those specific checks in larger units (MB, GB, etc.).

mickem commented 7 years ago

Not sure I understand, but it sounds like you're looking for detail-syntax?

From: https://docs.nsclient.org/reference/windows/CheckSystem/#usage_11

Changing the returned text:

check_process process=explorer.exe "warn=working_set > 70m" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
explorer.exe ws:77271040, handles: 800, user time:107s
Performance data: 'explorer.exe ws_size'=73M;70;0
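
The `${working_set}` value in the text is a raw byte count, while the performance data is scaled. A minimal sketch of the conversion, assuming 1 MB = 1048576 bytes (this matches other figures in this thread, where 86052864 bytes is reported as 82.0664MB). `bytes_to_mb` is a hypothetical helper, not part of NSClient++:

```shell
# Hypothetical helper: convert a byte count to MB (assumes 1 MB = 1048576 bytes).
bytes_to_mb() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

# Working set of explorer.exe from the documentation example.
bytes_to_mb 77271040   # prints 73.69
```

The same helper could post-process any byte value that a check returns when the monitoring side needs a larger unit.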

nmat commented 7 years ago

I have tested various ways of using detail-syntax with check_nrpe, but the output does not change as intended. This may be a lack of knowledge on my part, but if I use:

./check_nrpe -H [ip] -c check_process -a "process=[process]" "warn=working_set > 70m" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"

The output will still be:

OK: all processes are ok

The detail syntax is very confusing in this case.

mickem commented 7 years ago

There are several format strings. You can see them by running the command with show-default:

check_cpu show-default
L        cli OK: "filter=core = 'total'" "warning=load > 80" "critical=load > 90" "empty-state=ignored" "top-syntax=${status}: ${problem_list}" "ok-syntax=%(status): CPU load is ok." "detail-syntax=${time}: ${load}%" "perf-syntax=${core} ${time}"

So in your case what you want is most likely to:

  1. set "top-syntax=${status}: ${list}" to show the list of items even when everything is ok (not just when there are problems)
  2. set "ok-syntax=none" to disable the default ok text.
  3. set detail-syntax to whatever you want.

nmat commented 7 years ago

Hello,

I tested the suggestion above with check_nrpe on OP5, and here is the NRPE command line:

./check_nrpe -H [host] -c check_process -a "process=[process]" "warn=working_set > 700M" "top-syntax=${status}: ${list}" "ok-syntax=none" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"

Result:

none|process state'=1;0;0 'Process ws_size'=68.941MB;700;0 'count'=1;0;0

So unfortunately it did not work as expected. :/

nmat commented 7 years ago

From my point of view, the output differs between running the command through NRPE and running it locally. If that is the case, it's hard to follow the documentation on the website, where you expect the result to be the same in both cases.

mickem commented 7 years ago

Not sure I follow. The result is the same; the only difference is the extra options that check_nrpe requires (you put -a before the check options, in addition to -H and -c).

More importantly, check_nrpe is only one of many ways to interact with NSClient++, so I tend to use the generic form in the docs. How to use check_nrpe is covered in the check_nrpe docs.
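
The mapping from the generic form to a check_nrpe call can be sketched as a small shell function. This is only an illustration: `nrpe_form` is a hypothetical helper, `192.0.2.10` is a placeholder address, and the function prints the command line instead of running it, so no host is contacted. It assumes the arguments contain no single quotes.

```shell
# Hypothetical helper: print the check_nrpe form of a generic NSClient++ command.
# Usage: nrpe_form HOST COMMAND [ARG...]
nrpe_form() {
    host=$1
    cmd=$2
    shift 2
    printf './check_nrpe -H %s -c %s -a' "$host" "$cmd"
    for arg in "$@"; do
        # Single-quote each argument so it is passed through literally.
        printf " '%s'" "$arg"
    done
    printf '\n'
}

nrpe_form 192.0.2.10 check_process 'process=explorer.exe' 'warn=working_set > 70m'
```

The check name goes after -c and everything that follows the check name in the generic form goes after -a, unchanged.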

mickem commented 7 years ago

Your command yields the following for me:

check_process "process=[process]" "warn=working_set > 700M" "top-syntax=${status}: ${list}" "ok-syntax=none" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
L        cli CRITICAL: CRITICAL: [process] ws:0, handles: 0, user time:0s
L        cli  Performance data: '[process] state'=0;0;0 '[process] ws_size'=0MB;700;0 'count'=1;0;0

In an ok scenario I get the same:

check_process "process=[process]" critical=none "warn=working_set > 7000M" "top-syntax=${status}: ${list}" "ok-syntax=none" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
L        cli OK: OK: [process] ws:0, handles: 0, user time:0s
L        cli  Performance data: '[process] ws_size'=0GB;6.83593;0

So if you get something else it could be a bug which has since been fixed (as I am on the latest version).

Akira74 commented 7 years ago

Hi, I can confirm this behaviour of the client in my environment. The version is 0.5.0.65 2016-11-13, and I saw the same behaviour in earlier versions. The following command line:

./check_nrpe -H [IP] -c check_process -a "process=nscp.exe" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"

results in:

OK: all processes are ok.|'nscp.exe state'=1;0;0 'count'=1;0;0

I enabled debugging on the system, and this is what I can see:

D  w32system Parsing: state != 'unreadable'
D  w32system Parsing succeeded: (tbd){(int)var:state ? (s){unreadable}}
D  w32system Type resolution succeeded: (bool){(int)var:state ? {ui:1}convert((s){unreadable})}
D  w32system Binding succeeded: (bool){(int)var:state ? {ui:1}convert((s){unreadable})}
D  w32system Static evaluation succeeded: (bool){(int)var:state ? {ui:1}convert((s){unreadable})}
D  w32system Parsing: state not in ('started')
D  w32system Parsing succeeded: (tbd){(int)var:state not in (s){started}}
D  w32system Type resolution succeeded: (bool){(int)var:state not in {ui:1}convert((s){started})}
D  w32system Binding succeeded: (bool){(int)var:state not in {ui:1}convert((s){started})}
D  w32system Static evaluation succeeded: (bool){(int)var:state not in {ui:1}convert((s){started})}
D  w32system Parsing: state = 'stopped'
D  w32system Parsing succeeded: (tbd){(int)var:state = (s){stopped}}
D  w32system Type resolution succeeded: (bool){(int)var:state = {ui:1}convert((s){stopped})}
D  w32system Binding succeeded: (bool){(int)var:state = {ui:1}convert((s){stopped})}
D  w32system Static evaluation succeeded: (bool){(int)var:state = {ui:1}convert((s){stopped})}
D  w32system Parsing: count = 0
D  w32system Parsing succeeded: (tbd){(int)var:count = (i){0}}
D  w32system Type resolution succeeded: (bool){(int)var:count = (i){0}}
D  w32system Binding succeeded: (bool){(int)var:count = (i){0}}
D  w32system Static evaluation succeeded: (bool){(int)var:count = (i){0}}
D  w32system Crit/warn/ok did not match:  ws:, handles: , user time:s
D  w32system Crit/warn/ok did not match: <END>
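
The last debug lines show the detail text with empty fields (`ws:, handles: , user time:s`). One possible explanation, not confirmed anywhere in this thread, is that the shell on the monitoring host expanded `${exe}`, `${working_set}` and the other placeholders inside the double quotes before check_nrpe sent them. The sketch below only demonstrates that shell behaviour and contacts no host; `show_arg` is a hypothetical helper.

```shell
# Hypothetical helper: print one argument exactly as the program receives it.
show_arg() {
    printf '%s\n' "$1"
}

# These names are normally unset in a shell; set them empty so the demo is deterministic.
exe='' working_set='' handles='' user=''

# Double quotes: the shell substitutes ${...} before the program runs.
show_arg "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
# prints: detail-syntax= ws:, handles: , user time:s

# Single quotes: the placeholders arrive untouched.
show_arg 'detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s'
```

If this is the cause, single-quoting the arguments on the check_nrpe command line would keep the placeholders intact. The `%(status)` form that appears in the show-default output looks like an alternative spelling that the shell would not touch, but that is an assumption here.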

But if I run the following (only in the client, as the payload is too large otherwise), the result is as expected:

check_process "top-syntax=${status}: ${list}" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s" debug

The debug log (excerpt) also shows a bit more:

D  w32system Parsing: state != 'unreadable'
D  w32system Parsing succeeded: (tbd){(int)var:state ? (s){unreadable}}
D  w32system Type resolution succeeded: (bool){(int)var:state ? {ui:1}convert((s){unreadable})}
D  w32system Binding succeeded: (bool){(int)var:state ? {ui:1}convert((s){unreadable})}
D  w32system Static evaluation succeeded: (bool){(int)var:state ? {ui:1}convert((s){unreadable})}
D  w32system Parsing: state not in ('started')
D  w32system Parsing succeeded: (tbd){(int)var:state not in (s){started}}
D  w32system Type resolution succeeded: (bool){(int)var:state not in {ui:1}convert((s){started})}
D  w32system Binding succeeded: (bool){(int)var:state not in {ui:1}convert((s){started})}
D  w32system Static evaluation succeeded: (bool){(int)var:state not in {ui:1}convert((s){started})}
D  w32system Parsing: state = 'stopped'
D  w32system Parsing succeeded: (tbd){(int)var:state = (s){stopped}}
D  w32system Type resolution succeeded: (bool){(int)var:state = {ui:1}convert((s){stopped})}
D  w32system Binding succeeded: (bool){(int)var:state = {ui:1}convert((s){stopped})}
D  w32system Static evaluation succeeded: (bool){(int)var:state = {ui:1}convert((s){stopped})}
D  w32system Parsing: count = 0
D  w32system Parsing succeeded: (tbd){(int)var:count = (i){0}}
D  w32system Type resolution succeeded: (bool){(int)var:count = (i){0}}
D  w32system Binding succeeded: (bool){(int)var:count = (i){0}}
D  w32system Static evaluation succeeded: (bool){(int)var:count = (i){0}}
D  w32system Filter did not match:  ws:0, handles: 0, user time:0s
D  w32system Crit/warn/ok did not match: smss.exe ws:565248, handles: 32, user time:0s
D  w32system Crit/warn/ok did not match: csrss.exe ws:2891776, handles: 1099, user time:0s
D  w32system Crit/warn/ok did not match: wininit.exe ws:1843200, handles: 82, user time:0s
D  w32system Crit/warn/ok did not match: csrss.exe ws:95416320, handles: 1184, user time:1s
D  w32system Crit/warn/ok did not match: services.exe ws:11513856, handles: 391, user time:23s
D  w32system Crit/warn/ok did not match: winlogon.exe ws:4767744, handles: 135, user time:0s
D  w32system Crit/warn/ok did not match: lsass.exe ws:16502784, handles: 1248, user time:14s
D  w32system Crit/warn/ok did not match: lsm.exe ws:4210688, handles: 274, user time:0s
D  w32system Crit/warn/ok did not match: svchost.exe ws:9449472, handles: 445, user time:69s
D  w32system Crit/warn/ok did not match: svchost.exe ws:8556544, handles: 556, user time:2s
D  w32system Crit/warn/ok did not match: svchost.exe ws:17145856, handles: 615, user time:10s

Every time I add the "process" value, the result is not the expected one. Only when I add a warning or critical condition such as "warn=working_set > 70m" do I also get this value in the result.

mickem commented 7 years ago

With 0.5.0 I get the expected result, so I am not sure what is amiss. Could you let me know whether it is w32 or x64, and attach any relevant config?

check_process "process=explorer.exe" "warn=working_set > 700M" "top-syntax=${status}: ${list}" "ok-syntax=none" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
L        cli OK: OK: explorer.exe ws:86052864, handles: 3027, user time:105s
L        cli  Performance data: 'explorer.exe state'=1;0;0 'explorer.exe ws_size'=82.0664MB;700;0 'count'=1;0;0

As well as:

check_process "process=nscp.exe" "warn=working_set > 700M" "top-syntax=${status}: ${list}" "ok-syntax=none" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
L        cli OK: OK: nscp.exe ws:21393408, handles: 467, user time:15s, nscp.exe ws:44482560, handles: 439, user time:541s, nscp.exe ws:33116160, handles: 414, user time:0s
L        cli  Performance data: 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=20.40234MB;700;0 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=42.42187MB;700;0 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=31.58203MB;700;0 'count'=3;0;0
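
When several processes share a name, the performance data repeats the same label once per instance. A minimal sketch of extracting the `ws_size` values from such a string and totalling them, assuming the `'label'=value[unit];warn;crit` layout seen in the output above. `sum_ws_size` is a hypothetical helper, the sample string is copied from that output, and `grep -o` is assumed to be available.

```shell
# Hypothetical helper: sum every ws_size value found in a perfdata string.
sum_ws_size() {
    # grep keeps only the ws_size'=<number> fragments, one per line;
    # awk splits each fragment on "=" and adds up the numeric part.
    printf '%s\n' "$1" |
        grep -o "ws_size'=[0-9.]*" |
        LC_ALL=C awk -F= '{ total += $2 } END { printf "%.2f\n", total }'
}

perf="'nscp.exe state'=1;0;0 'nscp.exe ws_size'=20.40234MB;700;0 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=42.42187MB;700;0 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=31.58203MB;700;0 'count'=3;0;0"

sum_ws_size "$perf"   # prints 94.41
```

The sketch ignores the unit suffix, so it is only meaningful when every instance is reported in the same unit, as in this sample.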

nmat commented 7 years ago

Hello,

So I have tested a few versions now and here is the result:

check_process "process=nscp.exe" "warn=working_set > 700M" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
L        cli OK: OK: all processes are ok.
L        cli  Performance data: 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=9.20312MB;700;0 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=26.90234MB;700;0 'count'=2;0;0

check_process "process=nscp.exe" "warn=working_set > 700M" "ok-syntax=none" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
L        cli OK: none
L        cli  Performance data: 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=8.9414MB;700;0 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=26.89843MB;700;0 'count'=2;0;0

check_process "process=nscp.exe" "warn=working_set > 700M" "top-syntax=${status}: ${list}" "ok-syntax=none" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
L        cli OK: OK: nscp.exe ws:9580544, handles: 377, user time:3s, nscp.exe ws:28319744, handles: 388, user time:0s
L        cli  Performance data: 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=9.13671MB;700;0 'nscp.exe state'=1;0;0 'nscp.exe ws_size'=27.00781MB;700;0 'count'=2;0;0

Now, to explain why this was hard to figure out: the documentation shows this:

check_process process=explorer.exe "warn=working_set > 70m" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${user}s"
explorer.exe ws:77271040, handles: 800, user time:107s
Performance data: 'explorer.exe ws_size'=73M;70;0

So the documentation is not really correct, right? You need to add some extra parameters to get the result it shows?

I am testing this on Windows Server 2016 (64-bit) with PowerShell 5.1.14393.1532.

nmat commented 7 years ago

Note that I have not tested this with the output returned from check_nrpe.

nmat commented 7 years ago

All tests were done using NSClient++ 0.5.1.44.

I have now also tested with the check_nrpe that is provided with the OP5 monitoring system.

The behaviour is still the same; I get the following results:

./check_nrpe -H $HOSTADDRESS$ -c check_process -a "process=nscp.exe" "warn=working_set > 1000M" "crit=working_set > 1300M" "ok-syntax=none" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${users}s"
none|'nscp.exe ws_size'=0.02587GB;0.97656;1.26953
./check_nrpe -H $HOSTADDRESS$ -c check_process -a "process=nscp.exe" "warn=working_set > 1000M" "crit=working_set > 1300M" "detail-syntax=${exe} ws:${working_set}, handles: ${handles}, user time:${users}s"
OK: all processes are ok.|'nscp.exe ws_size'=0.02587GB;0.97656;1.26953

In general, running the same command locally on the server, without NRPE, yields a different result from what NRPE returns. So running the command remotely against the server gives the wrong information.