sensu-plugins / sensu-plugins-process-checks

This plugin provides native process instrumentation for monitoring and metrics collection, including: process status, uptime, thread count, and others.
http://sensu-plugins.io
MIT License

sensu-client process is not getting monitored by the check-process.rb script #55

Open gopalvd opened 6 years ago

gopalvd commented 6 years ago

Team, I am seeing strange behaviour. I am trying to monitor the sensu-client process running on my Linux server. For that I have set up the check definition and installed the required gems. This is what I am seeing.

When I run the script manually I see CheckProcess OK and 1 matching process found.

[root@XXXXXXXX checks]# /opt/sensu/embedded/bin/ruby /etc/sensu/plugins/check-process.rb -p sensu-client
CheckProcess OK: Found 1 matching processes; cmd /sensu-client/

When sensu-client itself is running the check, I see this.

{"timestamp":"2017-11-09T03:53:50.181638-0600","level":"info","message":"publishing check result","payload":{"client":"xxxxxxxxxx","check":{"command":"/etc/sensu/plugins/check-process.rb -p sensu-client","subscribers":[],"standalone":true,"handlers":["default","mailer"],"interval":30,"type":"standard","mail_to":"xxxxxxxxxxxx@target.com","name":"check_process-test","issued":1510221230,"executed":1510221230,"duration":0.17,"output":"CheckProcess CRITICAL: Found 0 matching processes; cmd /sensu-client/\n","status":2}}}

Can someone help? What could be the issue?

majormoses commented 6 years ago

Hmm that is indeed strange, when I have the time I will try to replicate in my environment. I never really considered using sensu to monitor the sensu process as how would it even get the request from the transport (rabbitmq|redis) if the process is dead? Even in the case of standalone checks we still have the same problem of it being able to locally schedule itself.

I think you probably should leverage native functionality in your process supervisor such as upstart, runit, systemd, etc to attempt restarting the process if it dies. It's kinda a who watches the watcher problem. Here are some other suggestions which are not mutually exclusive with what was proposed above:

That being said, if the process is in fact up then there is definitely a bug, config, or environment issue somewhere.

Can you please include the version of this plugin, sensu-server, and sensu-client?

gopalvd commented 6 years ago

@majormoses Thanks for looking into this. I totally agree that monitoring sensu-client with sensu is not a great idea, and we are using a different tool for that. And yes, this plugin works perfectly with other processes like crond, chef-client, systemd, etc. I noticed this strange behaviour only with sensu-client, and it was unintentional: I wanted to test the plugin, gave it the sensu-client process to check, and noticed this.

Yes, to me it seems like a bug.

Here are the details that you are looking for.
gem version for plugins ---- 2.5.0
sensu-client version --- "sensu-0.26.3-1.x86_64"
sensu-client version --- "sensu-1.0.2-1.el7.x86_64"

gopalvd commented 6 years ago

Sorry, the last one is the sensu-server version:
sensu-server version --- "sensu-1.0.2-1.el7.x86_64"

majormoses commented 6 years ago

Hmm this is what I show in my environment and I can't explain it, I will look through the code when I have some time. This is what I got manually which matches your output:

babrams@ip-10-55-141-110:~$ /opt/sensu/embedded/bin/check-process.rb -p sensu-client
CheckProcess OK: Found 1 matching processes; cmd /sensu-client/

I was scratching my head for a minute (clearly tired) as this is what I initially did and got back:

babrams@ip-10-55-141-110:~$ /opt/sensu/embedded/bin/check-process.rb - sensu-client
CheckProcess OK: Found 121 matching processes

gopalvd commented 6 years ago

Does that mean it is a bug, or do some options need to be used in the command?

majormoses commented 6 years ago

I was missing the p after the -, so it was matching all processes, which basically agrees with:

babrams@ip-10-55-141-110:~$ ps -Al | wc -l
122

gopalvd commented 6 years ago

I did the same:
CheckProcess OK: Found 402 matching processes

gopalvd commented 6 years ago

Correct. When I run the command myself, it reports one matching process. But when sensu-client executes it, I see 0 matching processes and it throws this error:
CheckProcess CRITICAL: Found 0 matching processes; cmd /sensu-client/

gopalvd commented 6 years ago

@majormoses Do you have any updates on this issue?

majormoses commented 6 years ago

No, I honestly have not really thought about it much, as having sensu check whether sensu is running is such a limited use case. Without using TTLs there is no value in it. I marked it as low priority to reflect that it will not likely see a quick resolution without another contributor who is more motivated to solve it. It is an intriguing problem, so if no one volunteers to triage it I will get to it when I have the time.

xlr5 commented 6 years ago

Hello, I ran into this issue today as well. IMO monitoring the sensu client process is necessary to define proper dependencies. I don't want any further notifications in case the sensu-client dies. And it behaves as described above: executing check-process.rb on the shell returns 1 process running, while doing the same check via sensu returns no running process.

Can you please take a look?

majormoses commented 6 years ago

I agree it's a bug, but I don't see much value in fixing it. By the very nature of this check and how sensu works, if the sensu-client process is not running the check would never be executed. I would suggest looking at the keepalive, which is a built-in TTL check meant to solve this exact problem. You can also add a very simple check which runs /bin/true and another TTL check on top of that, which is essentially what the built-in healthcheck does for you, except that it runs at a fixed 20 second interval. Other options include scheduling a wrapper script via cron and, if there are issues (check $?), updating the result via the server's API, or using monit to update the API. Bottom line: unless it is a TTL check there is zero value in executing this from within the sensu client process, as it is a chicken-and-egg scenario.
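As a rough illustration of the cron wrapper idea, here is a minimal Ruby sketch. It assumes the Sensu 1.x results API accepts a POST to /results with source, name, output and status fields; the API URL, client name, and check name are hypothetical placeholders, and the endpoint details should be verified against the Sensu API documentation before relying on this. By default it only prints the payload; pass --send to actually post it.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical values for illustration: adjust to your environment.
API_URL = 'http://sensu-api.example.com:4567/results'
CLIENT_NAME = 'my-host'

# Build a check result payload (field names are an assumption about the
# Sensu 1.x results API, not taken from this thread).
def build_result(source, running)
  {
    source: source,
    name: 'sensu_client_process', # hypothetical check name
    output: running ? 'sensu-client is running' : 'sensu-client is NOT running',
    status: running ? 0 : 2 # 0 = OK, 2 = CRITICAL
  }
end

# True when pgrep finds at least one command line matching the pattern.
# Returns false if pgrep exits non-zero or is not installed.
def process_running?(pattern)
  system('pgrep', '-f', pattern, out: File::NULL, err: File::NULL) == true
end

# POST the result as JSON to the given URL.
def post_result(url, result)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(result)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

result = build_result(CLIENT_NAME, process_running?('sensu-client'))
if ARGV.include?('--send')
  post_result(API_URL, result)
else
  puts JSON.generate(result) # dry run: show what would be sent
end
```

Run from cron, a script like this reports from outside the sensu-client process, so it keeps working when the client is down. Pairing the result with a TTL on the server side would additionally catch the case where the cron job itself stops reporting.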

I do not work for sensu, nor am I monetarily compensated for maintaining the plugins. There are lots of issues for me to look at to fix bugs, enhance plugins, and do housekeeping. This means I need to prioritize the things I spend my efforts on. While I find the problem itself intriguing, the use case makes little sense to me, so I will not be prioritizing this over things that I consider to bring greater value to the community. I am more than happy to review a PR should you or someone else find the issue.

If you think this is important you have a couple options:

majormoses commented 6 years ago

I have pinged the other maintainers as well for another set of :eyes: as maybe they will have a different perspective than I do.

jaredledvina commented 6 years ago

I agree with @majormoses that the use case here isn't totally clear. Looking through the code, I would be interested to see whether passing -m or possibly -M causes this to work though: https://github.com/sensu-plugins/sensu-plugins-process-checks/blob/master/bin/check-process.rb#L79-L84

majormoses commented 6 years ago

@jaredledvina good call, these options might help, but again this gives you no real value without adding a TTL check, and I would suggest using the keepalive instead as that is its whole purpose.

The options are defined here:

The code that rejects the processes unless args are changed:

huynt1979 commented 6 years ago

@majormoses The sane philosophy for this would be using another tool, even one as simple as cron, to monitor a monitoring tool like sensu-client. It's weird to have a monitoring tool watch itself because when it dies, what would alert you... you are already DEAD? Anyway, if you insist, this line may shed more light on why @gopalvd saw what he saw... https://github.com/sensu-plugins/sensu-plugins-process-checks/blob/master/bin/check-process.rb#L258-L259

majormoses commented 6 years ago

The sane philosophy for this would be using another tool, even one as simple as cron, to monitor a monitoring tool like sensu-client. It's weird to have a monitoring tool watch itself because when it dies, what would alert you... you are already DEAD?

Yup, the only other real alternative that is native to sensu, and that provides value in certain scenarios such as a client dying while the server remains functional, is a TTL. Bottom line: even if adding those args solves the problem, it will not provide any value without adding a TTL. If you are going to do that, you already have something built into sensu that does this for you: it's called the keepalive check. It is hardcoded to send an update every 20 seconds, and you can adjust your keepalive thresholds and dependencies as you see fit.

majormoses commented 6 years ago

After reading through the code and seeing those options pointed out, I highly doubt this is a bug anymore. It sounds like intended functionality with sane defaults. The use case is still flawed, but it would appear to be possible.
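To make the likely explanation concrete, here is a small Ruby sketch. It is not the plugin's source; it assumes, as the linked lines suggest, that the check drops its own PID and its parent's PID from the matches unless told otherwise. The process table, PIDs, and command lines are made up, and the match_self / match_parent keywords stand in for whichever of the -m / -M flags controls each behaviour.

```ruby
# Illustrative sketch only: NOT the plugin's actual source.
# Each process is a hash with :pid and :command, standing in for parsed `ps` output.

# Return the processes whose command matches `pattern`, dropping the
# checker's own PID and its parent's PID unless explicitly allowed.
def matching_processes(procs, pattern, self_pid:, parent_pid:,
                       match_self: false, match_parent: false)
  matched = procs.select { |p| p[:command] =~ pattern }
  matched = matched.reject { |p| p[:pid] == self_pid } unless match_self
  matched = matched.reject { |p| p[:pid] == parent_pid } unless match_parent
  matched
end

# Hypothetical process table.
procs = [
  { pid: 100, command: '/opt/sensu/embedded/bin/ruby /opt/sensu/bin/sensu-client' },
  { pid: 200, command: '-bash' },
  { pid: 300, command: 'ruby check-process.rb -p sensu-client' }
]
pattern = /sensu-client/

# Run from an interactive shell: the parent is bash (PID 200).
from_shell = matching_processes(procs, pattern, self_pid: 300, parent_pid: 200)

# Run by sensu-client: the parent is the sensu-client process itself (PID 100),
# so the only real match is filtered out.
from_client = matching_processes(procs, pattern, self_pid: 300, parent_pid: 100)

# Same situation, but with the parent allowed to match.
from_client_with_parent = matching_processes(procs, pattern, self_pid: 300,
                                             parent_pid: 100, match_parent: true)

puts from_shell.size
puts from_client.size
puts from_client_with_parent.size
```

Under these assumptions the shell run finds one process and the client-scheduled run finds none, which is consistent with the OK and CRITICAL results reported above.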

xlr5 commented 6 years ago

Well, after reading your arguments I came to the conclusion that I need to rethink my approach. It comes from my Nagios installation, where I used service dependencies on the check that nrpe is running to suppress further service notifications when the host died.

I admit that I need to learn better how sensu handles those situations.

majormoses commented 6 years ago

You can do the same thing, just use keepalive. As this is clearly something that multiple people have tried without success, I feel like we should add some documentation to help users out.