sensu-plugins / sensu-plugins-cpu-checks

This plugin provides native CPU instrumentation for monitoring and metrics collection, including: CPU usage and metrics for user, nice, system, idle, iowait, irq, softirq, steal, and guest.
http://sensu-plugins.io
MIT License

check-cpu.rb alerts sometimes regardless of load on cpu #35

Closed: infa-ddeore closed this issue 6 years ago

infa-ddeore commented 6 years ago

check-cpu.rb sometimes alerts that total CPU is 100% busy, but the top output captured at the same time as part of the check hooks doesn't show any load.

Explanation by @majormoses:

The check detects CPU usage by reading the current counters, reading them again after a pause (1s by default), and measuring the difference. The very act of doing so inflates CPU utilization, because it loads up Ruby. In addition, because it collects data over such a short timespan, the numbers will not be 100% accurate. You can use the --sleep option to control how long it waits before taking the second data point; the longer you wait, the more accurate it will be. In an ideal world I would change the default value to 5, as that is what I found to be a good compromise between accuracy and still being able to run very granular/frequent checks. The problem is that someone might be scheduling the check at an interval of less than 5 seconds, which would cause issues as you'd have multiple instances of the check running at the same time. I am willing to do it as a breaking change as I think it's a good move forward; I am just giving context on where it stands and why it was not changed previously.

Change the default of 1 second to something higher to get more accurate data and avoid unnecessary alerts.
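
For readers unfamiliar with the two-sample approach described above, here is a minimal sketch of it, assuming a Linux host where the first line of /proc/stat holds the aggregate counters. It is not the plugin's actual code: the names `CPU_FIELDS`, `parse_cpu_line`, `cpu_percentages`, and `sample_cpu` are hypothetical, and summing every column is a simplification.

```ruby
# Column order of the aggregate "cpu" line in /proc/stat on Linux.
# Simplification: guest time is also counted in user, so summing every
# column slightly overstates the total when guests are running.
CPU_FIELDS = %i[user nice system idle iowait irq softirq steal guest].freeze

# Parse a "cpu  ..." line into a hash of cumulative tick counters.
# Columns missing on older kernels are treated as 0.
def parse_cpu_line(line)
  values = line.split.drop(1)
  CPU_FIELDS.each_with_index.to_h { |field, i| [field, values[i].to_i] }
end

# Percentage of the sampling window spent in each state, computed from
# the difference between two snapshots of the counters.
def cpu_percentages(before, after)
  deltas = CPU_FIELDS.to_h { |field| [field, after[field] - before[field]] }
  total = deltas.values.sum
  return deltas.transform_values { 0.0 } if total.zero?

  deltas.transform_values { |delta| (100.0 * delta / total).round(2) }
end

# Hypothetical helper: take two snapshots `sleep_seconds` apart.
# A longer window dilutes the cost of starting the check itself.
def sample_cpu(sleep_seconds = 5, path = '/proc/stat')
  read = -> { parse_cpu_line(File.foreach(path).first) }
  before = read.call
  sleep sleep_seconds
  cpu_percentages(before, read.call)
end
```

Under these assumptions, `100 - sample_cpu(5)[:idle]` gives the busy percentage over a 5-second window, which is the kind of figure to compare against `top -d 5`.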

majormoses commented 6 years ago

@infa-ddeore Thanks for opening this up as your first issue. Welcome to GitHub!

I will submit a PR for this shortly.

majormoses commented 6 years ago

@infa-ddeore did you find that 5 was a reasonable number on your hardware profile? The more constrained the resources are, the more impact the poll interval (sleep) will have. On my local machine I noticed very little difference because I have plenty of CPU and memory, but when I last did that testing it was on an AWS VM (m4.large), if I recall correctly. I think it's worth exploring whether there might be a better default than 5. From my tests so far, it looks pretty close to top usage at 5; at worst it's been inflated by maybe as much as 2-3 percent, which I'd say is pretty negligible. Also keep in mind I have lots of things running on my local machine, and pretty much anything could cause a 2-3% variation between captures. I have been comparing the results to top -d 5 to make sure that we are not comparing apples and oranges.

majormoses commented 6 years ago

I have opened a pull request to change the default. Can you share any information on how much of an improvement 5 seconds is for you, and whether you found some other number that worked better?

infa-ddeore commented 6 years ago

I have asked the concerned team to use --sleep 5, so I can confirm only after they see the improvement. But as a default I am setting a 5-second delay for the instances that I own.

majormoses commented 6 years ago

@infa-ddeore have you heard back from that team on whether that helped?

infa-ddeore commented 6 years ago

They added this option just today. Since the issue is intermittent, we need to monitor for a few days. I will update after a few days (about 6-7).

infa-ddeore commented 6 years ago

There has not been any false CPU alert so far after adding 5 seconds of sleep.

infa-ddeore commented 6 years ago

Should we add a warning to the README about not setting the interval to less than the sleep, since the sleep default is 5 seconds now?

majormoses commented 6 years ago

I was thinking about that; I think it's OK as is. It's called out clearly in the changelog, and realistically, if someone wants to schedule it more frequently than every 10 seconds, they are probably OK with opening the code or the changelog. I would not oppose a PR to add some blurb about it, though.
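
To make the interval-versus-sleep concern concrete, here is a small sketch of the sanity check a README blurb might describe. It is illustrative only: `overlapping_runs?` and `schedule_warning` are hypothetical helpers that do not exist in the plugin, and the 1-second `overhead` allowance for Ruby start-up is an assumption, not a measured value.

```ruby
# Hypothetical helper, not part of sensu-plugins-cpu-checks.
# True when a check scheduled every `interval` seconds may still be
# running when the next run starts. `overhead` is an assumed allowance
# (in seconds) for interpreter start-up and exit.
def overlapping_runs?(interval:, sleep_seconds: 5, overhead: 1.0)
  interval <= sleep_seconds + overhead
end

# Returns a warning string for risky schedules, or nil when the
# interval leaves enough headroom for the sleep window.
def schedule_warning(interval:, sleep_seconds: 5, overhead: 1.0)
  risky = overlapping_runs?(interval: interval,
                            sleep_seconds: sleep_seconds,
                            overhead: overhead)
  return nil unless risky

  "check interval of #{interval}s leaves no headroom for --sleep " \
    "#{sleep_seconds}; runs may overlap"
end
```

With the new default, `schedule_warning(interval: 5)` returns a warning while `schedule_warning(interval: 60)` returns nil. Anyone scheduling the check that tightly would either lower --sleep or lengthen the interval.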