sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License
1.02k stars 175 forks source link

Check occurrences property doesn't update #1624

Closed aalkema closed 5 years ago

aalkema commented 6 years ago

I'm trying to set up a filter to activate a handler to send an alert to OpsGenie. I want to use the hourly example filter from here: https://docs.sensu.io/sensu-core/2.0/guides/reduce-alert-fatigue/. If I set this up with my check, it gets handled every time. I think this is because when I run sensuctl event list the occurrences property of the check is always 1. The code here: https://github.com/sensu/sensu-go/blob/master/backend/eventd/eventd.go#L227 implies to me that it should be the count of checks run that have returned the most recent status.

Expected Behavior

I expect this handler to be run when the check switches states, or once an hour.

Current Behavior

The handler is run every time the check fires.

Possible Solution

I believe the issue is with the occurrences property.

Steps to Reproduce (for bugs)

  1. sensuctl filter create hourly --action allow --statements "event.Check.Occurrences == 1 || event.Check.Occurrences % (3600 / event.Check.Interval) == 0"
  2. Any handler using the hourly filter
  3. Any check that you can toggle for testing
  4. Attach the handler to the check
  5. The handler will fire each time

Context

We want to create alerts in OpsGenie when these checks fail, then clear them when the check goes back to a good status.

Your Environment

  1. sensu-backend version 2.0.0-nightly#82c3128, build 82c31284eb9a378c60869544d95a2cc889a0d686, built 2018-05-23T09:22:02+0000
  2. sensu-agent version #f70b9f8, build f70b9f8a14ea926e4ce85467beecfe5cf731513d, built 2018-05-08T04:01:02+0000
  3. Windows 10 agent, Amazon Linux backend
echlebek commented 6 years ago

Hi there, unfortunately I haven't been able to reproduce your issue.

The occurrences field is indeed the count of failing checks. In my environment, this number updates reliably. (Using a check with command 'exit 2')

Note that occurrences are reset when the check is flapping, or when the exit status changes.

Are you sure that your check is not flapping, and the exit status is not changing between checks?

Are there any error messages in the backend log?

Thank you for the bug report, hopefully we can sort out what the problem is! :)

aalkema commented 6 years ago

Hey,

I am still seeing it, here's a gist with the event info. You can see even with several passes in a row after a fail I set up, it still says occurrences is 1: Link to gist

My backend is relatively up to date, but I am using a build of a windows agent someone sent me a couple weeks ago, version: sensu-agent version #f70b9f8, build f70b9f8a14ea926e4ce85467beecfe5cf731513d, built 2018-05-08T04:01:02+0000. It looks like the occurrences count is handled all in the backend though.

I see no errors in the log about that check, here's what I see when it comes in:

{"check":"check-{thing}","component":"pipelined","entity":"{entity}","environment":"default","level":"info","msg":"no handlers available","organization":"default","time":"2018-06-12T19:24:26Z"}

Anything else I should check?

Thanks!

echlebek commented 6 years ago

Thanks for the feedback, I'll take another stab at reproducing this tomorrow.

echlebek commented 6 years ago

Looks like 2.x is doing something slightly different than 1.x here. I'm going to file a PR to bring us in line with 1.x behaviour here.

echlebek commented 6 years ago

Hi @aalkema, @palourde mentioned something that made me wonder if I was misunderstanding your intended use case here. (See discussion here https://github.com/sensu/sensu-go/pull/1696)

In Sensu 2.x, passing checks cause events to occur as well as failing checks. So this makes the nature of occurrences somewhat different.

If your intent is for the handler to run on the first failing check, you can use the is_incident built-in filter. Would that resolve this issue for you?

aalkema commented 6 years ago

Hey @echlebek, sorry for the delay, I just got back from working on something else that came up. I see some work has been done here, but I'll let you know what I intended. I wanted this handler to fire whenever the check changes state (I was assuming occurrences was # of consecutive times this check has been in this state) and at least once an hour. I want it to fire on changing state so we can send an alert in opsgenie, and then to fire again when it goes green to clear that alert. Does that make sense? Am I wrong about my assumption for what occurrences means?

echlebek commented 6 years ago

Hi @aalkema,

Your issue prompted a long discussion internally. We realized that the design of occurrences in 2.x was somewhat flawed.

In 1.x, occurrences only exist for failing checks, since events only exist for failing checks. However in 2.x, events exist for all checks regardless of status.

We ported occurrences over without giving this much thought, reproducing how occurrences work faithfully. Unfortunately this doesn't capture the spirit of occurrences very well.

Have a look at #1766. Do you think if we implemented what's described in that issue, that your use case would be resolved?

aalkema commented 6 years ago

Yeah, that sounds great. I don't have a background in 1.x, so I didn't have expectations for that value. Incrementing it every time a check is the same as the previous one, and resetting to 1 when it's not makes a lot of sense to me. Let's you do some nice things with filters as well, occurrences == 1 always means it just changed state. Thanks!

pablolibo commented 5 years ago

Hey guys, thanks for the report, I have the same issue, let me know if we have any news

PD: I am using pagerduty

+1

palourde commented 5 years ago

It's been a long time but I just wanted to let everyone know we recently fixed some stuff around occurrences so the behavior is clearly defined: https://github.com/sensu/sensu-go/pull/2975.