sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io

Silence with expire_on_resolve set did not clear on recovery #3139

Closed roganartu closed 5 years ago

roganartu commented 5 years ago

The expire_on_resolve property of a Silenced entry is intended to cause the silence to be deleted automatically when the event resolves. However, I have encountered an event in an OK state with an open Silence that has expire_on_resolve set to True.

Expected Behavior

A Silence with expire_on_resolve set to True should be deleted when the event status is 0.
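
For reference, such a silence is typically created with sensuctl along these lines (the flags mirror the ones used in the reproduction later in this thread; the entry shown below was additionally scoped to the entity:host subscription):

sensuctl silenced create --check foo --reason "some reason" --expire-on-resolve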

Current Behavior

I have an event with a status of 0 that also has an active Silence with expire_on_resolve set to True.

Here are the details. I changed the hostname, check name, and output to placeholders for simplicity.

╰ sensuctl event info --format table host foo
=== host - foo
Entity:      host
Check:       foo
Output:      some output
Status:      0
History:     3,2,2,2,2,3,2,2,2,2,2,3,2,2,2,2,2,0,0,0,0
Silenced:    true
Silenced By: entity:host:foo
Timestamp:   2019-07-12 16:00:04 +0000 UTC
╰ sensuctl silenced info --check foo  --subscription entity:host    
{
  "metadata": {
    "name": "entity:host:foo",
    "namespace": "default"
  },
  "expire": -1,
  "expire_on_resolve": true,
  "creator": "admin",
  "check": "foo",
  "reason": "some reason",
  "subscription": "entity:host",
  "begin": 1562709459
}

Note the begin timestamp is several days ago, while the check only recovered earlier today. I would have expected this Silence to have been deleted when the check transitioned from status == 2 to status == 0 in the history above.

Steps to Reproduce (for bugs)

I have not attempted to reproduce this with a new silence and a fake check, so I am unsure whether this is a bug for all Silenced entries or just some kind of race condition.

There are no entries in the logs matching the term silence that aren't from the apid component, so I'm not sure where to start looking into this.

Possible Solution

From a quick look over the source here, I'm not sure why this is occurring:

https://github.com/sensu/sensu-go/blob/27ad8a3db009abfee339de2ad649a7f74266291f/backend/eventd/silenced.go#L54-L80

However, even if this turns out to be a transient error, I would personally argue that only applying expire_on_resolve to DOWN -> OK transitions makes the feature confusing to users and less robust. An expire_on_resolve silence for an OK alert doesn't really make any sense to me. I would imagine it was added with the intent of letting users create an "ack"-like silence that only persists as long as the event is in a down state.

In any case, I think this pipeline probably needs more logging to determine why this is occurring. There doesn't seem to be any logging during the happy-path where a silence is successfully deleted, which makes it difficult to determine whether this code path was even triggered as expected.
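
As a rough illustration of the kind of happy-path debug line I mean (a minimal standalone sketch using logrus; the field names and message are illustrative, not a patch against silenced.go):

package main

import "github.com/sirupsen/logrus"

func main() {
    logrus.SetLevel(logrus.DebugLevel)

    // In silenced.go this would sit next to the code that deletes an
    // expire_on_resolve silenced entry, so successful deletions show up in
    // the debug logs instead of only the early-return paths.
    logrus.WithFields(logrus.Fields{
        "silenced": "entity:host:foo",
        "check":    "foo",
    }).Debug("deleting silenced entry because the event resolved (expire_on_resolve)")
}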

Your Environment

roganartu commented 5 years ago

I no longer think this is a race condition as it seems to happen for every silence we create with expire_on_resolve set. We currently have 11 events with status: 0 that have an open expire_on_resolve: true silence that was created before the check resolved.

This makes silences very dangerous in their current state, as a check can recover and have a lingering silence that suppresses future alerts at any time. This is exacerbated by the fact that you would normally not set an expiry date on a silence that you are setting expire_on_resolve on, so these are effectively permanent alert suppressions unless someone notices the silence didn't clear and manually deletes it.

rgeniesse commented 5 years ago

Steps to reproduce:

1. Add a check that will be in a warn or critical state.
2. Once that check goes to warn or critical, silence the event in the Dashboard, making sure to check the "expire when a matching check resolves" option.
3. Have the check go into an OK state.
4. Observe that the silence entry remains.

I used the following check, set the warn threshold to 5 to force a non-OK event, then pushed it back to 80 to force it back into OK.

---
type: CheckConfig
api_version: core/v2
metadata:
  namespace: default
  name: check_disk
spec:
  command: check-disk-usage.rb -w 80 -c 90
  subscriptions:
  - system
  interval: 60
  publish: true
  handlers:
  - slack
  runtime_assets:
  - sensu-plugins-disk-checks
  - sensu-ruby-runtime
---
type: Asset
api_version: core/v2
metadata:
  name: sensu-plugins-disk-checks
  namespace: default
spec:
  url: https://github.com/sensu-plugins/sensu-plugins-disk-checks/releases/download/5.0.0/sensu-plugins-disk-checks_5.0.0_centos_linux_amd64.tar.gz
  sha512: 0ce9d52b270b77f4cab754e55732ae002228201d0bd01a89b954a0655b88c1ee6546e2f82cfd1eec04689af90ad940cde128e8867912d9e415f4a58d7fdcdadf
---
type: Asset
api_version: core/v2
metadata:
  name: sensu-ruby-runtime
  namespace: default
spec:
  url: https://github.com/sensu/sensu-ruby-runtime/releases/download/0.0.4/sensu-ruby-runtime_0.0.4_centos_linux_amd64.tar.gz
  sha512: 1c9f0aff8f7f7dfcf07eb75f48c3b7ad6709f2bd68f2287b4bd07979e6fe12c2ab69d1ecf5d4b9b9ed7b96cd4cda5e55c116ea76ce3d9db9ff74538f0ea2317a

roganartu commented 5 years ago

So the problem is that event history is null when handleExpireOnResolveEntries is called, no matter what the history actually is.

I added a bunch of debug logging to that method and then spun up a backend in docker and ran the following:

$ cat silenced.json
{
  "type": "CheckConfig",
  "api_version": "core/v2",
  "metadata": {
    "name": "silenced_test",
    "namespace": "default"
  },
  "spec": {
    "command": "exit 1",
    "handlers": [],
    "interval": 60,
    "publish": true,
    "subscriptions": [
        "switches"
    ],
    "timeout": 5,
    "ttl": 120
  }
}

$ sensuctl create -f silenced.json
Created

then waited for the check to fail:

 sensuctl event list
  Entity       Check                                    Output                                Status   Silenced             Timestamp
 ──────── ─────────────── ────────────────────────────────────────────────────────────────── ──────── ────────── ───────────────────────────────
  agent1   keepalive       Keepalive last sent from agent1 at 2019-08-12 20:36:51 +0000 UTC        0   false      2019-08-12 16:36:51 -0400 EDT
  agent1   silenced_test                                                                           1   false      2019-08-12 16:38:26 -0400 EDT

Observing the following in the logs:

backend1_1  | {"component":"schedulerd","level":"debug","msg":"check is not subdued","name":"silenced_test","namespace":"default","scheduler_type":"interval","time":"2019-08-12T20:37:26Z"}
backend1_1  | {"check":"silenced_test","component":"schedulerd","level":"debug","msg":"sending check request","time":"2019-08-12T20:37:26Z","topic":"sensu:check:default:switches"}
agent1_1    | {"component":"agent","content_type":"application/octet-stream","level":"info","msg":"message received","payload_size":63,"time":"2019-08-12T20:37:26Z","type":"check_request"}
agent1_1    | {"component":"agent","level":"info","msg":"scheduling check execution: silenced_test","time":"2019-08-12T20:37:26Z"}
backend1_1  | {"component":"agentd","level":"debug","msg":"session - sending message","payload_size":63,"time":"2019-08-12T20:37:26Z"}
agent1_1    | {"component":"agent","content_type":"application/octet-stream","level":"info","msg":"sending message","payload_size":354,"time":"2019-08-12T20:37:26Z","type":"event"}
agent1_1    | {"component":"agent","level":"info","msg":"sending event","time":"2019-08-12T20:37:26Z"}
backend1_1  | {"component":"eventd","event":{"timestamp":1565642246,"entity":{"entity_class":"agent","system":{"hostname":"agent1","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.4","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"eth0","mac":"02:42:ac:15:00:03","addresses":["172.21.0.3/16"]}]},"arch":"amd64"},"subscriptions":["switches","entity:agent1"],"last_seen":1565642211,"deregister":false,"deregistration":{},"user":"agent","redact":["password","passwd","pass","api_key","api_token","access_key","secret_key","private_key","secret"],"metadata":{"name":"agent1","namespace":"default"}},"check":{"command":"exit 1","handlers":[],"high_flap_threshold":0,"interval":60,"low_flap_threshold":0,"publish":true,"runtime_assets":null,"subscriptions":["switches"],"proxy_entity_name":"","check_hooks":null,"stdin":false,"subdue":null,"ttl":120,"timeout":5,"round_robin":false,"duration":0.001616361,"executed":1565642246,"history":null,"issued":1565642246,"output":"","status":1,"total_state_change":0,"last_ok":0,"occurrences":0,"occurrences_watermark":0,"output_metric_format":"","output_metric_handlers":null,"env_vars":null,"metadata":{"name":"silenced_test","namespace":"default"}},"metadata":{"namespace":"default"}},"level":"debug","msg":"handling expireOnResolve entries","time":"2019-08-12T20:37:26Z"}
backend1_1  | {"component":"eventd","event":{"timestamp":1565642246,"entity":{"entity_class":"agent","system":{"hostname":"agent1","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.4","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"eth0","mac":"02:42:ac:15:00:03","addresses":["172.21.0.3/16"]}]},"arch":"amd64"},"subscriptions":["switches","entity:agent1"],"last_seen":1565642211,"deregister":false,"deregistration":{},"user":"agent","redact":["password","passwd","pass","api_key","api_token","access_key","secret_key","private_key","secret"],"metadata":{"name":"agent1","namespace":"default"}},"check":{"command":"exit 1","handlers":[],"high_flap_threshold":0,"interval":60,"low_flap_threshold":0,"publish":true,"runtime_assets":null,"subscriptions":["switches"],"proxy_entity_name":"","check_hooks":null,"stdin":false,"subdue":null,"ttl":120,"timeout":5,"round_robin":false,"duration":0.001616361,"executed":1565642246,"history":null,"issued":1565642246,"output":"","status":1,"total_state_change":0,"last_ok":0,"occurrences":0,"occurrences_watermark":0,"output_metric_format":"","output_metric_handlers":null,"env_vars":null,"metadata":{"name":"silenced_test","namespace":"default"}},"metadata":{"namespace":"default"}},"level":"debug","msg":"hasCheck: true | isResolution: false","time":"2019-08-12T20:37:26Z"}
backend1_1  | {"component":"eventd","level":"debug","msg":"/sensu.io/switchsets/eventd/default/silenced_test/agent1 alive: 120","time":"2019-08-12T20:37:26Z"}
backend1_1  | {"check":"silenced_test","component":"eventd","entity":"agent1","level":"info","msg":"check TTL reset","namespace":"default","previous_status":"dead","status":"alive","time":"2019-08-12T20:37:26Z"}
backend1_1  | {"check_name":"silenced_test","check_namespace":"default","component":"pipelined","entity_name":"agent1","entity_namespace":"default","hooks":null,"level":"debug","msg":"received event","silenced":[],"time":"2019-08-12T20:37:26Z","timestamp":1565642246}

Since this was the first occurrence, history being null might make sense.

I then created a silence and waited for the check to run again so that it was definitely flagged as silenced (confirming that the silence matched the event correctly):

 sensuctl silenced create --check silenced_test --reason testing --expire-on-resolve
Created

 sensuctl silenced list
       Name         Subscription       Check              Begin          Expire   ExpireOnResolve   Creator   Reason    Namespac
 ───────────────── ────────────── ─────────────── ───────────────────── ──────── ───────────────── ───────── ───────── ───────────
  *:silenced_test                  silenced_test   12 Aug 19 16:39 EDT   -1s      true              admin     testing   default

<wait a little bit>

 sensuctl event list   
  Entity       Check                                    Output                                Status   Silenced             Timestamp
 ──────── ─────────────── ────────────────────────────────────────────────────────────────── ──────── ────────── ───────────────────────────────
  agent1   keepalive       Keepalive last sent from agent1 at 2019-08-12 20:36:51 +0000 UTC        0   false      2019-08-12 16:36:51 -0400 EDT
  agent1   silenced_test                                                                           1   true       2019-08-12 16:40:26 -0400 EDT

Then I forced the check into an OK state, and observed that event.IsResolution() in handleExpireOnResolveEntries continued to evaluate to false, because event.check.history was nil instead of containing the expected history (which does appear when running sensuctl event info --format json):

 sensuctl create -f fix_silenced.json # where this is the same as silenced.json but exit 0 instead of exit 1

Logs:

backend1_1  | {"check":"silenced_test","component":"schedulerd","level":"debug","msg":"sending check request","time":"2019-08-12T20:41:26Z","topic":"sensu:check:default:switches"}
backend1_1  | {"component":"agentd","level":"debug","msg":"session - sending message","payload_size":63,"time":"2019-08-12T20:41:26Z"}
agent1_1    | {"component":"agent","content_type":"application/octet-stream","level":"info","msg":"message received","payload_size":63,"time":"2019-08-12T20:41:26Z","type":"check_request"}
agent1_1    | {"component":"agent","level":"info","msg":"scheduling check execution: silenced_test","time":"2019-08-12T20:41:26Z"}
agent1_1    | {"component":"agent","content_type":"application/octet-stream","level":"info","msg":"sending message","payload_size":351,"time":"2019-08-12T20:41:26Z","type":"event"}
agent1_1    | {"component":"agent","level":"info","msg":"sending event","time":"2019-08-12T20:41:26Z"}
backend1_1  | {"component":"eventd","event":{"timestamp":1565642486,"entity":{"entity_class":"agent","system":{"hostname":"agent1","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.4","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"eth0","mac":"02:42:ac:15:00:03","addresses":["172.21.0.3/16"]}]},"arch":"amd64"},"subscriptions":["switches","entity:agent1"],"last_seen":1565642211,"deregister":false,"deregistration":{},"user":"agent","redact":["password","passwd","pass","api_key","api_token","access_key","secret_key","private_key","secret"],"metadata":{"name":"agent1","namespace":"default"}},"check":{"command":"exit 0","handlers":[],"high_flap_threshold":0,"interval":60,"low_flap_threshold":0,"publish":true,"runtime_assets":null,"subscriptions":["switches"],"proxy_entity_name":"","check_hooks":null,"stdin":false,"subdue":null,"ttl":120,"timeout":5,"round_robin":false,"duration":0.00166182,"executed":1565642486,"history":null,"issued":1565642486,"output":"","status":0,"total_state_change":0,"last_ok":0,"occurrences":0,"occurrences_watermark":0,"silenced":["*:silenced_test"],"output_metric_format":"","output_metric_handlers":null,"env_vars":null,"metadata":{"name":"silenced_test","namespace":"default"}},"metadata":{"namespace":"default"}},"level":"debug","msg":"handling expireOnResolve entries","time":"2019-08-12T20:41:26Z"}
backend1_1  | {"component":"eventd","event":{"timestamp":1565642486,"entity":{"entity_class":"agent","system":{"hostname":"agent1","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.4","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"eth0","mac":"02:42:ac:15:00:03","addresses":["172.21.0.3/16"]}]},"arch":"amd64"},"subscriptions":["switches","entity:agent1"],"last_seen":1565642211,"deregister":false,"deregistration":{},"user":"agent","redact":["password","passwd","pass","api_key","api_token","access_key","secret_key","private_key","secret"],"metadata":{"name":"agent1","namespace":"default"}},"check":{"command":"exit 0","handlers":[],"high_flap_threshold":0,"interval":60,"low_flap_threshold":0,"publish":true,"runtime_assets":null,"subscriptions":["switches"],"proxy_entity_name":"","check_hooks":null,"stdin":false,"subdue":null,"ttl":120,"timeout":5,"round_robin":false,"duration":0.00166182,"executed":1565642486,"history":null,"issued":1565642486,"output":"","status":0,"total_state_change":0,"last_ok":0,"occurrences":0,"occurrences_watermark":0,"silenced":["*:silenced_test"],"output_metric_format":"","output_metric_handlers":null,"env_vars":null,"metadata":{"name":"silenced_test","namespace":"default"}},"metadata":{"namespace":"default"}},"level":"debug","msg":"hasCheck: true | isResolution: false","time":"2019-08-12T20:41:26Z"}
backend1_1  | {"component":"eventd","level":"debug","msg":"/sensu.io/switchsets/eventd/default/silenced_test/agent1 alive: 120","time":"2019-08-12T20:41:26Z"}
backend1_1  | {"check_name":"silenced_test","check_namespace":"default","component":"pipelined","entity_name":"agent1","entity_namespace":"default","hooks":null,"level":"debug","msg":"received event","silenced":["*:silenced_test"],"time":"2019-08-12T20:41:26Z","timestamp":1565642486}

All of these log messages are coming from the second log statement here:

func handleExpireOnResolveEntries(ctx context.Context, event *corev2.Event, store store.Store) error {
    l := logger.WithField("event", event)
    l.Debug("handling expireOnResolve entries")
    // Make sure we have a check and that the event is a resolution
    if !event.HasCheck() || !event.IsResolution() {
        l.Debugf("hasCheck: %t | isResolution: %t", event.HasCheck(), event.IsResolution())
        return nil
    }
    ...

because !event.IsResolution() is incorrectly evaluating to true.
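
To illustrate why a nil history defeats this check: IsResolution has to look at the previous history entry to know whether the event just transitioned out of a non-OK state, so with no history there is nothing to compare and it returns false. A simplified standalone sketch of that logic (not the actual corev2 implementation):

package main

import "fmt"

// Simplified stand-ins for corev2.Event / corev2.Check, for illustration only.
type checkHistory struct{ Status uint32 }

type check struct {
    Status  uint32
    History []checkHistory
}

type event struct{ Check *check }

// isResolution approximates what corev2's Event.IsResolution does: the event
// only counts as a resolution if the previous history entry was non-zero and
// the current status is OK. With a nil/empty history the length check fails,
// so the function returns false even for a passing check.
func isResolution(e *event) bool {
    if e.Check == nil {
        return false
    }
    h := e.Check.History
    return len(h) > 1 && h[len(h)-2].Status != 0 && e.Check.Status == 0
}

func main() {
    // The event exactly as eventd sees it before any merge with the store:
    // status 0 but history nil, so no resolution is detected.
    fresh := &event{Check: &check{Status: 0, History: nil}}
    fmt.Println(isResolution(fresh)) // false

    // The same event after merging with the stored history (2 -> 0).
    merged := &event{Check: &check{Status: 0, History: []checkHistory{{2}, {0}}}}
    fmt.Println(isResolution(merged)) // true
}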

I believe the reason history is nil here is that the event being passed to handleExpireOnResolveEntries is the event as received from the agent or API, which has not yet been merged with the existing event in the store. See: https://github.com/sensu/sensu-go/blob/88f30484eb2f68eeb151e53baaf68bd2ac095ac0/backend/eventd/eventd.go#L222-L230

The event is merged with existing events in the store by e.eventStore.UpdateEvent(ctx, event).

The easiest way to fix this might be to just move the call to handleExpireOnResolveEntries to after the call to e.eventStore.UpdateEvent(ctx, event).
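
A rough, self-contained sketch of that ordering with hypothetical stand-in types (the real calls are e.eventStore.UpdateEvent and handleExpireOnResolveEntries; only the order of operations is the point here):

package main

import (
    "context"
    "fmt"
)

// Hypothetical stand-ins, only to illustrate the suggested ordering; the real
// types and signatures live in sensu-go's eventd and store packages.
type event struct{ history []int }

// updateEvent plays the role of e.eventStore.UpdateEvent: merging the incoming
// event with the stored one is what populates its check history.
func updateEvent(ctx context.Context, e *event) error {
    e.history = append(e.history, 2, 0) // pretend the stored history ended 2 -> 0
    return nil
}

// expireOnResolve plays the role of handleExpireOnResolveEntries: it can only
// detect a resolution if the history has already been merged in.
func expireOnResolve(ctx context.Context, e *event) error {
    fmt.Println("history seen by expire_on_resolve handling:", e.history)
    return nil
}

func main() {
    ctx := context.Background()
    e := &event{} // as received from the agent: no history yet

    // Suggested order: merge with the store first, then evaluate silences.
    if err := updateEvent(ctx, e); err != nil {
        panic(err)
    }
    if err := expireOnResolve(ctx, e); err != nil {
        panic(err)
    }
}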

I still think that Sensu should (at least periodically) clean up any expire_on_resolve silenced entries that correspond to passing events, because it is possible that the deletion here will fail, and an open expire_on_resolve silence for a healthy check is unexpected and dangerous. Perhaps that should be a separate feature request from this one, though, because it requires some thought: checking the store for silences on every received passing check may have performance implications.

nikkictl commented 5 years ago

Thanks for explaining the issue and potential solution in detail @roganartu! I'm able to reproduce the issue with the following resources and by manually resolving the event with sensuctl (sensuctl event resolve [entity] [check]). I'll take everything you've listed into consideration with my solution 😄

type: Namespace
api_version: core/v2
metadata:
  name: silence
spec:
  name: silence
---
type: CheckConfig
api_version: core/v2
metadata:
  name: expire-on-resolve
  namespace: silence
spec:
  check_hooks: null
  command: "false"
  env_vars: null
  handlers: []
  high_flap_threshold: 0
  interval: 60
  low_flap_threshold: 0
  output_metric_format: ""
  output_metric_handlers: null
  proxy_entity_name: ""
  publish: true
  round_robin: false
  runtime_assets: null
  stdin: false
  subdue: null
  subscriptions:
  - silence
  timeout: 0
  ttl: 0
---
type: Silenced
api_version: core/v2
metadata:
  name: '*:expire-on-resolve'
  namespace: silence
spec:
  begin: 1565897528
  check: expire-on-resolve
  creator: admin
  expire: -1
  expire_on_resolve: true
  reason: test