Closed: roganartu closed this issue 5 years ago.
I no longer think this is a race condition, as it seems to happen for every silence we create with expire_on_resolve set. We currently have 11 events with status: 0 that have an open expire_on_resolve: true silence created before the check resolved.
This makes silences very dangerous in their current state: a check can recover at any time and be left with a lingering silence that suppresses future alerts. This is exacerbated by the fact that you would normally not set an expiry date on a silence that has expire_on_resolve set, so these are effectively permanent alert suppressions unless someone notices the silence didn't clear and deletes it manually.
Steps to reproduce:
1. Add a check that will be in a warn or critical state.
2. Once that check goes to warn or critical, silence the event in the Dashboard, making sure to check the "expire when a matching check resolves" option.
3. Have the check go into an OK state.
4. Observe that the silence entry remains.
I used the following check and set the warn threshold to 5 to force a non-OK event, then pushed it back to 80 to force it into OK.
---
type: CheckConfig
api_version: core/v2
metadata:
  namespace: default
  name: check_disk
spec:
  command: check-disk-usage.rb -w 80 -c 90
  subscriptions:
  - system
  interval: 60
  publish: true
  handlers:
  - slack
  runtime_assets:
  - sensu-plugins-disk-checks
  - sensu-ruby-runtime
---
type: Asset
api_version: core/v2
metadata:
  name: sensu-plugins-disk-checks
  namespace: default
spec:
  url: https://github.com/sensu-plugins/sensu-plugins-disk-checks/releases/download/5.0.0/sensu-plugins-disk-checks_5.0.0_centos_linux_amd64.tar.gz
  sha512: 0ce9d52b270b77f4cab754e55732ae002228201d0bd01a89b954a0655b88c1ee6546e2f82cfd1eec04689af90ad940cde128e8867912d9e415f4a58d7fdcdadf
---
type: Asset
api_version: core/v2
metadata:
  name: sensu-ruby-runtime
  namespace: default
spec:
  url: https://github.com/sensu/sensu-ruby-runtime/releases/download/0.0.4/sensu-ruby-runtime_0.0.4_centos_linux_amd64.tar.gz
  sha512: 1c9f0aff8f7f7dfcf07eb75f48c3b7ad6709f2bd68f2287b4bd07979e6fe12c2ab69d1ecf5d4b9b9ed7b96cd4cda5e55c116ea76ce3d9db9ff74538f0ea2317a
So the problem is that event history is null when handleExpireOnResolveEntries is called, no matter what the history actually is.
I added a bunch of debug logging to that method and then spun up a backend in docker and ran the following:
$ cat silenced.json
{
  "type": "CheckConfig",
  "api_version": "core/v2",
  "metadata": {
    "name": "silenced_test",
    "namespace": "default"
  },
  "spec": {
    "command": "exit 1",
    "handlers": [],
    "interval": 60,
    "publish": true,
    "subscriptions": [
      "switches"
    ],
    "timeout": 5,
    "ttl": 120
  }
}
$ sensuctl create -f silenced.json
Created
Then I waited for the check to fail:
sensuctl event list
Entity Check Output Status Silenced Timestamp
──────── ─────────────── ────────────────────────────────────────────────────────────────── ──────── ────────── ───────────────────────────────
agent1 keepalive Keepalive last sent from agent1 at 2019-08-12 20:36:51 +0000 UTC 0 false 2019-08-12 16:36:51 -0400 EDT
agent1 silenced_test 1 false 2019-08-12 16:38:26 -0400 EDT
Observing the following in the logs:
backend1_1 | {"component":"schedulerd","level":"debug","msg":"check is not subdued","name":"silenced_test","namespace":"default","scheduler_type":"interval","time":"2019-08-12T20:37:26Z"}
backend1_1 | {"check":"silenced_test","component":"schedulerd","level":"debug","msg":"sending check request","time":"2019-08-12T20:37:26Z","topic":"sensu:check:default:switches"}
agent1_1 | {"component":"agent","content_type":"application/octet-stream","level":"info","msg":"message received","payload_size":63,"time":"2019-08-12T20:37:26Z","type":"check_request"}
agent1_1 | {"component":"agent","level":"info","msg":"scheduling check execution: silenced_test","time":"2019-08-12T20:37:26Z"}
backend1_1 | {"component":"agentd","level":"debug","msg":"session - sending message","payload_size":63,"time":"2019-08-12T20:37:26Z"}
agent1_1 | {"component":"agent","content_type":"application/octet-stream","level":"info","msg":"sending message","payload_size":354,"time":"2019-08-12T20:37:26Z","type":"event"}
agent1_1 | {"component":"agent","level":"info","msg":"sending event","time":"2019-08-12T20:37:26Z"}
backend1_1 | {"component":"eventd","event":{"timestamp":1565642246,"entity":{"entity_class":"agent","system":{"hostname":"agent1","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.4","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"eth0","mac":"02:42:ac:15:00:03","addresses":["172.21.0.3/16"]}]},"arch":"amd64"},"subscriptions":["switches","entity:agent1"],"last_seen":1565642211,"deregister":false,"deregistration":{},"user":"agent","redact":["password","passwd","pass","api_key","api_token","access_key","secret_key","private_key","secret"],"metadata":{"name":"agent1","namespace":"default"}},"check":{"command":"exit 1","handlers":[],"high_flap_threshold":0,"interval":60,"low_flap_threshold":0,"publish":true,"runtime_assets":null,"subscriptions":["switches"],"proxy_entity_name":"","check_hooks":null,"stdin":false,"subdue":null,"ttl":120,"timeout":5,"round_robin":false,"duration":0.001616361,"executed":1565642246,"history":null,"issued":1565642246,"output":"","status":1,"total_state_change":0,"last_ok":0,"occurrences":0,"occurrences_watermark":0,"output_metric_format":"","output_metric_handlers":null,"env_vars":null,"metadata":{"name":"silenced_test","namespace":"default"}},"metadata":{"namespace":"default"}},"level":"debug","msg":"handling expireOnResolve entries","time":"2019-08-12T20:37:26Z"}
backend1_1 | {"component":"eventd","event":{"timestamp":1565642246,"entity":{"entity_class":"agent","system":{"hostname":"agent1","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.4","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"eth0","mac":"02:42:ac:15:00:03","addresses":["172.21.0.3/16"]}]},"arch":"amd64"},"subscriptions":["switches","entity:agent1"],"last_seen":1565642211,"deregister":false,"deregistration":{},"user":"agent","redact":["password","passwd","pass","api_key","api_token","access_key","secret_key","private_key","secret"],"metadata":{"name":"agent1","namespace":"default"}},"check":{"command":"exit 1","handlers":[],"high_flap_threshold":0,"interval":60,"low_flap_threshold":0,"publish":true,"runtime_assets":null,"subscriptions":["switches"],"proxy_entity_name":"","check_hooks":null,"stdin":false,"subdue":null,"ttl":120,"timeout":5,"round_robin":false,"duration":0.001616361,"executed":1565642246,"history":null,"issued":1565642246,"output":"","status":1,"total_state_change":0,"last_ok":0,"occurrences":0,"occurrences_watermark":0,"output_metric_format":"","output_metric_handlers":null,"env_vars":null,"metadata":{"name":"silenced_test","namespace":"default"}},"metadata":{"namespace":"default"}},"level":"debug","msg":"hasCheck: true | isResolution: false","time":"2019-08-12T20:37:26Z"}
backend1_1 | {"component":"eventd","level":"debug","msg":"/sensu.io/switchsets/eventd/default/silenced_test/agent1 alive: 120","time":"2019-08-12T20:37:26Z"}
backend1_1 | {"check":"silenced_test","component":"eventd","entity":"agent1","level":"info","msg":"check TTL reset","namespace":"default","previous_status":"dead","status":"alive","time":"2019-08-12T20:37:26Z"}
backend1_1 | {"check_name":"silenced_test","check_namespace":"default","component":"pipelined","entity_name":"agent1","entity_namespace":"default","hooks":null,"level":"debug","msg":"received event","silenced":[],"time":"2019-08-12T20:37:26Z","timestamp":1565642246}
Since this was the first occurrence, history being null might make sense.
I then created a silence and waited for the check to run again so that it was definitely flagged as silenced (confirming that the silence matched the event correctly):
sensuctl silenced create --check silenced_test --reason testing --expire-on-resolve
Created
sensuctl silenced list
Name Subscription Check Begin Expire ExpireOnResolve Creator Reason Namespace
───────────────── ────────────── ─────────────── ───────────────────── ──────── ───────────────── ───────── ───────── ───────────
*:silenced_test silenced_test 12 Aug 19 16:39 EDT -1s true admin testing default
<wait a little bit>
sensuctl event list
Entity Check Output Status Silenced Timestamp
──────── ─────────────── ────────────────────────────────────────────────────────────────── ──────── ────────── ───────────────────────────────
agent1 keepalive Keepalive last sent from agent1 at 2019-08-12 20:36:51 +0000 UTC 0 false 2019-08-12 16:36:51 -0400 EDT
agent1 silenced_test 1 true 2019-08-12 16:40:26 -0400 EDT
Then I forced the check into an OK state, and observed that event.IsResolution() in handleExpireOnResolveEntries continued to evaluate to false, because event.check.history was nil instead of containing the expected history (which does appear when running sensuctl event info --format json):
sensuctl create -f fix_silenced.json # where this is the same as silenced.json but with exit 0 instead of exit 1
Logs:
backend1_1 | {"check":"silenced_test","component":"schedulerd","level":"debug","msg":"sending check request","time":"2019-08-12T20:41:26Z","topic":"sensu:check:default:switches"}
backend1_1 | {"component":"agentd","level":"debug","msg":"session - sending message","payload_size":63,"time":"2019-08-12T20:41:26Z"}
agent1_1 | {"component":"agent","content_type":"application/octet-stream","level":"info","msg":"message received","payload_size":63,"time":"2019-08-12T20:41:26Z","type":"check_request"}
agent1_1 | {"component":"agent","level":"info","msg":"scheduling check execution: silenced_test","time":"2019-08-12T20:41:26Z"}
agent1_1 | {"component":"agent","content_type":"application/octet-stream","level":"info","msg":"sending message","payload_size":351,"time":"2019-08-12T20:41:26Z","type":"event"}
agent1_1 | {"component":"agent","level":"info","msg":"sending event","time":"2019-08-12T20:41:26Z"}
backend1_1 | {"component":"eventd","event":{"timestamp":1565642486,"entity":{"entity_class":"agent","system":{"hostname":"agent1","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.4","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"eth0","mac":"02:42:ac:15:00:03","addresses":["172.21.0.3/16"]}]},"arch":"amd64"},"subscriptions":["switches","entity:agent1"],"last_seen":1565642211,"deregister":false,"deregistration":{},"user":"agent","redact":["password","passwd","pass","api_key","api_token","access_key","secret_key","private_key","secret"],"metadata":{"name":"agent1","namespace":"default"}},"check":{"command":"exit 0","handlers":[],"high_flap_threshold":0,"interval":60,"low_flap_threshold":0,"publish":true,"runtime_assets":null,"subscriptions":["switches"],"proxy_entity_name":"","check_hooks":null,"stdin":false,"subdue":null,"ttl":120,"timeout":5,"round_robin":false,"duration":0.00166182,"executed":1565642486,"history":null,"issued":1565642486,"output":"","status":0,"total_state_change":0,"last_ok":0,"occurrences":0,"occurrences_watermark":0,"silenced":["*:silenced_test"],"output_metric_format":"","output_metric_handlers":null,"env_vars":null,"metadata":{"name":"silenced_test","namespace":"default"}},"metadata":{"namespace":"default"}},"level":"debug","msg":"handling expireOnResolve entries","time":"2019-08-12T20:41:26Z"}
backend1_1 | {"component":"eventd","event":{"timestamp":1565642486,"entity":{"entity_class":"agent","system":{"hostname":"agent1","os":"linux","platform":"alpine","platform_family":"alpine","platform_version":"3.8.4","network":{"interfaces":[{"name":"lo","addresses":["127.0.0.1/8"]},{"name":"eth0","mac":"02:42:ac:15:00:03","addresses":["172.21.0.3/16"]}]},"arch":"amd64"},"subscriptions":["switches","entity:agent1"],"last_seen":1565642211,"deregister":false,"deregistration":{},"user":"agent","redact":["password","passwd","pass","api_key","api_token","access_key","secret_key","private_key","secret"],"metadata":{"name":"agent1","namespace":"default"}},"check":{"command":"exit 0","handlers":[],"high_flap_threshold":0,"interval":60,"low_flap_threshold":0,"publish":true,"runtime_assets":null,"subscriptions":["switches"],"proxy_entity_name":"","check_hooks":null,"stdin":false,"subdue":null,"ttl":120,"timeout":5,"round_robin":false,"duration":0.00166182,"executed":1565642486,"history":null,"issued":1565642486,"output":"","status":0,"total_state_change":0,"last_ok":0,"occurrences":0,"occurrences_watermark":0,"silenced":["*:silenced_test"],"output_metric_format":"","output_metric_handlers":null,"env_vars":null,"metadata":{"name":"silenced_test","namespace":"default"}},"metadata":{"namespace":"default"}},"level":"debug","msg":"hasCheck: true | isResolution: false","time":"2019-08-12T20:41:26Z"}
backend1_1 | {"component":"eventd","level":"debug","msg":"/sensu.io/switchsets/eventd/default/silenced_test/agent1 alive: 120","time":"2019-08-12T20:41:26Z"}
backend1_1 | {"check_name":"silenced_test","check_namespace":"default","component":"pipelined","entity_name":"agent1","entity_namespace":"default","hooks":null,"level":"debug","msg":"received event","silenced":["*:silenced_test"],"time":"2019-08-12T20:41:26Z","timestamp":1565642486}
All of these log messages are coming from the second log statement here:
func handleExpireOnResolveEntries(ctx context.Context, event *corev2.Event, store store.Store) error {
	l := logger.WithField("event", event)
	l.Debug("handling expireOnResolve entries")
	// Make sure we have a check and that the event is a resolution
	if !event.HasCheck() || !event.IsResolution() {
		l.Debugf("hasCheck: %t | isResolution: %t", event.HasCheck(), event.IsResolution())
		return nil
	}
	...
because !event.IsResolution() is incorrectly evaluating to true.
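The nil-history failure mode is easy to see with a minimal sketch of the resolution check. The types and isResolution function below are simplified stand-ins for the corev2 package, not the actual sensu-go implementation, but they capture the relevant behavior: with no history, an OK event can never be classified as a resolution.

```go
package main

import "fmt"

// CheckHistory is a simplified stand-in for corev2.CheckHistory.
type CheckHistory struct {
	Status uint32
}

// Check is a simplified stand-in for corev2.Check.
type Check struct {
	Status  uint32
	History []CheckHistory
}

// isResolution sketches corev2's Event.IsResolution: the event counts
// as a resolution only when the current status is 0 AND the history
// shows a prior non-zero status. With nil history (as in the logs
// above), it can never return true.
func isResolution(c *Check) bool {
	if len(c.History) < 2 {
		// No prior state to compare against, so this cannot be a
		// DOWN -> OK transition. This is the branch hit in the bug.
		return false
	}
	previous := c.History[len(c.History)-2]
	return c.Status == 0 && previous.Status != 0
}

func main() {
	// The event as received from the agent: status 0 but nil history.
	fresh := &Check{Status: 0, History: nil}
	fmt.Println(isResolution(fresh)) // false, even though the check is OK

	// The same event after merging with the stored history.
	merged := &Check{Status: 0, History: []CheckHistory{{Status: 1}, {Status: 0}}}
	fmt.Println(isResolution(merged)) // true: a genuine DOWN -> OK transition
}
```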
I believe the reason history is nil here is that the event being passed to handleExpireOnResolveEntries is the event as received from the agent or API, which has not yet been merged with the existing event in the store. See:
https://github.com/sensu/sensu-go/blob/88f30484eb2f68eeb151e53baaf68bd2ac095ac0/backend/eventd/eventd.go#L222-L230
The event is merged with existing events in the store by e.eventStore.UpdateEvent(ctx, event).
The easiest way to fix this might be to move the call to handleExpireOnResolveEntries to after the call to e.eventStore.UpdateEvent(ctx, event).
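Why the ordering matters can be shown with a self-contained sketch. Event, store, updateEvent, and expireOnResolve below are hypothetical stand-ins for the sensu-go types and for e.eventStore.UpdateEvent / handleExpireOnResolveEntries; the point is only that checking for a resolution before the merge sees no history, while checking after the merge does.

```go
package main

import "fmt"

// Event is a hypothetical, minimal stand-in for a corev2.Event.
type Event struct {
	Status  uint32
	History []uint32 // prior statuses, oldest first
}

type store struct{ history []uint32 }

// updateEvent mirrors what e.eventStore.UpdateEvent does conceptually:
// merge the incoming event with stored state so its history is populated.
func (s *store) updateEvent(ev *Event) {
	s.history = append(s.history, ev.Status)
	ev.History = s.history
}

func isResolution(ev *Event) bool {
	n := len(ev.History)
	return n >= 2 && ev.Status == 0 && ev.History[n-2] != 0
}

// expireOnResolve stands in for handleExpireOnResolveEntries: it only
// clears the silence when the event is a resolution.
func expireOnResolve(ev *Event, silenced *bool) {
	if isResolution(ev) {
		*silenced = false
	}
}

func main() {
	s := &store{history: []uint32{1}} // stored history: one failing run
	silenced := true

	ev := &Event{Status: 0} // fresh OK event from the agent, nil history

	// Buggy order: check for resolution before merging -> silence lingers.
	expireOnResolve(ev, &silenced)
	fmt.Println(silenced) // true: the silence was not cleared

	// Proposed order: merge first, then check -> silence is cleared.
	s.updateEvent(ev)
	expireOnResolve(ev, &silenced)
	fmt.Println(silenced) // false: the DOWN -> OK transition was visible
}
```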
I still think that Sensu should (at least periodically) clean up any expire_on_resolve silenced entries that correspond to passing events, because the deletion here can fail, and an open expire_on_resolve silence for a healthy check is unexpected and dangerous. Perhaps that should be a feature request separate from this issue, though, because it requires some thought: checking the store for silences on every passing check received may have performance implications.
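A periodic cleanup pass along those lines could look like the sketch below. Silence and reapStaleSilences are hypothetical names for illustration, not sensu-go API; a real implementation would page through the silenced store on a ticker and would need to weigh the scan cost mentioned above.

```go
package main

import "fmt"

// Silence is a hypothetical, minimal stand-in for a corev2.Silenced entry.
type Silence struct {
	Name            string
	Check           string
	ExpireOnResolve bool
}

// reapStaleSilences drops any expire_on_resolve silence whose check is
// currently passing (status 0). In a real backend this would run
// periodically against the store rather than over in-memory slices.
func reapStaleSilences(silences []Silence, status map[string]uint32) []Silence {
	var kept []Silence
	for _, s := range silences {
		if s.ExpireOnResolve && status[s.Check] == 0 {
			continue // stale: the check is healthy but the silence lingered
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	silences := []Silence{
		{Name: "*:silenced_test", Check: "silenced_test", ExpireOnResolve: true},
		{Name: "*:check_disk", Check: "check_disk", ExpireOnResolve: true},
	}
	status := map[string]uint32{"silenced_test": 0, "check_disk": 2}

	for _, s := range reapStaleSilences(silences, status) {
		fmt.Println(s.Name) // only the silence for the still-failing check survives
	}
}
```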
Thanks for explaining the issue and potential solution in detail @roganartu! I'm able to reproduce the issue with the following resources, manually resolving the event with sensuctl (sensuctl event resolve [entity] [check]). I'll take everything you've listed into consideration with my solution 😄
type: Namespace
api_version: core/v2
metadata:
  name: silence
spec:
  name: silence
---
type: CheckConfig
api_version: core/v2
metadata:
  name: expire-on-resolve
  namespace: silence
spec:
  check_hooks: null
  command: "false"
  env_vars: null
  handlers: []
  high_flap_threshold: 0
  interval: 60
  low_flap_threshold: 0
  output_metric_format: ""
  output_metric_handlers: null
  proxy_entity_name: ""
  publish: true
  round_robin: false
  runtime_assets: null
  stdin: false
  subdue: null
  subscriptions:
  - silence
  timeout: 0
  ttl: 0
---
type: Silenced
api_version: core/v2
metadata:
  name: '*:expire-on-resolve'
  namespace: silence
spec:
  begin: 1565897528
  check: expire-on-resolve
  creator: admin
  expire: -1
  expire_on_resolve: true
  reason: test
The expire_on_resolve property of a Silenced entry is intended to cause the silence to be deleted automatically when the event resolves. However, I have encountered an event in an OK state with an open Silence that has expire_on_resolve set to true.

Expected Behavior

A Silence with expire_on_resolve set to true should be deleted when the event status is 0.

Current Behavior

I have an event with a status of 0 that also has an active Silence with expire_on_resolve set to true.
Here are the details. I changed the hostname, check name, and output to placeholders for simplicity.
Note the begin timestamp is several days ago, while the check only recovered earlier today. I would have expected this Silence to have been deleted when the check transitioned from status == 2 to status == 0 in the history above.

Steps to Reproduce (for bugs)

I have not attempted to reproduce this with a new silence and a fake check, so I am unsure whether this is a bug for all Silenced entries or just some kind of race condition.
There are no entries in the logs matching the term silence that aren't from the apid component, so I'm not sure where to start looking.

Possible Solution

I'm not sure why this is occurring from a quick look over the source here:
https://github.com/sensu/sensu-go/blob/27ad8a3db009abfee339de2ad649a7f74266291f/backend/eventd/silenced.go#L54-L80
However, I would personally argue that if it's a transient error, then applying expire_on_resolve only to DOWN -> OK transitions makes it confusing to users and less robust. An expire_on_resolve silence for an OK alert doesn't really make sense to me; I imagine the feature was added with the intent of letting users create an "ack"-like silence that persists only as long as the event is in a down state.
In any case, I think this pipeline probably needs more logging to determine why this is occurring. There doesn't seem to be any logging on the happy path where a silence is successfully deleted, which makes it difficult to determine whether this code path was even triggered as expected.
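The two competing semantics can be stated side by side. Both predicates below are illustrative sketches, not sensu-go source: transitionBased mirrors the current delete-on-DOWN-to-OK-transition behavior, while stateBased is the more robust alternative argued for above, which fires for any OK event.

```go
package main

import "fmt"

// transitionBased sketches the current semantics: delete the silence
// only on a DOWN -> OK transition, which misses OK events whose
// history is missing (as in this bug).
func transitionBased(status uint32, history []uint32) bool {
	n := len(history)
	return status == 0 && n >= 2 && history[n-2] != 0
}

// stateBased sketches the alternative semantics: delete the silence
// whenever the event is OK, regardless of how it got there.
func stateBased(status uint32) bool {
	return status == 0
}

func main() {
	// An OK event with missing history, as observed in the logs above:
	fmt.Println(transitionBased(0, nil)) // false: the silence lingers
	fmt.Println(stateBased(0))           // true: the silence is cleaned up
}
```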
Your Environment