sensu / sensu-go-fatigue-check-filter

An event filter for Sensu Go for managing alert fatigue
MIT License

incorrect notification interval #13

Closed jeremyj closed 4 years ago

jeremyj commented 4 years ago

Hello,

sensu-go-agent 5.16.1-8521
sensu-go-backend 5.16.1-8521
fatigue-check-filter //assets.bonsai.sensu.io/.../sensu-go-fatigue-check-filter_0.3.2.tar.gz

I have a check configured as follows:

type: CheckConfig
api_version: core/v2
metadata:
  annotations:
    fatigue_check/allow_resolution: "true"
    fatigue_check/interval: "3600"
    fatigue_check/occurrences: "1"
  name: check-supervisor
  namespace: default
spec:
  check_hooks: null
  command: /opt/sensu-plugins-ruby/embedded/bin/check-supervisor.rb
  env_vars: null
  handlers:
  - slack
  high_flap_threshold: 0
  interval: 60
  low_flap_threshold: 0
  output_metric_format: ""
  output_metric_handlers: null
  proxy_entity_name: ""
  publish: true
  round_robin: true
  runtime_assets: null
  stdin: false
  subdue: null
  subscriptions:
  - supervisor
  timeout: 0
  ttl: 0

This is the slack handler:

type: Handler
api_version: core/v2
metadata:
  name: slack
  namespace: default
spec:
  command: sensu-slack-handler --channel '#monitoring'
  env_vars:
  - SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxxxxx
  filters:
  - is_incident
  - fatigue_check
  handlers: null
  runtime_assets:
  - sensu-slack-handler
  timeout: 0
  type: pipe

The filter:

type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_check
  namespace: default
spec:
  action: allow
  expressions:
  - fatigue_check(event)
  runtime_assets:
  - fatigue-check-filter
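
For reference, my mental model of what the filter does with these annotations (an assumption based on the README's description, not the filter's actual source) is roughly this:

```python
def fatigue_allows(occurrences, check_interval, fatigue_occurrences, fatigue_interval):
    """Rough model (an assumption, not the filter's actual code) of when
    the fatigue_check filter lets a failing event through."""
    if occurrences < fatigue_occurrences:
        return False  # not enough consecutive failures yet
    if occurrences == fatigue_occurrences:
        return True   # the first qualifying occurrence always alerts
    # afterwards, alert once per fatigue interval's worth of check runs
    checks_per_interval = fatigue_interval // check_interval
    return (occurrences - fatigue_occurrences) % checks_per_interval == 0

# With my check: interval 60s, fatigue_check/interval 3600s,
# fatigue_check/occurrences 1 -> one alert roughly every 60 check runs.
allowed = [n for n in range(1, 181) if fatigue_allows(n, 60, 1, 3600)]
print(allowed)  # -> [1, 61, 121]
```

So with a 60-second check interval and a 3600-second fatigue interval, I'd expect an alert on the first failure and then one per hour.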

The entity:

type: Entity
api_version: core/v2
metadata:
  labels:
    mqtt_auth_params: -a /etc/mosquitto/ca_certificates/cacert.pem -u xxx -p xxx
      -n
    mqtt_host: 192.168.12.71
  name: edi
  namespace: default
spec:
  deregister: false
  deregistration: {}
  entity_class: agent
  last_seen: 1578919497
  redact:
  - password
  - passwd
  - pass
  - api_key
  - api_token
  - access_key
  - secret_key
  - private_key
  - secret
  sensu_agent_version: 5.16.1
  subscriptions:
  - system
  - mqtt
  - supervisor
  - teltonika-mqtt
  - fail-test
  - entity:edi
  system:
    arch: amd64
    hostname: edi
    network:
      interfaces:
      - addresses:
        - 127.0.0.1/8
        - ::1/128
        name: lo
      - addresses:
        - fe80::21e:c9ff:fed9:62a3/64
        mac: 00:1e:c9:d9:62:a3
        name: eno1
      - addresses: null
        mac: 00:1e:c9:d9:62:a4
        name: eno2
      - addresses:
        - 192.168.12.71/24
        - fe80::21e:c9ff:fed9:62a3/64
        mac: 00:1e:c9:d9:62:a3
        name: br0
      - addresses:
        - 10.0.3.1/24
        mac: 00:16:3e:00:00:00
        name: lxcbr0
      - addresses:
        - 10.200.200.1/24
        name: wg0
      - addresses:
        - 172.11.0.1/24
        - fe80::9233:dea2:a93e:6952/64
        name: tun0
      - addresses:
        - fe80::fc09:63ff:febd:d061/64
        mac: fe:09:63:bd:d0:61
        name: vethMHFW4P
      - addresses:
        - fe80::fc06:45ff:fe37:e3ca/64
        mac: fe:06:45:37:e3:ca
        name: vethIIEVDU
      - addresses:
        - fe80::fc4c:67ff:fe94:a802/64
        mac: fe:4c:67:94:a8:02
        name: vethVPM60G
      - addresses:
        - fe80::fce0:16ff:fecb:cd2e/64
        mac: fe:e0:16:cb:cd:2e
        name: vethE6205T
      - addresses:
        - fe80::fc8b:7dff:fe72:917d/64
        mac: fe:8b:7d:72:91:7d
        name: vethWPO1HU
      - addresses:
        - fe80::fc04:a1ff:fe25:3e08/64
        mac: fe:04:a1:25:3e:08
        name: veth2BC8IP
      - addresses:
        - fe80::fc8c:30ff:fef7:ea11/64
        mac: fe:8c:30:f7:ea:11
        name: veth1VGWS5
    os: linux
    platform: ubuntu
    platform_family: debian
    platform_version: "18.04"
  user: agent

Given the fatigue_check/interval of 3600 seconds, I should be getting an alert every 60 minutes, but instead I'm getting a notification every 120 minutes.

What else should I check to find out where the problem is? Thanks

jeremyj commented 4 years ago

OK, since I had been experimenting with various notification filters, I reset my backend (rm -rf /var/lib/sensu/sensu-backend/etcd/) and re-imported the above configurations. Now I am getting a notification every 120s and am very confused. Can anyone point me in the right direction?

nixwiz commented 4 years ago

Hi @jeremyj, I tried reproducing this issue, but I am seeing notifications happen as expected. Here is what I tried.

I created a debug handler that simply drops the event JSON into a file for any notification that makes it past the filter(s).

type: Handler
api_version: core/v2
metadata:
  name: debug
  namespace: default
spec:
  command: jq . >> /tmp/events.out
  env_vars: null
  filters:
  - is_incident
  - not_silenced
  - fatigue_check
  handlers: null
  runtime_assets: null
  timeout: 0
  type: pipe
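
Any pipe command works here, since Sensu passes the event JSON to the handler on stdin. For anyone without jq installed, a hypothetical Python equivalent of the same debug handler might look like this (a sketch, not something from this repo):

```python
#!/usr/bin/env python3
"""Minimal debug pipe handler sketch: append the event JSON received on
stdin to a file, pretty-printed (same spirit as `jq . >> /tmp/events.out`)."""
import json
import sys

def append_event(event_json, out_path):
    """Parse one event payload and append it to the debug log file."""
    event = json.loads(event_json)
    with open(out_path, "a") as f:
        json.dump(event, f, indent=2)
        f.write("\n")
    return event

if __name__ == "__main__":
    # Sensu invokes pipe handlers with the full event JSON on stdin.
    append_event(sys.stdin.read(), "/tmp/events.out")
```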

I assigned this handler to a check that I could easily inject failures into:

type: CheckConfig
api_version: core/v2
metadata:
  name: http
  namespace: default
  annotations:
    fatigue_check/occurrences: "1"
    fatigue_check/interval: "60"
    fatigue_check/allow_resolution: "true"
spec:
  check_hooks: null
  command: check-http.rb -u http://agent -q 'Welcome to CentOS'
  env_vars: null
  handlers:
  - debug
  high_flap_threshold: 0
  interval: 10
  low_flap_threshold: 0
  output_metric_format: nagios_perfdata
  output_metric_handlers: null
  proxy_entity_name: ""
  publish: true
  round_robin: false
  runtime_assets:
  - sensu/sensu-ruby-runtime
  - sensu-plugins/sensu-plugins-http
  stdin: false
  subdue: null
  subscriptions:
  - linux
  timeout: 10
  ttl: 0

I then injected a failure. Based on the above configuration, I should see events in /tmp/events.out that align with the first occurrence, then one occurrence per 60-second interval, and finally a resolution event. That is exactly what I saw when running the following against the debug output:

$ jq '{timestamp: .timestamp, state: .check.state, interval: .check.interval, watermark: .check.occurrences_watermark}' /tmp/events.out
{
  "timestamp": 1578951985,
  "state": "failing",
  "interval": 10,
  "watermark": 1
}
{
  "timestamp": 1578952035,
  "state": "failing",
  "interval": 10,
  "watermark": 6
}
{
  "timestamp": 1578952095,
  "state": "failing",
  "interval": 10,
  "watermark": 12
}
{
  "timestamp": 1578952155,
  "state": "failing",
  "interval": 10,
  "watermark": 18
}
{
  "timestamp": 1578952165,
  "state": "passing",
  "interval": 10,
  "watermark": 18
}

Can you try a similar debug handler configuration for your check?

jeremyj commented 4 years ago

Hi @nixwiz, thanks for your answer. It turns out I was running this check against 2 entities with the check's round_robin value set to true, so each entity only ran the check once for every two scheduled executions, and I was notified at half the expected rate.
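
For anyone hitting the same thing, the arithmetic behind this: with round_robin: true, scheduled executions are distributed across the subscribed entities, so each entity accumulates occurrences more slowly than the configured interval suggests. A small sketch (hypothetical helper name, not part of Sensu):

```python
def effective_check_interval(check_interval, round_robin, entity_count):
    """Interval at which any single entity actually runs a round-robin check.
    Hypothetical helper for illustration; with round_robin, each scheduled
    execution goes to only one of the subscribed entities in turn."""
    return check_interval * entity_count if round_robin else check_interval

# check-supervisor: interval 60s, round_robin: true, 2 entities subscribed
# -> each entity is checked every 120s, so the per-entity fatigue window
# stretches accordingly and notifications arrive half as often as expected.
print(effective_check_interval(60, True, 2))  # -> 120
```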