sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License
1.02k stars 175 forks source link

Proxy checks are scheduled for deleted entities #3178

Open roganartu opened 5 years ago

roganartu commented 5 years ago

A race condition exists in the scheduling of proxy checks, causing them to sometimes be scheduled for entities that have been deleted.

Expected Behavior

Checks should not be scheduled for an entity after it has been deleted.

Current Behavior

Checks are sometimes scheduled for deleted entities. The frequency this occurs is proportional to the interval length, number of entities, and the splay percentage used. It can theoretically occur even with a splay percentage of 0%, but it is far easier to observe with a long interval and high splay percentage.

Possible Solution

The root cause is that Sensu operates off a cached view of entities when scheduling proxy checks: https://github.com/sensu/sensu-go/blob/27ad8a3db009abfee339de2ad649a7f74266291f/backend/schedulerd/executor.go#L248-L257

The time.Sleep(splay) is why this is easier to observe with a higher splay percentage.

Perhaps an easy broad solution is to drop events (maybe with a log message) if they are being submitted by an agent from a scheduled execution and the entity does not exist. I'm not sure what field(s) would differentiate these events from ones submitted via the agent POST /events endpoint, for which one might want a new entity created.

Another approach that doesn't remove the race entirely would be to check whether the entity still exists before pushing the check request into the message queue. https://github.com/sensu/sensu-go/blob/27ad8a3db009abfee339de2ad649a7f74266291f/backend/schedulerd/executor.go#L58-L84

Even with proper resource-locking on the entity above this can still occur anyway, which is why I suggested the broad event-dropping mitigation above. Consider the following order of events:

  1. Sensu pushes check onto queue
  2. Agent picks up check and begins executing it
  3. Entity is deleted in Sensu
  4. Agent finishes executing check and submits result to Sensu
  5. Sensu re-creates the deleted entity

This is clearly undesirable, but isn't fixed by just addressing the race in schedulerd with a pre-publish resource check or lock.

Steps to Reproduce (for bugs)

  1. Create two entities
  2. Create a proxy check that matches both entities with a large splay percentage and an interval long enough to allow you to perform a manual action (something like 1-2 mins should suffice). This bug will occur no matter what splay percentage or interval you choose, but it is easier to demonstrate with large values as it exacerbates the race condition
  3. Watch the (debug level) backend logs for "component": schedulerd entries
  4. When schedulerd schedules a check for one of the entities, delete the other one with sensuctl entity delete <name>
  5. Wait until the scheduler schedules the check for the deleted entity
  6. See that the entity has been re-created with sensuctl entity info <name>, along with an event for it

Context

With this bug, entities need to be deleted twice (once before the check is erroneously scheduled, and then again after the result from that scheduled check is received) in order to actually be deleted. Additionally, it will cause entities to trigger TTL expiries

Your Environment

palourde commented 5 years ago

I was able to reproduce this bug, here's two basic entities and a proxy check that can be used for reproducing it:

type: CheckConfig
api_version: core/v2
metadata:
  name: proxy-check-echo
  namespace: default
spec:
  command: echo pong
  interval: 120
  proxy_requests:
    entity_attributes:
    - entity.entity_class == 'proxy'
    splay: true
    splay_coverage: 90
  publish: true
  subscriptions:
  - entity:whisky
  timeout: 10
---
type: Entity
api_version: core/v2
metadata:
  name: switch1
  namespace: default
spec:
  entity_class: proxy
  subscriptions: null
---
type: Entity
api_version: core/v2
metadata:
  name: switch2
  namespace: default
spec:
  entity_class: proxy
  subscriptions: null
roganartu commented 5 years ago

fwiw I'm currently testing a fix for this and will push it shortly for discussion. It's turned out to be annoyingly complex mostly due to the fact that agents are quite overbearing in setting the entity in events they process, which prevents the backed from differentiating between all the different types of events an agent may send.

echlebek commented 4 years ago

We should investigate if this is still an issue, as the entity caching layer has been reworked since this was filed.