Proxy checks are scheduled for deleted entities

roganartu commented 5 years ago

A race condition exists in the scheduling of proxy checks, causing them to sometimes be scheduled for entities that have been deleted.

Expected Behavior

Checks should not be scheduled for an entity after it has been deleted.

Current Behavior

Checks are sometimes scheduled for deleted entities. How often this occurs grows with the interval length, the number of entities, and the splay percentage used. It can theoretically occur even with a splay percentage of 0%, but it is far easier to observe with a long interval and a high splay percentage.

Possible Solution

The root cause is that Sensu operates off a cached view of entities when scheduling proxy checks: https://github.com/sensu/sensu-go/blob/27ad8a3db009abfee339de2ad649a7f74266291f/backend/schedulerd/executor.go#L248-L257

The time.Sleep(splay) is why this is easier to observe with a higher splay percentage.

Perhaps an easy broad solution is to drop events (maybe with a log message) if they are being submitted by an agent from a scheduled execution and the entity does not exist. I'm not sure what field(s) would differentiate these events from ones submitted via the agent POST /events endpoint, for which one might want a new entity created.

Another approach that doesn't remove the race entirely would be to check whether the entity still exists before pushing the check request into the message queue. https://github.com/sensu/sensu-go/blob/27ad8a3db009abfee339de2ad649a7f74266291f/backend/schedulerd/executor.go#L58-L84

Even with proper resource locking on the entity, this can still occur, which is why I suggested the broad event-dropping mitigation above. Consider the following order of events:

1. Sensu pushes check onto queue
2. Agent picks up check and begins executing it
3. Entity is deleted in Sensu
4. Agent finishes executing check and submits result to Sensu
5. Sensu re-creates the deleted entity

This is clearly undesirable, but isn't fixed by just addressing the race in schedulerd with a pre-publish resource check or lock.

Steps to Reproduce (for bugs)

1. Create two entities
2. Create a proxy check that matches both entities, with a large splay percentage and an interval long enough to allow you to perform a manual action (something like 1-2 mins should suffice). This bug will occur no matter what splay percentage or interval you choose, but it is easier to demonstrate with large values because they widen the race window
3. Watch the (debug level) backend logs for "component": schedulerd entries
4. When schedulerd schedules a check for one of the entities, delete the other one with sensuctl entity delete <name>
5. Wait until the scheduler schedules the check for the deleted entity
6. See that the entity has been re-created with sensuctl entity info <name>, along with an event for it

Context

With this bug, entities need to be deleted twice (once before the check is erroneously scheduled, and then again after the result from that scheduled check is received) in order to actually be deleted. Additionally, it will cause entities to trigger TTL expiries.

Your Environment

Sensu version used (sensuctl, sensu-backend, and/or sensu-agent):

$ sensu-backend version
sensu-backend version 5.11.0#31a0d7c, build 31a0d7c491bf3f4b861c3e203d0eeddeec204e88, built 2019-07-10T15:35:00Z

Installation method (packages, binaries, docker etc.): Installed as standalone binary extracted from .deb package
Operating System and version (e.g. Ubuntu 14.04): Debian Stretch

palourde commented 5 years ago

I was able to reproduce this bug. Here are two basic entities and a proxy check that can be used to reproduce it:

type: CheckConfig
api_version: core/v2
metadata:
  name: proxy-check-echo
  namespace: default
spec:
  command: echo pong
  interval: 120
  proxy_requests:
    entity_attributes:
    - entity.entity_class == 'proxy'
    splay: true
    splay_coverage: 90
  publish: true
  subscriptions:
  - entity:whisky
  timeout: 10
---
type: Entity
api_version: core/v2
metadata:
  name: switch1
  namespace: default
spec:
  entity_class: proxy
  subscriptions: null
---
type: Entity
api_version: core/v2
metadata:
  name: switch2
  namespace: default
spec:
  entity_class: proxy
  subscriptions: null

roganartu commented 5 years ago

fwiw I'm currently testing a fix for this and will push it shortly for discussion. It's turned out to be annoyingly complex, mostly because agents are quite overbearing in setting the entity in events they process, which prevents the backend from differentiating between all the different types of events an agent may send.

echlebek commented 4 years ago

We should investigate if this is still an issue, as the entity caching layer has been reworked since this was filed.
