roganartu opened 5 years ago
I was able to reproduce this bug; here are two basic entities and a proxy check that can be used to reproduce it:
```yaml
type: CheckConfig
api_version: core/v2
metadata:
  name: proxy-check-echo
  namespace: default
spec:
  command: echo pong
  interval: 120
  proxy_requests:
    entity_attributes:
      - entity.entity_class == 'proxy'
    splay: true
    splay_coverage: 90
  publish: true
  subscriptions:
    - entity:whisky
  timeout: 10
---
type: Entity
api_version: core/v2
metadata:
  name: switch1
  namespace: default
spec:
  entity_class: proxy
  subscriptions: null
---
type: Entity
api_version: core/v2
metadata:
  name: switch2
  namespace: default
spec:
  entity_class: proxy
  subscriptions: null
```
fwiw I'm currently testing a fix for this and will push it shortly for discussion. It's turned out to be annoyingly complex, mostly because agents are quite overbearing in setting the entity on the events they process, which prevents the backend from differentiating between all the different types of events an agent may send.
We should investigate if this is still an issue, as the entity caching layer has been reworked since this was filed.
A race condition exists in the scheduling of proxy checks, causing them to sometimes be scheduled for entities that have been deleted.
Expected Behavior
Checks should not be scheduled for an entity after it has been deleted.
Current Behavior
Checks are sometimes scheduled for deleted entities. How often this occurs scales with the interval length, the number of entities, and the splay percentage used. It can theoretically occur even with a splay percentage of 0%, but it is far easier to observe with a long interval and a high splay percentage.
Possible Solution
The root cause is that Sensu operates off a cached view of entities when scheduling proxy checks: https://github.com/sensu/sensu-go/blob/27ad8a3db009abfee339de2ad649a7f74266291f/backend/schedulerd/executor.go#L248-L257
The `time.Sleep(splay)` is why this is easier to observe with a higher splay percentage.

Perhaps an easy broad solution is to drop events (maybe with a log message) if they are submitted by an agent from a scheduled execution and the entity does not exist. I'm not sure what field(s) would differentiate these events from ones submitted via the agent's `POST /events` endpoint, for which one might want a new entity created.

Another approach, which doesn't remove the race entirely, would be to check whether the entity still exists before pushing the check request onto the message queue: https://github.com/sensu/sensu-go/blob/27ad8a3db009abfee339de2ad649a7f74266291f/backend/schedulerd/executor.go#L58-L84
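That second approach could look roughly like the sketch below, with a hypothetical `EntityStore` interface standing in for sensu-go's real store (names and signatures are assumptions, not the actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// EntityStore is a stand-in for sensu-go's entity store.
type EntityStore interface {
	GetEntity(namespace, name string) (bool, error)
}

var errEntityDeleted = errors.New("entity no longer exists, dropping check request")

// publishProxyCheckRequest re-checks that the proxy entity still exists
// immediately before pushing the check request onto the message bus.
// This narrows, but does not eliminate, the race: the entity can still
// be deleted between this lookup and the agent executing the check.
func publishProxyCheckRequest(store EntityStore, namespace, entity string, publish func() error) error {
	exists, err := store.GetEntity(namespace, entity)
	if err != nil {
		return err
	}
	if !exists {
		return errEntityDeleted
	}
	return publish()
}

// memStore is a toy in-memory store for demonstration.
type memStore map[string]bool

func (m memStore) GetEntity(_, name string) (bool, error) { return m[name], nil }

func main() {
	store := memStore{"switch1": true} // switch2 was deleted
	for _, e := range []string{"switch1", "switch2"} {
		err := publishProxyCheckRequest(store, "default", e, func() error {
			fmt.Println("published check request for", e)
			return nil
		})
		if err != nil {
			fmt.Println("skipped", e+":", err)
		}
	}
}
```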
Even with proper resource-locking on the entity above this can still occur anyway, which is why I suggested the broad event-dropping mitigation above. Consider the following order of events:
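That ordering can be sketched as a simplified, sequential re-enactment (inferred from the rest of the report; all names are illustrative, and nothing here is sensu-go code):

```go
package main

import "fmt"

// simulate re-enacts the suspected interleaving and reports whether the
// deleted entity exists again afterwards.
func simulate() bool {
	entities := map[string]bool{"switch1": true} // backend's entity store
	cached := []string{"switch1"}                // schedulerd's cached view

	// 1. schedulerd snapshots its entity cache and begins the splay sleep.
	// 2. During the sleep, the operator deletes the entity.
	delete(entities, "switch1")

	// 3. The sleep elapses; the stale cache still lists the entity, so a
	//    check request is published and an agent executes it.
	for _, e := range cached {
		// 4. The resulting event re-creates the deleted entity.
		entities[e] = true
	}
	return entities["switch1"]
}

func main() {
	fmt.Println("entity exists again after deletion:", simulate())
}
```

Even a store-level lock around step 3 cannot help once the check request has left the backend, since step 4 happens when the agent's event comes back.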
This is clearly undesirable, but isn't fixed by just addressing the race in `schedulerd` with a pre-publish resource check or lock.

Steps to Reproduce (for bugs)
"component": schedulerd
entriessensuctl entity delete <name>
sensuctl entity info <name>
, along with an event for itContext
With this bug, entities need to be deleted twice (once before the check is erroneously scheduled, and then again after the result from that scheduled check is received) in order to actually be deleted. Additionally, it can cause deleted entities to trigger TTL expiries.
Your Environment

- Installed via the `.deb` package