sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License

Proxy device/entity gets recreated after deletion due to ongoing scheduling #3957

Open mcatngena opened 4 years ago

mcatngena commented 4 years ago

Expected Behavior

When a proxy device is removed from the Sensu inventory, it should stay deleted permanently.

Current Behavior

You have a proxy entity with several proxy checks assigned, each with a different scheduling interval (from 60s to a couple of hours) and with the splay feature enabled. When the proxy entity is removed, the most recent scheduling round is still active internally, so its check requests are still executed and their results are still received. When a result arrives and the proxy device no longer exists, the device is simply created again automatically (a well-known concept that works against this use case). I think this is caused by splay.

When you have a check with a 10-minute interval and 90% splay, Sensu has a 9-minute window in which to spread execution across all devices that match the criteria. When the internal scheduler fires (e.g. at 1:00pm), it receives the list of devices that need to be covered, say devices A, B and C. Device A is actually executed at 1:03, device B at 1:06 and device C at 1:09. But if device C is removed at, e.g., 1:05pm, the scheduling round has already been triggered, so the scheduler still proceeds with the execution at 1:09. The device no longer exists, so the check command fails, and because the result is received under the same device name, the device is created from scratch. This has a significant side effect: all client metadata is missing, because the check result contains only limited client metadata, not the full metadata (labels and so on) that is present when the proxy client is created via the API.

In some situations this has an avalanche effect: if it happens before another check with a longer interval is executed, and that check relies on client metadata, that check obviously fails because there is only a dummy device with no metadata.

Possible Solution

Check manually for a couple of hours whether the device has been created again, and remove it again if it has.

Steps to Reproduce (for bugs)

  1. Create a proxy entity and some proxy checks that match this entity. Use splay, e.g. 90%, with a 10-minute interval, ideally with a different interval for each check.
  2. Let it sit for a while so that all checks are properly scheduled.
  3. Remove the device.
  4. Over the next few minutes, observe whether the device stays removed or is created again from check results. If using the dashboard, refresh the page several times during that period.

Context

It creates false events and also messes up the managed inventory.

Your Environment

5.21.0 compiled from source; cluster with 3 nodes (for sensuctl, agents and backend); embedded etcd; RedHat 7.8 virtual machine on a RedHat Virtualization cluster; proxy checks are configured to be scheduled round robin with splay enabled.

ccressent commented 4 years ago

This seems related to this option we had in Classic: https://docs.sensu.io/sensu-core/latest/api/clients/#clientsclient-delete-specification

We'll need to have a discussion about this for Sensu Go.

mcatngena commented 3 years ago

@ccressent Hi, did you reach a conclusion on this?

mcatngena commented 3 years ago

@ccressent @nikkictl @calebhailey Any progress on this? Cheers:)

calebhailey commented 3 years ago

@mcbsd I think your original assessment is correct, that this is caused by the splay configuration option.

We will have to investigate whether entity deletion can be enhanced to also delete an enqueued check request. Alternatively, we might be able to add validation to the scheduler to verify if an entity still exists before sending a proxy request with splay configuration.

In the interim we might need to document this as a known issue specifically related to the use of splay.

mcatngena commented 3 years ago

Thanks for the update.