Closed adammcdonagh closed 5 years ago
I just reproduced this issue by spinning up a 3-node cluster, adding the following proxy check and creating a partition:
type: CheckConfig
api_version: core/v2
metadata:
  name: proxy_check
  namespace: default
spec:
  command: echo pong
  interval: 60
  proxy_entity_name: ping
  publish: true
  round_robin: true
  subscriptions:
  - proxy
To create a network partition, I've used the following iptables command: iptables -A INPUT -p tcp --dport 2380 -j DROP
(you need to run it on 2 of the 3 nodes in order to end up with no leader).
Then, I've received the following errors:
{"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-08-12T18:32:57Z"}
{"component":"schedulerd","error":"error while starting ring watcher: etcdserver: request timed out","level":"error","msg":"error scheduling check","name":"proxy_check","namespace":"default","scheduler_type":"round-robin interval","time":"2019-08-12T18:32:57Z"}
{"component":"schedulerd","level":"warning","msg":"shutting down scheduler","name":"proxy_check","namespace":"default","scheduler_type":"round-robin interval","time":"2019-08-12T18:32:57Z"}
Finally, I removed the iptables rules, which brought the cluster back online, but no round-robin check results were received after that:
# sensuctl cluster health
=== Etcd Cluster ID: 80c05ee29d12458
ID Name Error Healthy
────────────────── ────────── ─────── ─────────
8927110dc66458af backend0 true
a23843c228b27241 backend2 true
bb75bf8de77581cc backend1 true
# sensuctl event list
Entity Check Output Status Silenced Timestamp
─────────────────────── ───────────── ───────────────────────────────────────────────────────────────────────────────── ──────── ────────── ───────────────────────────────
localhost.localdomain keepalive Keepalive last sent from localhost.localdomain at 2019-08-12 18:45:28 +0000 UTC 0 false 2019-08-12 18:45:28 +0000 UTC
ping proxy_check pong 0 false 2019-08-12 18:31:48 +0000 UTC
Looking at the code, it appears we might need a retry mechanism around here: https://github.com/sensu/sensu-go/blob/c7c0221125a842e5ed704f48b9633207b1011bf5/backend/ringv2/ringv2.go#L358-L363
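For illustration only, here's a rough sketch (not the actual ringv2 code) of what wrapping that etcd call in a retry could look like. The key name, helper name, attempt count, delay, and etcd client import path are all assumptions on my part:

```go
// Hypothetical sketch: retry the etcd Get that the ring watcher performs at
// startup, so a transient "etcdserver: request timed out" during a brief
// quorum loss does not permanently kill the scheduler's watcher.
package ringretry

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// getRingStateWithRetry is a made-up helper; it retries a bounded number of
// times before giving up and returning the last error.
func getRingStateWithRetry(ctx context.Context, cli *clientv3.Client, key string) (*clientv3.GetResponse, error) {
	var lastErr error
	for attempt := 1; attempt <= 5; attempt++ {
		resp, err := cli.Get(ctx, key)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		log.Printf("ring watcher: Get %q failed (attempt %d): %v", key, attempt, err)
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Second):
		}
	}
	return nil, lastErr
}
```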
While I was unable to reproduce this issue with Docker, I was able to reproduce it as @palourde described in our AWS staging environment. The logs confirm that the error is coming from the ring; however, a retry mechanism might not be sufficient for a quorum loss that lasts more than a reasonable length of time. Are you suggesting that the retry loop is infinite, or do you have a proposed retry length?
I was also able to confirm that TessenD emitted the same log: error while starting ring watcher: etcdserver: request timed out. After quorum loss and recovery, TessenD stopped sending data that was coordinated by the ring in a round-robin fashion (other metrics, such as events processed, were unaffected).
This leads me to believe that another potential solution would be for both daemons to restart the ring (or repopulate the ring pools) in the event of a ring error. SchedulerD would need to restart all schedulers after quorum is reestablished, but the retry mechanism sounds like it could be a cleaner approach. What are your thoughts @palourde (and @echlebek because he is the ring master)?
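To make that alternative concrete, here's a minimal, hypothetical supervisor sketch. None of these names are real schedulerd APIs; the idea is just to recreate a scheduler whenever its ring watcher dies instead of letting it shut down for good:

```go
// Hypothetical sketch of the "restart on ring error" alternative: a small
// supervisor loop that recreates a check scheduler whenever its ring watcher
// fails, rather than shutting it down permanently.
package supervisor

import (
	"context"
	"log"
	"time"
)

// runScheduler stands in for starting one round-robin scheduler; it returns
// when the underlying ring watcher fails (e.g. "request timed out").
type runScheduler func(ctx context.Context) error

func superviseScheduler(ctx context.Context, name string, run runScheduler) {
	for {
		err := run(ctx)
		if err == nil || ctx.Err() != nil {
			return // clean shutdown, or the backend is stopping
		}
		log.Printf("scheduler %q stopped with ring error, restarting: %v", name, err)
		select {
		case <-ctx.Done():
			return
		case <-time.After(5 * time.Second):
		}
	}
}
```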
Let's try adding an infinite retry to this for now. Something with backoff so it doesn't overwhelm the service when it comes back. However, I'm starting to think that we should be crashing more often in these types of scenarios.
That's a pretty big behaviour change though, and we'll need to do some discussion and testing around it.
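As a rough sketch of what that could look like (retryForever is a made-up name, and the delay cap and jitter values are arbitrary, not Sensu defaults):

```go
// Minimal sketch of an infinite retry with exponential backoff and jitter,
// assuming the operation is "start the ring watcher" and that it should be
// retried until the backend's context is canceled.
package ringretry

import (
	"context"
	"log"
	"math/rand"
	"time"
)

func retryForever(ctx context.Context, op func(context.Context) error) error {
	delay := 250 * time.Millisecond
	const maxDelay = 30 * time.Second
	for {
		err := op(ctx)
		if err == nil {
			return nil
		}
		log.Printf("ring watcher start failed, retrying in %s: %v", delay, err)
		// Sleep with jitter so all schedulers don't hammer etcd at once
		// when quorum comes back.
		jitter := time.Duration(rand.Int63n(int64(delay) / 2))
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay + jitter):
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}
```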
Hello @echlebek, @palourde, @nikkictl, @adammcdonagh, @jspaleta,
I am using the latest Sensu OSS version, and my checks stop getting scheduled:
{"component":"schedulerd","cron":"0 /2 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-elkes-data-backup-validation","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:21Z"} {"component":"schedulerd","cron":"/35 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-netty-thread-status-alert","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:21Z"} {"component":"schedulerd","cron":"/4 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-nginx-basic-status","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:22Z"} {"component":"schedulerd","cron":"/4 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-nginx-connection-alert","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:22Z"} {"component":"schedulerd","cron":"/4 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-nginx-mdm-ui-server-status","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:22Z"}
Sample snippet of my check:
{ "api_version":"core/v2", "type":"Check", "metadata":{ "namespace":"default", "name":"eventhubthroughput-alert", "annotations": { "sensu.io.json_attributes": "{\"type\":\"standard\",\"occurrences\":5,\"refresh\":3600}" } }, "spec":{ "command":"python3.11 path_file/file_name.py", "subscriptions":[ "worker" ], "publish":true, "round_robin":true, "cron": "0 15 *", "handlers":[ "tester_handler", "alert_handler", "resolve_handler" ], "proxy_entity_name":"proxyclient" } }
Any update, guys?
We have 2 round robin checks. For both of them, only our Sensu servers are subscribed.
I'm not sure if it's only round robin checks that this affects; however, both times it has been the same 2 checks that stop getting scheduled.
Twice over the past week, these checks have stopped being scheduled... The only way I have found to resolve the issue is to restart the backend instances...
A message is logged in the backend log showing this, but the check never appears to get rescheduled.
Expected Behavior
Checks should never stop being scheduled unless they are no longer published.
Current Behavior
Round robin checks stop being triggered after a random amount of time.
It looks like it's happening after a loss of leader:
Steps to Reproduce (for bugs)
Unknown. It may be possible to trigger it by forcing a loss of leader.
Context
Checks that don't get scheduled need to be manually identified... We have a script to check this; however, in this case, the check that is no longer being scheduled is that script...
Your Environment
RHEL 7, Sensu Go v5.12