Closed adammcdonagh closed 5 years ago
I just reproduced this issue by spinning up a 3-node cluster, adding the following proxy check and creating a partition:
type: CheckConfig
api_version: core/v2
metadata:
  name: proxy_check
  namespace: default
spec:
  command: echo pong
  interval: 60
  proxy_entity_name: ping
  publish: true
  round_robin: true
  subscriptions:
  - proxy
To create a network partition, I've used the following iptables command: iptables -A INPUT -p tcp --dport 2380 -j DROP
(you need to run it on 2 of the 3 nodes in order to end up with no leader).
Then, I've received the following errors:
{"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-08-12T18:32:57Z"}
{"component":"schedulerd","error":"error while starting ring watcher: etcdserver: request timed out","level":"error","msg":"error scheduling check","name":"proxy_check","namespace":"default","scheduler_type":"round-robin interval","time":"2019-08-12T18:32:57Z"}
{"component":"schedulerd","level":"warning","msg":"shutting down scheduler","name":"proxy_check","namespace":"default","scheduler_type":"round-robin interval","time":"2019-08-12T18:32:57Z"}
Finally, I removed the iptables rules, which brought the cluster back online, but no round-robin check results were received after that:
# sensuctl cluster health
=== Etcd Cluster ID: 80c05ee29d12458
ID Name Error Healthy
────────────────── ────────── ─────── ─────────
8927110dc66458af backend0 true
a23843c228b27241 backend2 true
bb75bf8de77581cc backend1 true
# sensuctl event list
Entity Check Output Status Silenced Timestamp
─────────────────────── ───────────── ───────────────────────────────────────────────────────────────────────────────── ──────── ────────── ───────────────────────────────
localhost.localdomain keepalive Keepalive last sent from localhost.localdomain at 2019-08-12 18:45:28 +0000 UTC 0 false 2019-08-12 18:45:28 +0000 UTC
ping proxy_check pong 0 false 2019-08-12 18:31:48 +0000 UTC
Looking at the code, it appears we might need a retry mechanism around here: https://github.com/sensu/sensu-go/blob/c7c0221125a842e5ed704f48b9633207b1011bf5/backend/ringv2/ringv2.go#L358-L363
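For illustration only, here's a rough sketch (not the actual ringv2 code) of what wrapping that etcd call in a retry could look like. The key name, helper name, attempt count, delay, and etcd client import path are all assumptions on my part:

```go
// Hypothetical sketch: retry the etcd Get that the ring watcher performs at
// startup, so a transient "etcdserver: request timed out" during a brief
// quorum loss does not permanently kill the scheduler's watcher.
package ringretry

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// getRingStateWithRetry is a made-up helper; it retries a bounded number of
// times before giving up and returning the last error.
func getRingStateWithRetry(ctx context.Context, cli *clientv3.Client, key string) (*clientv3.GetResponse, error) {
	var lastErr error
	for attempt := 1; attempt <= 5; attempt++ {
		resp, err := cli.Get(ctx, key)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		log.Printf("ring watcher: Get %q failed (attempt %d): %v", key, attempt, err)
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Second):
		}
	}
	return nil, lastErr
}
```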
While I was unable to reproduce this issue with Docker, I was able to reproduce it as @palourde described in our AWS staging environment. The logs confirm that the error is coming from the ring; however, a retry mechanism might not be sufficient for a quorum loss that lasts more than a reasonable length of time. Are you suggesting that the retry loop is infinite, or do you have a proposed retry length?
I was also able to confirm that TessenD emitted the same log: error while starting ring watcher: etcdserver: request timed out. After quorum loss and recovery, TessenD stopped sending data that was coordinated by the ring in a round-robin fashion (other metrics, such as events processed, were unaffected).
This leads me to believe that another potential solution would be for both daemons to restart the ring (or repopulate the ring pools) in the event of a ring error. SchedulerD would need to restart all schedulers after quorum is reestablished, but the retry mechanism sounds like it could be a cleaner approach. What are your thoughts @palourde (and @echlebek because he is the ring master)?
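To make that alternative concrete, here's a minimal, hypothetical supervisor sketch. None of these names are real schedulerd APIs; the idea is just to recreate a scheduler whenever its ring watcher dies instead of letting it shut down for good:

```go
// Hypothetical sketch of the "restart on ring error" alternative: a small
// supervisor loop that recreates a check scheduler whenever its ring watcher
// fails, rather than shutting it down permanently.
package supervisor

import (
	"context"
	"log"
	"time"
)

// runScheduler stands in for starting one round-robin scheduler; it returns
// when the underlying ring watcher fails (e.g. "request timed out").
type runScheduler func(ctx context.Context) error

func superviseScheduler(ctx context.Context, name string, run runScheduler) {
	for {
		err := run(ctx)
		if err == nil || ctx.Err() != nil {
			return // clean shutdown, or the backend is stopping
		}
		log.Printf("scheduler %q stopped with ring error, restarting: %v", name, err)
		select {
		case <-ctx.Done():
			return
		case <-time.After(5 * time.Second):
		}
	}
}
```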
Let's try adding an infinite retry to this for now. Something with backoff so it doesn't overwhelm the service when it comes back. However, I'm starting to think that we should be crashing more often in these types of scenarios.
That's a pretty big behaviour change though, and we'll need to do some discussion and testing around it.
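As a rough sketch of what that could look like (retryForever is a made-up name, and the delay cap and jitter values are arbitrary, not Sensu defaults):

```go
// Minimal sketch of an infinite retry with exponential backoff and jitter,
// assuming the operation is "start the ring watcher" and that it should be
// retried until the backend's context is canceled.
package ringretry

import (
	"context"
	"log"
	"math/rand"
	"time"
)

func retryForever(ctx context.Context, op func(context.Context) error) error {
	delay := 250 * time.Millisecond
	const maxDelay = 30 * time.Second
	for {
		err := op(ctx)
		if err == nil {
			return nil
		}
		log.Printf("ring watcher start failed, retrying in %s: %v", delay, err)
		// Sleep with jitter so all schedulers don't hammer etcd at once
		// when quorum comes back.
		jitter := time.Duration(rand.Int63n(int64(delay) / 2))
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay + jitter):
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}
```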
Hello @echlebek, @palourde, @nikkictl, @adammcdonagh, @jspaleta,
I am using the latest Sensu OSS version, and my checks stop getting scheduled:
{"component":"schedulerd","cron":"0 /2 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-elkes-data-backup-validation","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:21Z"} {"component":"schedulerd","cron":"/35 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-netty-thread-status-alert","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:21Z"} {"component":"schedulerd","cron":"/4 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-nginx-basic-status","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:22Z"} {"component":"schedulerd","cron":"/4 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-nginx-connection-alert","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:22Z"} {"component":"schedulerd","cron":"/4 ","error":"error while starting ring watcher: context canceled","level":"error","msg":"error scheduling check","name":"tech-nginx-mdm-ui-server-status","namespace":"default","scheduler_type":"round-robin cron","time":"2024-02-07T16:27:22Z"}
Sample snippet of my check:
{ "api_version":"core/v2", "type":"Check", "metadata":{ "namespace":"default", "name":"eventhubthroughput-alert", "annotations": { "sensu.io.json_attributes": "{\"type\":\"standard\",\"occurrences\":5,\"refresh\":3600}" } }, "spec":{ "command":"python3.11 path_file/file_name.py", "subscriptions":[ "worker" ], "publish":true, "round_robin":true, "cron": "0 15 *", "handlers":[ "tester_handler", "alert_handler", "resolve_handler" ], "proxy_entity_name":"proxyclient" } }
Any update, guys?
We have 2 round robin checks. For both of them, only our Sensu servers are subscribed.
I'm not sure if it's only round robin checks that this affects; however, both times it has been the same 2 checks that stop getting scheduled.
Twice over the past week, these checks have stopped being scheduled... The only way I have found to resolve the issue is to restart the backend instances...
A message is logged in the backend log showing this, but the check never appears to get rescheduled.
Expected Behavior
Checks should never stop being scheduled unless they are no longer published.
Current Behavior
Round robin checks stop being triggered after a random amount of time.
It looks like it's happening after a loss of leader:
Steps to Reproduce (for bugs)
Unknown. It may be possible to trigger it by forcing a loss of leader.
Context
Checks that don't get scheduled need to be manually identified... We have a script to check this; however, in this case, the check that is no longer being scheduled is that script...
Your Environment
RHEL 7, Sensu Go v5.12