sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License

Check scheduler got into strange state #3124

Open roganartu opened 5 years ago

roganartu commented 5 years ago

After updating a check config, the IntervalScheduler for the check logged that it was restarting, but never appeared to do so.

Expected Behavior

After seeing the following log entries:

Jul 09 09:52:29 shanetmon1 sensu-backend[22273]: {"component":"schedulerd","level":"info","msg":"check configs updated","time":"2019-07-09T09:52:29-04:00"}
Jul 09 09:52:29 shanetmon1 sensu-backend[22273]: {"component":"schedulerd","level":"info","msg":"restarting scheduler","time":"2019-07-09T09:52:29-04:00"}
Jul 09 09:52:29 shanetmon1 sensu-backend[22273]: {"component":"apid","duration":"2.193ms","level":"info","method":"PUT","msg":"request completed","path":"/api/core/v2/namespaces/default/checks/foo","size":787,"status":200,"time":"2019-07-09T09:52:29-04:00"}

I would expect to also see either interval schedule has changed or check schedule has not changed logged, per https://github.com/sensu/sensu-go/blob/9b3c09c5071f1d35b8cc6a3711bfcba5081a446b/backend/schedulerd/interval_scheduler.go#L107-L117, which should have been called after the restarting scheduler log message was emitted by https://github.com/sensu/sensu-go/blob/9b3c09c5071f1d35b8cc6a3711bfcba5081a446b/backend/schedulerd/check_watcher.go#L150-L153

e.g. from a healthy node:

Jul 09 14:31:01 shanetmon1 sensu-backend[14125]: {"component":"schedulerd","level":"info","msg":"check configs updated","time":"2019-07-09T14:31:01-04:00"}
Jul 09 14:31:01 shanetmon1 sensu-backend[14125]: {"component":"schedulerd","level":"info","msg":"restarting scheduler","time":"2019-07-09T14:31:01-04:00"}
Jul 09 14:31:01 shanetmon1 sensu-backend[14125]: {"component":"schedulerd","level":"info","msg":"check schedule has not changed","name":"foo","namespace":"default","scheduler_type":"interval","time":"2019-07-09T14:31:01-04:00"}
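
To make the expected flow concrete, here is a toy model of the restart path as I understand it from the linked source. The identifiers and structure are my own approximation, not the actual sensu-go code; the point is only that every config update pushed into the scheduler's run loop should produce one of the two follow-up log messages:

```go
// Toy model of the restart flow; identifiers are approximate, not copied
// from the linked source. Every config update received by the run loop
// should log either "interval schedule has changed" or "check schedule
// has not changed".
package main

import (
	"log"
	"time"
)

type checkConfig struct {
	Name     string
	Interval time.Duration
}

type intervalScheduler struct {
	check   checkConfig
	updates chan checkConfig
	done    chan struct{}
}

// toggleSchedule mirrors the decision point referenced in
// interval_scheduler.go: compare the new interval with the current one
// and log which branch was taken.
func (s *intervalScheduler) toggleSchedule(next checkConfig) bool {
	if next.Interval != s.check.Interval {
		log.Printf("interval schedule has changed: %s", next.Name)
		s.check = next
		return true
	}
	log.Printf("check schedule has not changed: %s", next.Name)
	return false
}

func (s *intervalScheduler) run() {
	timer := time.NewTimer(s.check.Interval)
	defer timer.Stop()
	for {
		select {
		case <-timer.C:
			log.Printf("executing check: %s", s.check.Name)
			timer.Reset(s.check.Interval)
		case next := <-s.updates:
			// The bug report is that this branch appears never to run
			// after certain updates, even though the watcher logged
			// "restarting scheduler".
			if s.toggleSchedule(next) {
				timer.Reset(s.check.Interval)
			}
		case <-s.done:
			return
		}
	}
}

func main() {
	s := &intervalScheduler{
		check:   checkConfig{Name: "foo", Interval: 2 * time.Second},
		updates: make(chan checkConfig),
		done:    make(chan struct{}),
	}
	go s.run()

	// Simulate the watcher pushing an updated config, as a PUT to
	// /api/core/v2/namespaces/default/checks/foo would.
	time.Sleep(3 * time.Second)
	log.Print("restarting scheduler")
	s.updates <- checkConfig{Name: "foo", Interval: time.Second}

	time.Sleep(3 * time.Second)
	close(s.done)
}
```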

Current Behavior

The logs above are the last entries from the scheduler over a period of around 5 hours, despite multiple check config changes that should have triggered scheduler restarts. My understanding from the source (and from previous observations of the logs) is that every check config change should produce some kind of scheduler message, even if it's just confirming that the schedule didn't change. The complete absence of scheduler logs since the entries pasted above suggests that the scheduler or the checkWatcher got into some kind of broken state.

Checks continued to be scheduled at their existing intervals, so the scheduler still appears to be running in some capacity, but changes to check intervals neither trigger any logs from the scheduler nor change the interval at which checks are actually scheduled.

Additionally, we have a check that doesn't currently match any entities, and which in a normal state periodically logs the following:

Jul 09 14:24:02 shanetmon1 sensu-backend[14125]: {"check":"bar","component":"schedulerd","level":"warning","msg":"no matching entities, check will not be published","namespace":"default","time":"2019-07-09T14:24:02-04:00"}
Jul 09 14:29:02 shanetmon1 sensu-backend[14125]: {"check":"bar","component":"schedulerd","level":"warning","msg":"no matching entities, check will not be published","namespace":"default","time":"2019-07-09T14:29:02-04:00"}

However, no logs matching this have been emitted since the scheduler restart failure. After restarting sensu-backend, the logs returned.

Steps to Reproduce (for bugs)

I am unsure how the backend got into this state. It is a sole master, so there are no clustering concerns. If it happens again I will dig further to try to get some reproduction steps.

Restarting the Sensu backend forcibly restarts the schedulers, which resolves the issue.
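
On a systemd host (as in the journal excerpts above), that amounts to:

```
sudo systemctl restart sensu-backend
```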

Your Environment

echlebek commented 4 years ago

@roganartu just curious, have you observed this behaviour in recent releases?

roganartu commented 4 years ago

Yes, I noticed this again the other week; it seems that after config deploys the schedulers sometimes take a long time to all restart.

Some logs are attached from after a Bamboo rollout that triggered some PUTs to /api/.../checks. You can see that it takes several hours for these changes to propagate to all schedulers; I am unsure why. The timestamp shown is the systemd log time, so it's definitely not an Elasticsearch ingestion issue.

[Screenshot from 2020-03-31 15-07-52: scheduler restart log entries spread over several hours after the rollout]

We are currently running a fork of 5.17.2 (no changes anywhere near the scheduler in the fork). I am planning to upgrade the fork to 5.19.0 probably later this week.

roganartu commented 4 years ago

FYI, I continue to experience this (or something like it) regularly on 5.19.1, to the point where I now consider the check scheduler effectively immutable in the absence of a backend restart.

I was, however, finally able to reproduce this a few times inside a containerised e2e testing env and noticed the following interesting behaviour:

My reproduction involved creating a new check with publish:true and an interval under 10s. After this failed to be scheduled, I changed the interval multiple times and the subscriptions a few times (ultimately going so far as to set them to entity:agent1-agent). Each time the config was changed it was applied with sensuctl create -f ...; the backend emitted the expected check configs updated entry, occasionally accompanied by the check schedule changed and restarting scheduler log entries; the output of sensuctl check info ... showed the updated details; and the check was still never scheduled for execution.
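
For reference, the resource I was applying with sensuctl create -f looked roughly like this (the name, command, and values below are placeholders, not our real config):

```yaml
# Illustrative reproduction check; name, command, and values are
# placeholders, not the real config from our environment.
type: CheckConfig
api_version: core/v2
metadata:
  name: repro-check
  namespace: default
spec:
  command: "true"
  publish: true
  interval: 5
  subscriptions:
    - entity:agent1-agent
```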

After restarting the backend container, the check started being scheduled immediately, and the same interval changes that had failed to apply before successfully restarted the scheduler when re-applied. I have not been able to reproduce the problem again in this testing env since that backend restart, even after completely nuking and recreating the environment. I don't know what causes this broken state to be entered, but it seems to be triggered reliably in our prod, and very unreliably outside of it.

Having to restart the backend to pick up a minor config change is rather frustrating, and this now seems to happen basically every time we roll out a config change in prod. We collect logs for everything and would love to provide more info so that this can finally be fixed, but the scheduler is not well instrumented (at least as of 5.19.1), so I'm not sure what else I can do here but keep bouncing procs.

echlebek commented 4 years ago

Unfortunately I'm not sure what to make of this issue. It isn't something we are observing in QA, it isn't something we are observing in our own monitoring, and we haven't had anyone report a similar issue. :confused:

Perhaps we need to increase the amount of debug logging the scheduler is emitting, so we can trace its state more confidently.
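
For example, something along these lines at each scheduler state transition (a sketch only; the helper and fields below are proposed, not existing code):

```go
// Sketch of possible extra instrumentation for schedulerd, using logrus
// as the backend already does. logSchedulerState and its fields are
// proposed, not existing code.
package main

import "github.com/sirupsen/logrus"

var logger = logrus.WithFields(logrus.Fields{
	"component": "schedulerd",
})

// logSchedulerState records each scheduler state transition so the
// restart path can be traced from the journal.
func logSchedulerState(check, namespace, state string, interval uint32) {
	logger.WithFields(logrus.Fields{
		"check":     check,
		"namespace": namespace,
		"state":     state, // e.g. "stopping", "starting", "running"
		"interval":  interval,
	}).Debug("scheduler state transition")
}

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	// With debug logging enabled, a stuck restart would show up as a
	// "stopping" transition with no matching "starting"/"running".
	logSchedulerState("foo", "default", "stopping", 60)
	logSchedulerState("foo", "default", "starting", 30)
}
```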

roganartu commented 4 years ago

Yeah, it's got me stumped too. I've tried a few times to reproduce it outside of prod and hadn't been able to until today, which is why I commented again. I appreciate that bugs without clear reproduction steps are the most difficult to track down and fix, and I agree that adding more logging to the scheduler would be a good next step.

I've got to upgrade our fork to 5.20 at some point anyway, so I might just add some logging when I do and see what it yields.