roganartu opened this issue 5 years ago
@roganartu just curious, have you observed this behaviour in recent releases?
Yes, I noticed this again the other week; it seems that after config deploys the schedulers sometimes take a long time to all restart.
Some logs attached from after a bamboo rollout that triggered some PUTs to /api/.../checks. You can see it takes several hours for these changes to propagate to all schedulers, and I am unsure why. The timestamp shown is the systemd log time, so it's definitely not an Elasticsearch ingestion issue.
We are currently running a fork of 5.17.2 (no changes anywhere near the scheduler in the fork). I am planning to upgrade the fork to 5.19.0 probably later this week.
FYI, I continue to experience this (or something like it) regularly on 5.19.1, to the point where I now consider the check scheduler basically immutable in the absence of a backend restart.
I was, however, finally able to reproduce this a few times inside a containerised e2e testing env and noticed the following interesting behaviour: `sensuctl check execute` successfully identified the expected target host and correctly scheduled the check for it, only for the resulting event to become stale on expiry of the TTL. This seems to confirm that it is not a subscription targeting issue.

My reproduction involved creating a new check with `publish: true` and an interval <10s. After this failed to be scheduled I changed the interval multiple times and the subscription a few times (ultimately going so far as to set it to `entity:agent1-agent`). Each time the config was changed it was applied with `sensuctl create -f ...`; the backend emitted the expected `check configs updated` log entry, occasionally accompanied by the `check schedule changed` and `restarting scheduler` log entries; the output of `sensuctl check info ...` showed the updated details; and the check continued to never be scheduled for execution.
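For reference, the check used in the reproduction was roughly the following shape (the name, command, and TTL here are illustrative placeholders; only `publish: true`, the short interval, and the `entity:agent1-agent` subscription come from the reproduction described above):

```yaml
# Illustrative reproduction check (names are hypothetical); applied with
# `sensuctl create -f check.yaml` after each interval/subscription change.
type: CheckConfig
api_version: core/v2
metadata:
  name: repro-check        # hypothetical name
  namespace: default
spec:
  command: /bin/true       # hypothetical command
  publish: true
  interval: 5              # <10s, as in the reproduction
  ttl: 30                  # hypothetical; the resulting event went stale on TTL expiry
  subscriptions:
    - entity:agent1-agent  # the per-entity subscription eventually tried
```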
After restarting the backend container the check started being scheduled immediately and the same interval changes that failed to apply before successfully restarted the scheduler when re-applied. I have not been able to reproduce it again in this testing env since this backend restart, even after completely nuking and recreating the environment. I don't know what it is that is causing this broken state to be entered, but it seems to be triggered reliably in our prod, and very unreliably outside of it.
Having to restart the backend to pick up a minor config change is rather frustrating. This seems to happen basically every time we roll out a config change in prod now. We collect logs for everything and would love to provide more info so that this can finally be fixed, but the scheduler is not well instrumented (at least as of 5.19.1), so I'm not sure what else I can do here but keep bouncing procs.
Unfortunately I'm not sure what to make of this issue. It isn't something we are observing in QA, it isn't something we are observing in our own monitoring, and we haven't had anyone report a similar issue. :confused:
Perhaps we need to increase the amount of debug logging the scheduler is emitting, so we can trace its state more confidently.
Yeah, it's got me stumped too. I've tried a few times to reproduce it outside of prod and hadn't been able to until today, which is why I commented again. I appreciate that bugs without clear reproduction steps are the most difficult to track down and fix, and I agree that adding more logging to the scheduler would be a good next step.
I've got to upgrade our fork to 5.20 at some point anyway, so I might just add some logging in when I do and see what it yields.
After updating a check config, the `IntervalScheduler` for the check logged that it was restarting, but never appeared to do so.

Expected Behavior
After seeing the following log entries:
I would expect to see either `interval schedule has changed` or `check schedule has not changed` logged as well, per https://github.com/sensu/sensu-go/blob/9b3c09c5071f1d35b8cc6a3711bfcba5081a446b/backend/schedulerd/interval_scheduler.go#L107-L117, which should have been called after the `restarting scheduler` log message was sent by https://github.com/sensu/sensu-go/blob/9b3c09c5071f1d35b8cc6a3711bfcba5081a446b/backend/schedulerd/check_watcher.go#L150-L153

e.g. from a healthy node:
Current Behavior
The above logs from the scheduler are the last logs present from it (over a period of around 5 hours) despite multiple check config changes that should have triggered scheduler restarts. My understanding from the source (and previous observations of the logs) is that every check config change should trigger some kind of scheduler message in the logs, even if it's just confirming that the schedule didn't change. The complete absence of logs since the entries pasted above suggests that the scheduler or the `checkWatcher` got into some kind of broken state.

Checks continued to be scheduled at their existing intervals, so the scheduler still appears to be running in some capacity, but changes to check intervals neither trigger any logs from the scheduler nor change the interval checks are being scheduled at.
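To illustrate the suspected failure mode (a hypothetical sketch, not sensu-go's actual implementation): if the goroutine that applies watcher updates to a running scheduler returns early on some unhandled path, existing schedules keep firing at the old interval while every subsequent config update is silently dropped and never logged, which would match the symptoms above.

```go
package main

import "fmt"

// scheduler models (hypothetically) a running check scheduler that receives
// interval updates from a check watcher over a channel.
type scheduler struct {
	interval int
	updates  chan int      // interval updates from the watcher
	done     chan struct{} // closed when the update loop exits
}

// drainUpdates applies updates to the running scheduler. A negative value
// stands in for an unhandled error path that makes the loop return early,
// after which the channel is never read again.
func (s *scheduler) drainUpdates() {
	defer close(s.done)
	for iv := range s.updates {
		if iv < 0 {
			return // BUG: loop exits; later updates are silently lost
		}
		s.interval = iv
		fmt.Println("restarting scheduler, new interval:", iv)
	}
}

// tryUpdate mimics a non-blocking notify from the watcher: if nobody is
// listening, the update is dropped without any log line.
func (s *scheduler) tryUpdate(iv int) bool {
	select {
	case s.updates <- iv:
		return true
	default:
		return false
	}
}

func main() {
	s := &scheduler{interval: 10, updates: make(chan int), done: make(chan struct{})}
	go s.drainUpdates()

	s.updates <- 30 // applied while the update loop is alive
	s.updates <- -1 // stand-in for the unhandled error: the loop returns
	<-s.done

	// The watcher keeps "notifying", but nobody is listening any more:
	// ticks would continue at the old interval, and no log line appears.
	if !s.tryUpdate(60) {
		fmt.Println("update dropped; checks keep firing at interval", s.interval)
	}
}
```

In this model the process stays up and checks keep executing, so nothing looks obviously broken from the outside; only the absence of `restarting scheduler`-style log lines after a config change reveals that updates are going nowhere.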
Additionally, we have a check that doesn't currently match any entities which, in a normal state, periodically logs the following:
However, no logs matching this were emitted after the scheduler restart failure. After restarting sensu-backend the logs returned.
Steps to Reproduce (for bugs)
I am unsure how the backend got into this state. It is a sole master, so there are no clustering concerns. If it happens again I will dig further to try to get some reproduction steps.
Restarting the Sensu backend forcibly restarts the schedulers and hence resolves the issue.
Your Environment
.deb package