Open dd32 opened 7 years ago
A daemon being killed off while running jobs is especially relevant in horizontally scaled setups. If a node is taken offline, then the jobs can get stuck in this state, and have to be restarted manually.
@larssn Indeed, we've been working on solving this ourselves. It's tough to come up with a solution to it. Mostly, we've been focusing on ensuring the daemon safely shuts down the workers.
If anyone has better ideas, all ears. :)
Could you make use of the flock side-effect of clearing file locks when the PHP process dies? We've used this on BackUpWordPress (https://github.com/humanmade/backupwordpress/pull/1025) to detect when a long-running process has crashed or been killed, so we can update its status accordingly rather than having it forever show as (incorrectly) running.
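For anyone unfamiliar with the trick, here's a minimal sketch of how it works. The file path and function names are hypothetical, not part of Cavalcade or BackUpWordPress; the key property is that the OS clears the lock automatically when the holding process dies, however it dies:

```php
<?php
// Sketch only: hypothetical lock file and function names.

// Called by the long-running worker: hold an exclusive lock for the process's
// lifetime. The OS releases the lock automatically when the process exits or
// is killed.
function acquire_job_lock(string $lockFile) {
    $fh = fopen($lockFile, 'c');
    if (!flock($fh, LOCK_EX | LOCK_NB)) {
        return false; // Another live process already holds the lock.
    }
    return $fh; // Keep this handle open until the worker exits.
}

// Called by a health check in a separate process: if the lock can be taken,
// the worker that should have been holding it is no longer running.
function job_appears_dead(string $lockFile): bool {
    $fh = fopen($lockFile, 'c');
    $dead = flock($fh, LOCK_EX | LOCK_NB);
    if ($dead) {
        flock($fh, LOCK_UN);
    }
    fclose($fh);
    return $dead;
}
```

As the next comment notes, this only helps when every daemon can see the same lock files, which rules it out for horizontally distributed setups.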
Unfortunately not, due to the horizontal distribution. IIRC NFS doesn't support file locks either.
The "correct" approach here for horizontally distributed apps would be for each daemon to have a DB row listing it's status, and where it's running from, with a date stamp bumped every minute or so.
Jobs would then have to be listed as `status = running, server = ID#4`.
Other daemons would need to periodically check the table to see whether any other daemons had started, accepted jobs, and gone away without marking themselves as shut down, and then initiate a cleanup.
It'd probably need to be implemented as a wp-cli command which each daemon fires off every ~5 minutes to perform the health checks (i.e. it can't be a listed job; it'd have to be a custom thing, as you want it running on each server).
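Something along these lines, as a rough sketch. The `wp_cavalcade_daemons` heartbeat table, its columns, and the `server` column on `wp_cavalcade_jobs` are invented for illustration (Cavalcade doesn't ship them today); connection details are placeholders:

```php
<?php
// Hypothetical heartbeat / health-check sketch, not Cavalcade's actual schema.
$pdo = new PDO('mysql:host=localhost;dbname=wordpress', 'db_user', 'db_pass');
$server = gethostname();

// Each daemon: bump its own heartbeat every minute or so.
// (Assumes a unique key on the `server` column.)
$pdo->prepare(
    'INSERT INTO wp_cavalcade_daemons (server, status, last_seen)
     VALUES (:server, "running", NOW())
     ON DUPLICATE KEY UPDATE status = "running", last_seen = NOW()'
)->execute([':server' => $server]);

// The ~5 minute health check (e.g. a custom wp-cli command on each server):
// find daemons that stopped heartbeating without marking themselves shut down.
$stale = $pdo->query(
    'SELECT server FROM wp_cavalcade_daemons
     WHERE status = "running" AND last_seen < NOW() - INTERVAL 5 MINUTE'
)->fetchAll(PDO::FETCH_COLUMN);

foreach ($stale as $dead_server) {
    // Requeue whatever the dead daemon still claims to be running...
    $pdo->prepare(
        'UPDATE wp_cavalcade_jobs
         SET status = "waiting"
         WHERE status = "running" AND server = :server'
    )->execute([':server' => $dead_server]);

    // ...and record the daemon as dead so it isn't cleaned up twice.
    $pdo->prepare(
        'UPDATE wp_cavalcade_daemons SET status = "dead" WHERE server = :server'
    )->execute([':server' => $dead_server]);
}
```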
See https://github.com/humanmade/Cavalcade-Runner/issues/75 for a solution to system shutdown, where we will send SIGTERM to the workers, followed by SIGKILL if they fail to respond.
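The escalation described there amounts to something like the following sketch (not the runner's actual implementation; it assumes the posix and pcntl extensions, and the grace period is an arbitrary example):

```php
<?php
// Sketch of a SIGTERM-then-SIGKILL shutdown for a worker process.
// Requires the posix and pcntl extensions for posix_kill() and the constants.

function stop_worker(int $pid, int $grace_seconds = 30): void {
    posix_kill($pid, SIGTERM); // Ask the worker to finish up and exit.

    $deadline = time() + $grace_seconds;
    while (time() < $deadline) {
        // Signal 0 delivers nothing; it only checks whether the process exists.
        if (!posix_kill($pid, 0)) {
            return; // Worker exited within the grace period.
        }
        sleep(1);
    }

    posix_kill($pid, SIGKILL); // Unresponsive; force-kill it.
}
```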
Similar to #18, we ran into an issue on WordPress.org where one of the Cavalcade daemons was killed unexpectedly, which left a bunch of jobs in an unknown state.
The jobs were marked as running, but there were no workers to manage those jobs anymore. This resulted in jobs not running for a few hours/days until they were detected and restarted.
There should be some way to detect that a job is no longer running, or that its daemon is MIA.
In this case, I simply restarted the jobs:

```sql
UPDATE wp_cavalcade_jobs SET status = 'waiting' WHERE status = 'running' AND nextrun <= '2016-12-06'
```