humanmade / Cavalcade

A better wp-cron. Horizontally scalable, works perfectly with multisite.
https://engineering.hmn.md/projects/cavalcade/

Catch orphaned "running" jobs from MIA workers #31

Open dd32 opened 7 years ago

dd32 commented 7 years ago

Similar to #18, we ran into an issue on WordPress.org where one of the Cavalcade daemons was killed unexpectedly, which left a bunch of jobs in an unknown state.

The jobs were marked as running, but there were no workers to manage those jobs anymore. This resulted in jobs not running for a few hours/days until they were detected and restarted.

There should be some way to detect that a job is no longer running, or that its daemon is MIA.

In this case, I simply restarted the jobs: UPDATE wp_cavalcade_jobs SET status = 'waiting' WHERE status = 'running' AND nextrun <= '2016-12-06'

larssn commented 7 years ago

A daemon being killed off while running jobs is especially relevant in horizontally scaled setups. If a node is taken offline, then the jobs can get stuck in this state, and have to be restarted manually.

rmccue commented 7 years ago

@larssn Indeed, we've been working on solving this ourselves. It's tough to come up with a solution to it. Mostly, we've been focusing on ensuring the daemon safely shuts down the workers.

If anyone has better ideas, all ears. :)

willmot commented 7 years ago

Could you make use of the flock side-effect of clearing file locks when the PHP process dies? We've used this on BackUpWordPress (https://github.com/humanmade/backupwordpress/pull/1025) to detect when a long-running process has crashed or been killed, so we can update its status accordingly rather than having it forever show as (incorrectly) running.
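
For illustration, a minimal sketch of the flock idea, written in Python rather than taken from the BackUpWordPress PHP code. The function names and lock-file path are hypothetical, and it assumes a POSIX host with a local filesystem: the worker holds an exclusive lock for as long as it runs, and a checker treats the job as dead once it can take the lock itself.

```python
import fcntl
import os
import tempfile


def acquire_job_lock(path):
    """Open `path` and take an exclusive lock on it.

    The caller keeps the returned file object open while the job runs.
    The kernel drops the lock when the file is closed or the process dies.
    """
    handle = open(path, "a")
    fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return handle


def job_is_running(path):
    """Return True if something still holds the lock on `path`."""
    with open(path, "a") as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # lock is held, so the worker is still alive
        fcntl.flock(probe, fcntl.LOCK_UN)
        return False  # we got the lock, so the worker is gone


# Usage: a hypothetical lock file for one job.
lock_path = os.path.join(tempfile.mkdtemp(), "job.lock")
worker = acquire_job_lock(lock_path)
print(job_is_running(lock_path))  # checked while the worker holds the lock
worker.close()                    # stands in for the worker process dying
print(job_is_running(lock_path))
```

This only works when the checker and the worker see the same lock file, so it covers a single host and not several nodes.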

rmccue commented 7 years ago

Unfortunately not, due to the horizontal distribution. IIRC NFS doesn't support file locks either.

dd32 commented 7 years ago

The "correct" approach here for horizontally distributed apps would be for each daemon to have a DB row listing its status and where it's running from, with a date stamp bumped every minute or so. Jobs would then have to be listed as status = running, server = ID#4.

Other daemons would need to periodically check the table to see if any other daemons had started, accepted jobs, and gone away without marking themselves as shutdown, and initiate a cleanup.

It'd probably need to be implemented as a wp-cli job which each daemon fires off every ~5 minutes to perform the health checks (i.e. it can't be a listed job; it'd have to be a custom thing, as you want it on each server).
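
The heartbeat scheme described above could be sketched roughly as follows. This is a Python/SQLite illustration under stated assumptions, not Cavalcade code: the `daemons` and `jobs` tables, their columns, and the function names are all hypothetical, and timestamps are plain integers (seconds) to keep the example deterministic.

```python
import sqlite3

# Hypothetical schema: one row per daemon, and each job records its owner.
SCHEMA = """
CREATE TABLE daemons (id INTEGER PRIMARY KEY, host TEXT, last_seen INTEGER);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, daemon_id INTEGER);
"""


def heartbeat(conn, daemon_id, host, now):
    """Record that a daemon is alive; each daemon calls this every minute or so."""
    conn.execute(
        "INSERT OR REPLACE INTO daemons (id, host, last_seen) VALUES (?, ?, ?)",
        (daemon_id, host, now),
    )


def claim_job(conn, job_id, daemon_id):
    """Mark a waiting job as running and note which daemon owns it."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'running', daemon_id = ? "
        "WHERE id = ? AND status = 'waiting'",
        (daemon_id, job_id),
    )
    return cur.rowcount == 1


def release_orphaned_jobs(conn, now, timeout=300):
    """Reset running jobs whose daemon has not checked in within `timeout`.

    Returns the number of jobs put back into the waiting state.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = 'waiting', daemon_id = NULL "
        "WHERE status = 'running' AND daemon_id IN "
        "(SELECT id FROM daemons WHERE last_seen < ?)",
        (now - timeout,),
    )
    return cur.rowcount
```

Any surviving daemon can run `release_orphaned_jobs` periodically. The timeout needs to be comfortably longer than the heartbeat interval, otherwise a slow but healthy daemon would have its jobs taken away.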

rmccue commented 7 months ago

See https://github.com/humanmade/Cavalcade-Runner/issues/75 for a solution for system shutdown, where we will pass SIGTERM to the workers and SIGKILL if they fail to respond.
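
As a rough illustration of the SIGTERM-then-SIGKILL pattern mentioned here (a Python sketch, not the Cavalcade-Runner implementation; the function names and the grace period are assumptions, and it assumes a POSIX system):

```python
import subprocess
import sys


def stop_worker(proc, grace_seconds=5.0):
    """Ask a worker to exit with SIGTERM, then SIGKILL it if it does not.

    Returns the worker's return code; on POSIX a negative value -N means
    the process was terminated by signal N.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # sends SIGKILL, which cannot be caught or ignored
        proc.wait()   # reap the process so it does not linger as a zombie
    return proc.returncode


def spawn_stubborn_worker():
    """Demo only: start a child that ignores SIGTERM and wait until it is ready."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has installed its handler
    return proc
```

Calling `stop_worker(spawn_stubborn_worker(), grace_seconds=0.5)` exercises the fallback path, since the child never reacts to SIGTERM. This handles an orderly shutdown only; a daemon that is itself killed outright still leaves its jobs marked as running.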