humanmade / Cavalcade

A better wp-cron. Horizontally scalable, works perfectly with multisite.
https://engineering.hmn.md/projects/cavalcade/

Auto retry failed jobs #60

Open xpd opened 6 years ago

xpd commented 6 years ago

Since out-of-the-box monitoring is missing, it would be nice to be able to retry failed jobs automatically. Combined with a timeout, this would have prevented almost all of the issues we've had.

remkade commented 5 years ago

I also would love this. Cavalcade is great, but failed jobs are causing a massive headache.

I'd be happy to work on a PR if there is any guidance on how people would prefer to handle edge cases and retries.

svandragt commented 5 years ago

Hey folks. We run this and have the same problem. Cavalcade assumes every failed job needs to be investigated and manually resumed; otherwise it never runs again. However, a job does not always fail because of a problem in the code itself. External factors such as connectivity problems can cause a failure too. I found that failed jobs with an interval are best marked as completed, since this results in them being rescheduled. Jobs without an interval are best marked as waiting so that they run again.

I built a self-healing job that corrects failed jobs older than x minutes, which leaves some time for an investigation. Ideally, a job that fails repeatedly should not be corrected, as a repeatable failure is likely caused by an issue at the application level. Perhaps that should trigger a notification of some kind?

If it is of interest I could spend a bit of time polishing this solution up and see if we can share it?
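
As a rough illustration of the rule described above (this is not the actual health job, which has not been shared here), the decision could be sketched in Python as follows. Everything in it is an assumption made for the example: the `FailedJob` fields, the 30-minute grace period standing in for "x minutes", and the cut-off of three failures are hypothetical and not part of Cavalcade.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical tuning values; neither comes from Cavalcade itself.
GRACE_PERIOD = timedelta(minutes=30)  # the "x minutes" left for investigation
MAX_FAILURES = 3                      # assumed cut-off for "fails repeatedly"


@dataclass
class FailedJob:
    """Hypothetical view of a job that is currently marked as failed."""
    interval: Optional[int]  # seconds between runs; None for one-off jobs
    failed_at: datetime      # when the job was marked as failed
    failure_count: int       # consecutive failures seen so far


def corrected_status(job: FailedJob, now: datetime) -> str:
    """Return the status a health check would write back for a failed job."""
    if now - job.failed_at < GRACE_PERIOD:
        return "failed"     # too recent: leave it for manual investigation
    if job.failure_count >= MAX_FAILURES:
        return "failed"     # probably an application bug: do not retry
    if job.interval:
        return "completed"  # recurring job: completing it lets it be rescheduled
    return "waiting"        # one-off job: put it back in the queue
```

A job that stays `failed` because it reached the failure cut-off is the case where a notification would make sense. How failures are counted is left open here, since that depends on what the runner records.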

archon810 commented 5 years ago

Yes, very much so. I've been watching Cavalcade in hopes of resolving our constant issues with failed scheduled posts.

It's possible that just by switching, this problem would be resolved. But in case it isn't, a retry mechanism would make sense. WP cron doesn't have one at all.

remkade commented 5 years ago

I have a MySQL Event (basically a cron job inside MySQL) that I made for a specific WooCommerce Subscriptions job that must not stop processing when it fails.

Here's what it looks like:

CREATE EVENT IF NOT EXISTS `fix_cavalcade`
ON SCHEDULE EVERY 1 HOUR
DISABLE ON SLAVE
COMMENT 'Sets cavalcade job status back to waiting + 2 minutes if it fails'
DO
  UPDATE wordpress.wp_cavalcade_jobs
  SET status = 'waiting'
  WHERE id = 18
    AND
    site = 86
    AND
    (
      status = 'failed'
      OR
      nextrun < NOW() - INTERVAL 2 HOUR
    );

If you're running this inside RDS, make sure you update your parameter group to enable the event scheduler (event_scheduler = ON)! It took me weeks to figure out why it wasn't running.