"Broken jobs were found in the job queue" error spam

sminnee commented 4 years ago

I have queuedjobs set up on a site with raygun error logging.

If a job breaks (which reports an error via raygun) then roughly once an hour I will get a subsequent message "Broken jobs were found in the job queue".

Because this leads to raygun notification, this gets quite spammy, especially on a weekend. Since the site in question recreates jobs periodically anyway, and the broken job is benign, this is doubly so.

A few thoughts about how to address this; one or more of these might be useful.

Add a config option to decide on whether "Broken jobs were found in the job queue" errors should be thrown
Add a facility where broken jobs can be automatically retried
Lower the frequency of such alerts – a daily alert to go and clean up jobs might be more useful.
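
To make the "lower the frequency" idea concrete, here is a minimal sketch of a throttle that lets the alert through at most once per interval. It is illustrative only: `AlertThrottle` and `should_send` are hypothetical names, not part of queuedjobs, and a real implementation would have to persist the last-sent timestamp between cron runs rather than hold it in memory.

```python
from datetime import datetime, timedelta
from typing import Optional


class AlertThrottle:
    """Hypothetical helper: lets an alert through at most once per interval."""

    def __init__(self, interval: timedelta) -> None:
        self.interval = interval
        # Assumption: kept in memory here; a real system would store this somewhere durable.
        self.last_sent: Optional[datetime] = None

    def should_send(self, now: datetime) -> bool:
        # Suppress the alert if the previous one went out less than one interval ago.
        if self.last_sent is not None and now - self.last_sent < self.interval:
            return False
        self.last_sent = now
        return True


throttle = AlertThrottle(timedelta(days=1))
start = datetime(2020, 1, 4, 0, 0)
# Simulate an hourly health check over a 48-hour weekend.
sent = [throttle.should_send(start + timedelta(hours=h)) for h in range(48)]
alerts_sent = sum(sent)
```

With hourly checks and a one-day interval, only the checks at hour 0 and hour 24 would raise an alert.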

It would be interesting to hear whether other deployments of queuedjobs have this issue.

If it turns out these facilities already exist then I would suggest that we address this ticket by updating docs, as I couldn't see mention of this in the docs.

micschk commented 4 years ago

These 'broken jobs' messages once used up around 1,000 euros of SMS budget overnight on a critical system for which I had temporarily set up an SMS error handler... :-)

I think every cron run currently checks for and outputs these alerts, so if you're running one or even multiple threads each minute, this can result in a lot of alerts.

Instead of outputting these alerts periodically or with a lower (configurable) frequency, wouldn't it make sense to just output an alert only once (per broken job)?
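
A rough sketch of what "alert only once per broken job" could look like: remember which job IDs have already been reported and only alert on new ones. The function name and the in-memory set are assumptions for illustration; in practice the "already notified" state would probably need to live on the job record or in some other persistent store.

```python
from typing import Iterable, List, Set


def new_broken_jobs(broken_ids: Iterable[int], already_notified: Set[int]) -> List[int]:
    """Return the broken job IDs not yet alerted on, and record them as notified."""
    fresh = [job_id for job_id in broken_ids if job_id not in already_notified]
    already_notified.update(fresh)
    return fresh


notified: Set[int] = set()
first_run = new_broken_jobs([7, 9], notified)       # both jobs are new, so both are reported
second_run = new_broken_jobs([7, 9], notified)      # nothing new, so no alert
third_run = new_broken_jobs([7, 9, 12], notified)   # only job 12 is new
```

The health check would then send a message only when the returned list is non-empty.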

sminnee commented 4 years ago

Generally speaking a job will have broken because of an error, and that error will have been passed to whatever system you have in place for error handling. So I don't think "notify once" is needed; if you disabled it entirely you would end up with the functionality you seek.

micschk commented 4 years ago

Which would ideally be the case indeed. But often job failure may be caused by running out of memory or otherwise getting stuck on something and being restarted/stopped at some point by the runner, in which case error handling tends to not (always) get executed. I think that's the reason for the job-health checking being in place(?).

So for me it is important to get notified of 'failed' jobs (via e-mail/sms), just not every minute. Also we don't set up Raygun/Sentry on every system so relying on a third party for notifications would be less desirable.

michalkleiner commented 4 years ago

An example for us is checking for potential composer package updates within CWP, where it's a part of the default recipe. The task there in some circumstances fails on insufficient memory, possibly due to a bug in the checker, who knows. Unscheduling/deleting the job is not a solution as it always gets recreated by dev/build.

chillu commented 4 years ago

Duplicate of https://github.com/symbiote/silverstripe-queuedjobs/issues/24?

sminnee commented 4 years ago

Closely related, but I believe “broken jobs” and “stalled jobs” have different messages.

mfendeksilverstripe commented 4 years ago

My general feedback (based on multiple projects):

email notifications are not that useful (for both stalled and broken jobs)
instead we rely on Raygun reporting
checking queue health is really useful as it applies automatic resume attempts for stalled jobs
to further reduce the number of broken jobs we have to deal with, we use an automatic retry system for broken jobs; I added this system to the feature review PR
this is very useful for jobs that may break but where the error can be safely ignored, e.g. jobs that trigger third-party requests (request failure) or embargo publishing of multiple localisations of the same page (DB deadlock)
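
For readers unfamiliar with the idea, an automatic retry for broken jobs might be sketched roughly as follows. This is not the code from the feature review PR; the `Job` class, the status strings, and the limit of three attempts are all assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumed limit, purely illustrative


@dataclass
class Job:
    name: str
    status: str = "broken"
    attempts: int = 0


def requeue_if_allowed(job: Job, max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Put a broken job back in the queue unless it has used up its attempts."""
    if job.status != "broken" or job.attempts >= max_attempts:
        return False
    job.attempts += 1
    job.status = "new"
    return True


job = Job("publish-localisations")
results = []
for _ in range(4):
    results.append(requeue_if_allowed(job))
    job.status = "broken"  # simulate the job failing again on every run
```

Once the attempts are exhausted the job stays broken, which is the point at which a single alert would be worth sending.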

symbiote / silverstripe-queuedjobs

"Broken jobs were found in the job queue" error spam #299