symbiote / silverstripe-queuedjobs

A module that provides interfaces for scheduling jobs for certain times.
BSD 3-Clause "New" or "Revised" License

"Broken jobs were found in the job queue" error spam #299

Open sminnee opened 4 years ago

sminnee commented 4 years ago

I have queuedjobs set up on a site with raygun error logging.

If a job breaks (which reports an error via raygun) then roughly once an hour I will get a subsequent message "Broken jobs were found in the job queue".

Because this leads to a Raygun notification, it gets quite spammy, especially on a weekend. Since the site in question recreates jobs periodically anyway, and the broken job is benign, this is doubly so.

A few thoughts about how to address this; one or more of these might be useful.

It would be interesting to hear whether other deployments of queuedjobs have this issue.

If it turns out these facilities already exist then I would suggest that we address this ticket by updating docs, as I couldn't see mention of this in the docs.

micschk commented 4 years ago

These 'broken jobs' messages once used up around 1000 euros of SMS budget overnight on a critical system for which I had temporarily set up an SMS error handler... :-)

I think every cron run currently checks for and outputs these alerts, so if you're running one or even multiple threads each minute, this can result in a lot of alerts.

Instead of outputting these alerts periodically or with a lower (configurable) frequency, wouldn't it make sense to just output an alert only once (per broken job)?
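
A minimal sketch of the "alert only once per broken job" idea, written in Python rather than the module's PHP. Every name here (`BrokenJobAlerter`, `check`, `send_alert`) is hypothetical and not part of queuedjobs. The state is kept in memory for brevity; since each cron run is a separate process, a real implementation would presumably need to persist it, for example as a flag on the job record.

```python
from typing import Callable, Iterable, List, Set


class BrokenJobAlerter:
    """Hypothetical helper: alert once per broken job, not once per cron run."""

    def __init__(self, send_alert: Callable[[str], None]) -> None:
        # send_alert stands in for whatever channel is used (e-mail, SMS, logger).
        self._send_alert = send_alert
        self._already_reported: Set[int] = set()

    def check(self, broken_job_ids: Iterable[int]) -> List[int]:
        """Alert about newly broken jobs only and return their IDs."""
        current = set(broken_job_ids)
        new_ids = sorted(current - self._already_reported)
        if new_ids:
            self._send_alert(f"Broken jobs were found in the job queue: {new_ids}")
        # Remember only jobs that are still broken, so a job that recovers
        # and breaks again later triggers a fresh alert.
        self._already_reported = current
        return new_ids
```

With this shape, repeated health checks over the same set of broken jobs send a single alert, while a job that breaks later still gets reported.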

sminnee commented 4 years ago

Generally speaking a job will have broken because of an error, and that error will have been passed to whatever system you have in place for error handling. So I don't think "notify once" is needed; if you disabled it entirely you would end up with the functionality you seek.

micschk commented 4 years ago

Which would ideally be the case indeed. But often job failure may be caused by running out of memory, or by otherwise getting stuck on something and being restarted/stopped at some point by the runner; in those cases error handling tends not to (always) get executed. I think that's the reason for the job-health checking being in place(?).

So for me it is important to get notified of 'failed' jobs (via e-mail/sms), just not every minute. Also we don't set up Raygun/Sentry on every system so relying on a third party for notifications would be less desirable.

michalkleiner commented 4 years ago

An example for us is checking for potential composer package updates within CWP, where it's a part of the default recipe. The task there in some circumstances fails on insufficient memory, possibly due to a bug in the checker, who knows. Unscheduling/deleting the job is not a solution as it always gets recreated by dev/build.

chillu commented 4 years ago

Duplicate of https://github.com/symbiote/silverstripe-queuedjobs/issues/24?

sminnee commented 4 years ago

Closely related, but I believe “broken jobs” and “stalled jobs” have different messages.

mfendeksilverstripe commented 4 years ago

My general feedback (based on multiple projects):