Open sminnee opened 4 years ago
These 'broken jobs' messages once used up around 1000 euros of SMS budget overnight on a critical system for which I had temporarily set up an SMS error handler... :-)
I think every cron run currently checks for and outputs these alerts, so if you're running one or even multiple threads each minute this can result in a lot of alerts.
Instead of outputting these alerts periodically or with a lower (configurable) frequency, wouldn't it make sense to output an alert only once per broken job?
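To make the "notify once" idea concrete, here is a rough sketch in Python. It is not the module's actual PHP code: `BrokenJobNotifier`, `send`, and `repeat_after` are hypothetical names made up for illustration. It assumes the health check can supply the IDs of the currently broken jobs, and it keeps its state in memory; since each cron run is normally a fresh process, a real implementation would have to persist that state somewhere (for example on the job record).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class BrokenJobNotifier:
    """Hypothetical helper: alert once per broken job, optionally repeating."""

    send: Callable[[str], None]           # e.g. an e-mail or SMS sender (assumed)
    repeat_after: Optional[float] = None  # seconds; None means never repeat
    _last_sent: Dict[int, float] = field(default_factory=dict)

    def check(self, broken_job_ids: Iterable[int], now: float) -> List[int]:
        """Send alerts for jobs that are due and return their IDs."""
        broken = set(broken_job_ids)

        # Forget jobs that are no longer broken, so a later failure alerts again.
        for job_id in list(self._last_sent):
            if job_id not in broken:
                del self._last_sent[job_id]

        notified = []
        for job_id in sorted(broken):
            last = self._last_sent.get(job_id)
            due = last is None or (
                self.repeat_after is not None and now - last >= self.repeat_after
            )
            if due:
                self.send(f"Broken job found in the job queue: #{job_id}")
                self._last_sent[job_id] = now
                notified.append(job_id)
        return notified
```

With `repeat_after=None` each broken job triggers exactly one alert; setting it to something like `3600` gives the lower, configurable frequency instead, so both behaviours could sit behind a single setting.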
Generally speaking a job will have broken because of an error, and that error will have been passed to whatever system you have in place for error handling. So I don't think "notify once" is needed; if you disabled it entirely you would end up with the functionality you seek.
Which would ideally be the case indeed. But often a job failure may be caused by running out of memory or otherwise getting stuck on something and being restarted/stopped at some point by the runner; in that case error handling tends to not (always) get executed. I think that's the reason for the job-health checking being in place(?).
So for me it is important to get notified of 'failed' jobs (via e-mail/sms), just not every minute. Also we don't set up Raygun/Sentry on every system so relying on a third party for notifications would be less desirable.
An example for us is checking for potential composer package updates within CWP, where it's a part of the default recipe. The task there in some circumstances fails on insufficient memory, possibly due to a bug in the checker, who knows. Unscheduling/deleting the job is not a solution as it always gets recreated by dev/build.
Closely related, but I believe “broken jobs” and “stalled jobs” have different messages.
My general feedback (based on multiple projects):
I have queuedjobs set up on a site with raygun error logging.
If a job breaks (which reports an error via raygun) then roughly once an hour I will get a subsequent message "Broken jobs were found in the job queue".
Because this leads to a Raygun notification, this gets quite spammy, especially on a weekend. Since the site in question recreates jobs periodically anyway, and the broken job is benign, this is doubly so.
A few thoughts about how to address this; one or more of these might be useful.
It would be interesting to hear whether other deployments of queuedjobs have this issue.
If it turns out these facilities already exist then I would suggest that we address this ticket by updating docs, as I couldn't see mention of this in the docs.