raintank / worldping-api

Worldping Backend Service

queue worker jobs error handling strategy. #14

Open woodsaj opened 8 years ago

woodsaj commented 8 years ago

Issue by Dieterbe Friday Jul 24, 2015 at 15:16 GMT Originally opened as https://github.com/raintank/grafana/issues/367


What follows is partly a rant about the state of worker queue systems (to the extent I'm familiar with the space; hopefully I'm wrong), but also, I hope, a well-written piece where I highlight issues I've seen around worker queue semantics and their effects (often subtle, but with significant impact). I want to get us on the same page and design for a deliberate behavior that is well understood across the various failure scenarios, and as hassle-free as possible. I tried to keep it short, but if it's still too long you can skip to the last section.

My examples are around worker jobs for alerting that need to query graphite via the bosun library for data, but obviously the same concepts apply to other cases.

the 3 error categories

3 classes of errors can appear during execution of a worker job (a rough sketch of how these could be modeled follows the list):

  1. obviously fatal and the job will never succeed when retried (e.g. empty graphite querystring)
  2. obviously non-fatal and should succeed when retried: e.g. when graphite returns 5xx errors, the tcp connection breaks midway, etc.
  3. ambiguous: e.g.:
    A can't connect to graphite: could be fatal (misconfigured connection string) or non-fatal (graphite host is temporarily down and it needs restarting, manually or by an init system)
    B can't parse the json response from graphite: if it's due to an error such as "series does not exist", it is pretty fatal when the series name is badly specified, but the job might succeed later when the series gets created. Or, if the error is caused by a bug in the worker or in graphite that could be fixed and deployed fairly quickly, you could see it as non-fatal.
    C a function specified in a bosun/graphite expression does not exist: perhaps there is a typo (fatal) or perhaps we forgot to deploy the code that provides the new function (non-fatal)
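
To make these categories concrete, here is a minimal Go sketch (not from the original discussion; the ErrorClass type, classify function, and ErrEmptyQuery sentinel are hypothetical names):

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// ErrorClass captures the three categories discussed above.
type ErrorClass int

const (
	Fatal     ErrorClass = iota // will never succeed on retry
	Transient                   // should succeed on retry
	Ambiguous                   // could go either way; needs an explicit policy
)

// ErrEmptyQuery is a hypothetical sentinel for category 1.
var ErrEmptyQuery = errors.New("empty graphite query string")

// classify is a hypothetical mapping from an error to a category.
func classify(err error) ErrorClass {
	var netErr net.Error
	switch {
	case errors.Is(err, ErrEmptyQuery):
		return Fatal // e.g. empty graphite querystring: retrying can never help
	case errors.As(err, &netErr) && netErr.Timeout():
		return Transient // e.g. graphite temporarily slow or overloaded
	default:
		return Ambiguous // e.g. connection refused, unparsable response, unknown function
	}
}

func main() {
	fmt.Println(classify(ErrEmptyQuery) == Fatal) // true
}
```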

message acknowledgement behavior based on error category

The traditional notion that I'm used to is simply

fatal error in job -> ack the message so it is removed from the queue and never retried
non-fatal error in job -> don't ack the message, so we can retry it later as it might succeed.
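
A minimal sketch of that traditional rule, assuming a generic Delivery interface that stands in for whatever broker client is actually used (all names here are hypothetical):

```go
package main

import "fmt"

// Delivery stands in for a message handed to us by the broker client;
// the real type depends on which queue library is used.
type Delivery interface {
	Ack() error
	Nack(requeue bool) error
}

// handle applies the traditional rule: ack on success or fatal error,
// nack with requeue on non-fatal error so the job can be retried later.
func handle(d Delivery, jobErr error, fatal bool) error {
	if jobErr == nil || fatal {
		return d.Ack()
	}
	return d.Nack(true)
}

// fakeDelivery is just enough to run the example.
type fakeDelivery struct{}

func (fakeDelivery) Ack() error              { fmt.Println("acked"); return nil }
func (fakeDelivery) Nack(requeue bool) error { fmt.Println("nacked, requeue:", requeue); return nil }

func main() {
	// a graphite 5xx is non-fatal, so the message is nacked and will be retried
	_ = handle(fakeDelivery{}, fmt.Errorf("graphite returned 503"), false)
}
```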

But this breaks down for a few reasons:

I don't recall ever seeing a worker queueing system that takes all these issues into account and provides insights (statistical overviews and breakdowns of which jobs have been retried for how long or how many times) and tools (dropping a bunch of jobs based on fine-grained criteria, e.g. scheduled between x and y AND their resulting error was foo). It sounds like I want to store structured metadata for jobs and have an SQL interface to said metadata. Currently we all just seem to log errors to a log, which is completely decoupled from the metadata about the job inside the queue. For example, it might make more sense to store "bad-smtp-server: can't connect" in the job metadata in the queue, so that we could select from job where error = "bad-smtp-server: can't connect".
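
Something like the following is what that paragraph seems to ask for: a structured per-job record plus a queryable store. The struct, table, and column names below are invented for illustration and not an existing feature of any broker:

```go
package main

import (
	"fmt"
	"time"
)

// JobRecord is a hypothetical structured record kept per job, so failures can
// be queried instead of only being buried in a log file.
type JobRecord struct {
	ID          int64
	ScheduledAt time.Time
	Attempts    int
	LastError   string // e.g. "bad-smtp-server: can't connect"
}

// selectFailedJobs is the kind of query the paragraph asks for, if these
// records lived in SQL (table and column names are made up).
const selectFailedJobs = `
SELECT id, scheduled_at, attempts, last_error
FROM job
WHERE last_error = 'bad-smtp-server: can''t connect'
  AND scheduled_at BETWEEN ? AND ?`

func main() {
	fmt.Println(selectFailedJobs)
	fmt.Printf("%+v\n", JobRecord{ID: 1, ScheduledAt: time.Now(), LastError: "bad-smtp-server: can't connect"})
}
```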

So where does this leave raintank and Grafana?

Luckily, for the case of alerting jobs and with our current simplistic system, stale jobs are not very useful: we mostly care about the latest state because we only do critical notifications, and stale jobs are only useful for filling in gaps in the health metrics.

so we can basically do one of two things:

I'm personally a fan of the latter approach.
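
Given the staleness argument above, one simple building block (regardless of which approach is picked) would be for the consumer to discard jobs older than some freshness window. A sketch, with a made-up window value:

```go
package main

import (
	"fmt"
	"time"
)

// maxStaleness is a made-up freshness window: alerting jobs older than this are
// only useful for back-filling health metrics, not for sending notifications.
const maxStaleness = 2 * time.Minute

// isStale reports whether a job scheduled at ts is too old to be worth alerting on.
func isStale(ts, now time.Time) bool {
	return now.Sub(ts) > maxStaleness
}

func main() {
	scheduled := time.Now().Add(-5 * time.Minute)
	fmt.Println(isStale(scheduled, time.Now())) // true: back-fill metrics at most, don't notify
}
```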

woodsaj commented 8 years ago

Comment by woodsaj Sunday Jul 26, 2015 at 14:14 GMT


As far as I am concerned, any job that completes processing should be acked, regardless of whether an error was encountered or not. The purpose of the alerting jobs is to determine the state of something at a specific time. I think logging that you could not determine the state is more important than back-filling data. That is the whole reason for having unknown as a valid state.

If we do want to handle brief transient errors, then we can either perform the retry inline X number of times before the error escalates to fatal (e.g. perhaps we got a timeout because a backend node just died; retrying immediately will connect us to a different backend). Alternatively, instead of NACK'ing a job on error, we can queue a new job that includes some metadata about the previous errors and how many times the job has been tried, and ACK the original job.
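
A rough sketch of the two options described here, with hypothetical types and stand-in functions for the real queue client:

```go
package main

import (
	"errors"
	"fmt"
)

// Job carries hypothetical retry metadata alongside the alerting check payload.
type Job struct {
	CheckID   int64
	Attempts  int
	LastError string
}

// Option 1: retry inline up to maxTries before treating the error as fatal.
func runWithInlineRetry(run func() error, maxTries int) error {
	var err error
	for i := 0; i < maxTries; i++ {
		if err = run(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d tries: %w", maxTries, err)
}

// Option 2: instead of nacking, publish a new job with updated metadata, then
// ack the original. publish and ack are stand-ins for the real queue client.
func requeueWithMetadata(j Job, jobErr error, publish func(Job) error, ack func() error) error {
	j.Attempts++
	j.LastError = jobErr.Error()
	if err := publish(j); err != nil {
		return err // couldn't re-publish; caller decides whether to ack or nack
	}
	return ack()
}

func main() {
	err := runWithInlineRetry(func() error { return errors.New("timeout") }, 3)
	fmt.Println(err)

	j := Job{CheckID: 42}
	_ = requeueWithMetadata(j, errors.New("graphite 503"),
		func(nj Job) error { fmt.Printf("re-published: %+v\n", nj); return nil },
		func() error { fmt.Println("acked original"); return nil })
}
```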

woodsaj commented 8 years ago

Comment by Dieterbe Monday Jul 27, 2015 at 11:09 GMT


> As far as I am concerned, any job that completes processing should be acked, regardless of whether an error was encountered or not. The purpose of the alerting jobs is to determine the state of something at a specific time. I think logging that you could not determine the state is more important than back-filling data. That is the whole reason for having unknown as a valid state.

I like these points. So the question then becomes: which types of errors should cause the job to "complete" with status=unknown? Currently there's a whole class of errors (like graphite query failures) that bail out early with an error. Are you saying that basically any error encountered should cause us to exit with status=unknown, as opposed to bailing out?
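
One way to read that suggestion, sketched with made-up names: the executor maps any evaluation error to an unknown state and still completes (and therefore acks) the job rather than bailing out:

```go
package main

import (
	"errors"
	"fmt"
)

// CheckState mirrors the idea of ok / critical / unknown alerting states;
// the names here are illustrative, not the actual worldping types.
type CheckState int

const (
	StateOK CheckState = iota
	StateCritical
	StateUnknown
)

// evaluate completes with StateUnknown on any evaluation error instead of
// bailing out, so the job can always be acked.
func evaluate(queryGraphite func() (float64, error), threshold float64) CheckState {
	val, err := queryGraphite()
	if err != nil {
		return StateUnknown
	}
	if val > threshold {
		return StateCritical
	}
	return StateOK
}

func main() {
	state := evaluate(func() (float64, error) { return 0, errors.New("graphite 503") }, 100)
	fmt.Println(state == StateUnknown) // true
}
```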

> If we do want to handle brief transient errors, then we can either perform the retry inline X number of times before the error escalates to fatal (e.g. perhaps we got a timeout because a backend node just died; retrying immediately will connect us to a different backend). Alternatively, instead of NACK'ing a job on error, we can queue a new job that includes some metadata about the previous errors and how many times the job has been tried, and ACK the original job.

The problem with putting the retry in the actual job executor is that it basically blocks the pipeline for other jobs that need processing, and we typically want to keep this timeframe short, which also makes it less likely that certain types of errors get resolved within the short retry interval (I know that if it's just an issue on one backend, getting LB'd to a different one should fix it, but that's not a robust solution covering all kinds of failure scenarios, especially ones that need time to resolve). The re-queue-with-updated-metadata approach lets us build in a solid retry interval during which we can work on other jobs. But then we have to worry about what happens when we can't store a new job in the queue, or when we lose the connection right after putting in the new job but before we can ack the current message: then we'll have duplicates in the queue. (Why don't work queues have a built-in for incrementing attempt numbers and retrying a few times :( or do they?)
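
For completeness, a sketch of the application-level attempt counting that the last sentence wishes the queue provided natively (all names hypothetical). Note this does not close the duplicate window between re-publish and ack; that needs support from the broker itself (e.g. transactions or deduplicated/idempotent jobs):

```go
package main

import "fmt"

// RetryJob is a hypothetical job envelope carrying its own attempt counter.
type RetryJob struct {
	CheckID  int64
	Attempts int
}

const maxAttempts = 3

// consume decides what to do with a (re-)delivered job: run it again while
// attempts remain, otherwise record state unknown and stop retrying.
func consume(j RetryJob, run func() error, requeue func(RetryJob), markUnknown func(RetryJob)) {
	if err := run(); err == nil {
		return
	}
	j.Attempts++
	if j.Attempts >= maxAttempts {
		markUnknown(j)
		return
	}
	requeue(j)
}

func main() {
	j := RetryJob{CheckID: 7, Attempts: 2}
	consume(j,
		func() error { return fmt.Errorf("graphite timeout") },
		func(nj RetryJob) { fmt.Printf("requeued: %+v\n", nj) },
		func(nj RetryJob) { fmt.Printf("gave up, unknown: %+v\n", nj) })
	// prints: gave up, unknown: {CheckID:7 Attempts:3}
}
```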