raintank / worldping-api

Worldping Backend Service

queue worker jobs error handling strategy. #14

Open woodsaj opened 8 years ago

woodsaj commented 8 years ago

Issue by Dieterbe Friday Jul 24, 2015 at 15:16 GMT Originally opened as https://github.com/raintank/grafana/issues/367


What follows is partly a rant about the state of worker queue systems (to the extent I'm familiar with the space; hopefully I'm wrong), but also, I hope, a well-written piece where I highlight issues I've seen around worker queue semantics and their effects (often subtle, but with significant impact). I want to get us on the same page and design for a deliberate behavior that is well understood across the various failure scenarios, and as hassle-free as possible. I tried to keep it short, but if it's still too long you can skip to the last section.

My examples are around worker jobs for alerting that need to query graphite via the bosun library for data, but obviously the same concepts apply to other cases.

the 3 error categories

3 classes of errors can appear during execution of a worker job (a rough sketch of how these could be modeled follows the list):

  1. obviously fatal and the job will never succeed when retried (e.g. empty graphite querystring)
  2. obviously non-fatal and should succeed when retried: e.g. when graphite returns 5xx errors, the tcp connection breaks midway, etc.
  3. ambiguous: e.g.:
    A can't connect to graphite: could be fatal (misconfigured connection string) or non-fatal (graphite host is temporarily down and it needs restarting, manually or by an init system)
    B can't parse the json response from graphite: if it's due to an error such as "series does not exist", it is pretty fatal when the series name is badly specified, but the job might succeed later when the series gets created. Or, if the error is caused by a bug in the worker or in graphite that could be fixed and deployed fairly quickly, you could see it as non-fatal.
    C a function specified in a bosun/graphite expression does not exist: perhaps there is a typo (fatal) or perhaps we forgot to deploy the code that provides the new function (non-fatal)
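
To make these categories concrete, here is a minimal Go sketch (not from the original discussion; the ErrorClass type, classify function, and ErrEmptyQuery sentinel are hypothetical names):

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// ErrorClass captures the three categories discussed above.
type ErrorClass int

const (
	Fatal     ErrorClass = iota // will never succeed on retry
	Transient                   // should succeed on retry
	Ambiguous                   // could go either way; needs an explicit policy
)

// ErrEmptyQuery is a hypothetical sentinel for category 1.
var ErrEmptyQuery = errors.New("empty graphite query string")

// classify is a hypothetical mapping from an error to a category.
func classify(err error) ErrorClass {
	var netErr net.Error
	switch {
	case errors.Is(err, ErrEmptyQuery):
		return Fatal // e.g. empty graphite querystring: retrying can never help
	case errors.As(err, &netErr) && netErr.Timeout():
		return Transient // e.g. graphite temporarily slow or overloaded
	default:
		return Ambiguous // e.g. connection refused, unparsable response, unknown function
	}
}

func main() {
	fmt.Println(classify(ErrEmptyQuery) == Fatal) // true
}
```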

message acknowledgement behavior based on error category

The traditional notion that I'm used to is simply

fatal error in job -> ack the message so it is removed from the queue and never retried
non-fatal error in job -> don't ack the message, so we can retry it later as it might succeed.
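
A minimal sketch of that traditional rule, assuming a generic Delivery interface that stands in for whatever broker client is actually used (all names here are hypothetical):

```go
package main

import "fmt"

// Delivery stands in for a message handed to us by the broker client;
// the real type depends on which queue library is used.
type Delivery interface {
	Ack() error
	Nack(requeue bool) error
}

// handle applies the traditional rule: ack on success or fatal error,
// nack with requeue on non-fatal error so the job can be retried later.
func handle(d Delivery, jobErr error, fatal bool) error {
	if jobErr == nil || fatal {
		return d.Ack()
	}
	return d.Nack(true)
}

// fakeDelivery is just enough to run the example.
type fakeDelivery struct{}

func (fakeDelivery) Ack() error              { fmt.Println("acked"); return nil }
func (fakeDelivery) Nack(requeue bool) error { fmt.Println("nacked, requeue:", requeue); return nil }

func main() {
	// a graphite 5xx is non-fatal, so the message is nacked and will be retried
	_ = handle(fakeDelivery{}, fmt.Errorf("graphite returned 503"), false)
}
```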

But this breaks down for a few reasons:

I don't recall ever seeing a worker queueing system that takes all these issues into account and provides insights (statistical overviews and breakdowns of which jobs have been retried for how long or how many times) and tools (dropping a bunch of jobs based on fine-grained criteria, e.g. scheduled between x and y AND their resulting error was foo). It sounds like I want to store structured metadata for jobs and have an SQL interface to said metadata. Currently we all just seem to log errors to a log, which is completely decoupled from the metadata about the job inside the queue. For example, it might make more sense to store "bad-smtp-server: can't connect" in the job metadata in the queue, so that we could select from job where error = "bad-smtp-server: can't connect".
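
Something like the following is what that paragraph seems to ask for: a structured per-job record plus a queryable store. The struct, table, and column names below are invented for illustration and not an existing feature of any broker:

```go
package main

import (
	"fmt"
	"time"
)

// JobRecord is a hypothetical structured record kept per job, so failures can
// be queried instead of only being buried in a log file.
type JobRecord struct {
	ID          int64
	ScheduledAt time.Time
	Attempts    int
	LastError   string // e.g. "bad-smtp-server: can't connect"
}

// selectFailedJobs is the kind of query the paragraph asks for, if these
// records lived in SQL (table and column names are made up).
const selectFailedJobs = `
SELECT id, scheduled_at, attempts, last_error
FROM job
WHERE last_error = 'bad-smtp-server: can''t connect'
  AND scheduled_at BETWEEN ? AND ?`

func main() {
	fmt.Println(selectFailedJobs)
	fmt.Printf("%+v\n", JobRecord{ID: 1, ScheduledAt: time.Now(), LastError: "bad-smtp-server: can't connect"})
}
```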

So where does this leave raintank and Grafana?

Luckily, for the case of alerting jobs and with our current simplistic system, stale jobs are not very useful: we mostly care about the latest state because we only do critical notifications, and stale jobs are only useful for filling in gaps in the health metrics.

so we can basically do one of two things:

I'm personally a fan of the latter approach.
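
Given the staleness argument above, one simple building block (regardless of which approach is picked) would be for the consumer to discard jobs older than some freshness window. A sketch, with a made-up window value:

```go
package main

import (
	"fmt"
	"time"
)

// maxStaleness is a made-up freshness window: alerting jobs older than this are
// only useful for back-filling health metrics, not for sending notifications.
const maxStaleness = 2 * time.Minute

// isStale reports whether a job scheduled at ts is too old to be worth alerting on.
func isStale(ts, now time.Time) bool {
	return now.Sub(ts) > maxStaleness
}

func main() {
	scheduled := time.Now().Add(-5 * time.Minute)
	fmt.Println(isStale(scheduled, time.Now())) // true: back-fill metrics at most, don't notify
}
```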

woodsaj commented 8 years ago

Comment by woodsaj Sunday Jul 26, 2015 at 14:14 GMT


As far as I am concerned, any job that completes processing should be acked, regardless of whether an error was encountered or not. The purpose of the alerting jobs is to determine the state of something at a specific time. I think logging that you could not determine the state is more important than back-filling data. That is the whole reason for having unknown as a valid state.

If we do want to handle brief transient errors, then we can either perform the retry inline X number of times before the error escalates to fatal (e.g. perhaps we got a timeout because a backend node just died; retrying immediately will connect us to a different backend). Alternatively, instead of NACK'ing a job on error, we can queue a new job that includes some metadata about the previous errors and how many times the job has been tried, and ACK the original job.
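
A rough sketch of the two options described here, with hypothetical types and stand-in functions for the real queue client:

```go
package main

import (
	"errors"
	"fmt"
)

// Job carries hypothetical retry metadata alongside the alerting check payload.
type Job struct {
	CheckID   int64
	Attempts  int
	LastError string
}

// Option 1: retry inline up to maxTries before treating the error as fatal.
func runWithInlineRetry(run func() error, maxTries int) error {
	var err error
	for i := 0; i < maxTries; i++ {
		if err = run(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d tries: %w", maxTries, err)
}

// Option 2: instead of nacking, publish a new job with updated metadata, then
// ack the original. publish and ack are stand-ins for the real queue client.
func requeueWithMetadata(j Job, jobErr error, publish func(Job) error, ack func() error) error {
	j.Attempts++
	j.LastError = jobErr.Error()
	if err := publish(j); err != nil {
		return err // couldn't re-publish; caller decides whether to ack or nack
	}
	return ack()
}

func main() {
	err := runWithInlineRetry(func() error { return errors.New("timeout") }, 3)
	fmt.Println(err)

	j := Job{CheckID: 42}
	_ = requeueWithMetadata(j, errors.New("graphite 503"),
		func(nj Job) error { fmt.Printf("re-published: %+v\n", nj); return nil },
		func() error { fmt.Println("acked original"); return nil })
}
```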

woodsaj commented 8 years ago

Comment by Dieterbe Monday Jul 27, 2015 at 11:09 GMT


> As far as I am concerned, any job that completes processing should be acked, regardless of whether an error was encountered or not. The purpose of the alerting jobs is to determine the state of something at a specific time. I think logging that you could not determine the state is more important than back-filling data. That is the whole reason for having unknown as a valid state.

I like these points. So the question then becomes: which types of errors should cause the job to "complete" with status=unknown? Currently there's a whole class of errors (like graphite query failures) that bail out early with an error. Are you saying that basically any error encountered should cause us to exit with status=unknown, as opposed to bailing out?
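
One way to read that suggestion, sketched with made-up names: the executor maps any evaluation error to an unknown state and still completes (and therefore acks) the job rather than bailing out:

```go
package main

import (
	"errors"
	"fmt"
)

// CheckState mirrors the idea of ok / critical / unknown alerting states;
// the names here are illustrative, not the actual worldping types.
type CheckState int

const (
	StateOK CheckState = iota
	StateCritical
	StateUnknown
)

// evaluate completes with StateUnknown on any evaluation error instead of
// bailing out, so the job can always be acked.
func evaluate(queryGraphite func() (float64, error), threshold float64) CheckState {
	val, err := queryGraphite()
	if err != nil {
		return StateUnknown
	}
	if val > threshold {
		return StateCritical
	}
	return StateOK
}

func main() {
	state := evaluate(func() (float64, error) { return 0, errors.New("graphite 503") }, 100)
	fmt.Println(state == StateUnknown) // true
}
```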

> If we do want to handle brief transient errors, then we can either perform the retry inline X number of times before the error escalates to fatal (e.g. perhaps we got a timeout because a backend node just died; retrying immediately will connect us to a different backend). Alternatively, instead of NACK'ing a job on error, we can queue a new job that includes some metadata about the previous errors and how many times the job has been tried, and ACK the original job.

The problem with putting the retry in the actual job executor is that it basically blocks the pipeline for other jobs that need processing, and we typically want to keep this timeframe short, which also makes it less likely that certain types of errors get resolved within the short retry interval (I know that if it's just an issue on one backend, getting LB'd to a different one should fix it, but that's not a robust solution covering all kinds of failure scenarios, especially ones that need time to resolve). The re-queue-with-updated-metadata approach lets us build in a solid retry interval during which we can work on other jobs. But then we have to worry about what happens when we can't store a new job in the queue, or when we lose the connection right after putting in the new job but before we can ack the current message: then we'll have duplicates in the queue. (Why don't work queues have a built-in for incrementing attempt numbers and retrying a few times :( or do they?)
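
For completeness, a sketch of the application-level attempt counting that the last sentence wishes the queue provided natively (all names hypothetical). Note this does not close the duplicate window between re-publish and ack; that needs support from the broker itself (e.g. transactions or deduplicated/idempotent jobs):

```go
package main

import "fmt"

// RetryJob is a hypothetical job envelope carrying its own attempt counter.
type RetryJob struct {
	CheckID  int64
	Attempts int
}

const maxAttempts = 3

// consume decides what to do with a (re-)delivered job: run it again while
// attempts remain, otherwise record state unknown and stop retrying.
func consume(j RetryJob, run func() error, requeue func(RetryJob), markUnknown func(RetryJob)) {
	if err := run(); err == nil {
		return
	}
	j.Attempts++
	if j.Attempts >= maxAttempts {
		markUnknown(j)
		return
	}
	requeue(j)
}

func main() {
	j := RetryJob{CheckID: 7, Attempts: 2}
	consume(j,
		func() error { return fmt.Errorf("graphite timeout") },
		func(nj RetryJob) { fmt.Printf("requeued: %+v\n", nj) },
		func(nj RetryJob) { fmt.Printf("gave up, unknown: %+v\n", nj) })
	// prints: gave up, unknown: {CheckID:7 Attempts:3}
}
```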