msolli / proletarian

A durable job queuing and worker system for Clojure backed by PostgreSQL or MySQL.
MIT License
178 stars 9 forks source link

Worker stops on AssertionError #4

Closed tothda closed 3 weeks ago

tothda commented 3 years ago

I noticed that when an assert statement fails in one of my jobs the whole proletarian Queue Worker shuts down. A failing assertion generates an AssertionError which is not a subclass of Exception. So I guess the code should catch Throwable here: https://github.com/msolli/proletarian/blob/main/src/proletarian/worker.clj#L61

msolli commented 3 years ago

The reason why Proletarian does not catch Throwable is because then we would also catch errors that are subclasses of java.lang.Error, which include serious and critical conditions in the JVM that we could have no reasonable hope of dealing with - StackOverflowError and OutOfMemoryError in particular comes to mind. But really, any Error exception indicate a condition where the best course of action is to throw up our hands and say "all right, let's bail".

You could make a simple wrapper function for your handlers if you cannot prevent AssertionError from being thrown:

(defn handle-job!
  [job-type payload]
  ;; Do stuff that might throw AssertionError
  )

(defn throwable-catching-handler-fn
  [f]
  (fn [job-type payload]
    (try
      (f job-type payload)
      (catch AssertionError e
        (throw (ex-info "AssertionError in handler" {} e))))))

(worker/create-queue-worker
  data-source
  (throwable-catching-handler-fn handle-job!))

Now, I would not recommend you do this if you have control of the code that throws the AssertionError.

AssertionError is something that either is thrown explicitly by the developer ((throw (AssertionError.))), or implicitly by the Clojure assert macro. Either way, in my mind an assertion is a very strong signal to (and from!) the programmer that a condition has occured that should not have been possible, and that the code must be fixed. There is no runtime fix possible. This is why Proletarian simply shuts down when encountering an uncaught Throwable. There is something wrong with the code (in the case of AssertionError) or the runtime enviroment (in other subclasses of Error), and it must be fixed. Proceeding with incorrect code is not a reasonable thing to do.

Now, you may have other intuitions about all this, and there may be constraints out of your control that prohibits you from rewriting the code to not throw AssertionError. Then I propose the above wrapper function as a solution to your predicament.

vemv commented 9 months ago

Although I generally would accurately disable assertions in production, I'd say that Proletarian's default behavior could be improved.

It's simple and easy enough to catch AssertionError in addition to Exception. It's not like that would open the doors to also handling disparate exceptions - you would have to worry about essentially two cases:

AssertionError has well-known semantics for Java and Clojure developers alike. It's been basically always there, it won't go anywhere either, so it seems best handled (which yes, it's a bit annoying, but it's part of making an exhaustive handler IME).

Most practically, this affects how Proletarian behaves in dev/test systems.

For your consideration 🙂

Thanks - V

msolli commented 9 months ago

I can see three approaches for how to handle AssertionError:

  1. Catch and ignore (and log it). Continue working on jobs.
  2. Catch and retry - the same semantics as for Exception.
  3. Don't catch and shutdown worker - the current approach.

How do you suggest we handle AssertionError? Am I missing an approach here?

vemv commented 9 months ago

Hi! I'd recommend 2

Let's take this defn as a case study:

(defn foo [x]
  {:pre [x]}
  (.toString x))

Let's say that a job invokes (foo nil)

In both cases, we can say, there was an application bug, so the best course of action is to fix the bug and retry the job.

The only thing that changes is that the AssertionError was thrown a little before.

Normally, a good assertion simply fails faster (and more clearly) code paths that would fail sooner or later anyway. So, because they're the "same", they deserve the same treatment IMO.

Cheers - V

msolli commented 9 months ago

OK, so we agree that AssertionError should make the job fail, and that this is caused by an application bug that must be fixed before the job can be retried.

Now, the retry semantics of approach 2 is to retry according to a retry strategy. This strategy is provided by the application, and is tailored to the use case of the job. It might be to retry a few times in quick succession and then back off exponentially, or a more hand-crafted approach that takes into account data from the Exception, or some other strategy. The likelihood that the bug has been fixed by the time the job is retried is in all cases not very great. Every retry until the bug is fixed will fail, both for this job and other instances of the same job type, causing noise and duplication of errors in the log, but more importantly: The job might fail permanently before a fix is deployed, because the retry strategy has determined that there are no more retries, causing the job to go into the Failed state.

Contrast this with the current approach (3). The worker, upon catching a Throwable, will stop the worker. The job will be left as it is in the queue, invoking no retry mechanism. If there are more worker instances working on the same queue, they will pick up the same job, and they too will fail and stop, leaving the job in the queue. The logs will report on the failed jobs, hopefully with a severity sufficient to catch the attention of the team. A fix must then be developed, tested and deployed, and the work can proceed.

The upside of this approach is twofold:

I can see the downside, too: Jobs that are in the same queue, but that are different job types, might not have the same problem as the failing message. But they would nevertheless be delayed by the worker downtime caused by the AssertionError. The remedy would be to have separate queues and workers for these jobs. This is more of an operational and team maturity issue, though, than a technical one. One must consider the business process implication of running different types of jobs in the same queue.

As you can tell, I'm not convinced that changing the current approach is the right thing to do. There's a workaround (the wrapper function in my first reply above) available if you need it. There's also the :proletarian/on-polling-error option, where you can handle Throwables however you like (but the retry strategy mechanism is still bypassed, so implementing approach 2 using this is not possible).

vemv commented 9 months ago

I can see the downside, too: Jobs that are in the same queue, but that are different job types, might not have the same problem as the failing message.

Yes, this nuance makes 3 too heavy-handed of an approach IMO.

It would make sense if the retry strategies did take into account the class/contents of the given exception. Approximate proposed logic:

This fine-grained logic would seem an overall improvement to me. Even when disregarding the AssertionError topic entirely, it makes sense to me that retriability should be encoded in (/ inferred from) the exception class/instance.

Thanks - V

msolli commented 3 weeks ago

I'm going to leave the design as it is. My reasoning has been laid out in this thread. A simple workaround is available (wrapper function, see my first comment in this issue) for folks that don't agree. Thanks for raising the issue and providing valuable feedback and pushback. Hopefully the discussion here can leave others capable of making their own decisions when it comes to handling AssertionErrors.