taoensso / carmine

Redis client + message queue for Clojure
https://www.taoensso.com/carmine
Eclipse Public License 1.0
1.16k stars 131 forks source link

Silently catches OutOfMemoryErrors without rethrowing #301

Closed rap1ds closed 3 months ago

rap1ds commented 6 months ago

Hi!

First, the root cause: we have a memory leak somewhere in our own application code. That's a separate, non-carmine-related issue that I'm trying to solve separately.

But how this relates to carmine is that the memory leak caused our handler code to throw java.lang.OutOfMemoryError. This error is a non-recoverable error that should not be caught and should propagate and, eventually, kill the process.

We run our app in the cloud, and whenever the worker process dies, a new one is started automatically. However, this didn't happen, because it seems that carmine catches Throwables (the link points to version 3.2.0, which is the one we use), including errors like OutOfMemoryError.

I believe Throwables shouldn't be caught, or if they are caught for logging purposes like here, they should be rethrown. This will let the app crash, which is the only sensible thing to do for non-recoverable errors.

I searched for old issues and PRs and found this https://github.com/taoensso/carmine/pull/20, which is probably the origin of catching Throwables instead of Exceptions. In the PR, the argument for the change was that this way assertion errors can be caught. However, to my understanding, assertions throw an Error, not an Exception, for a reason, and they shouldn't be caught.

ptaoussanis commented 6 months ago

@rap1ds Hi Mikko, thanks for pinging about this - and for the crystal clear report and references. Really appreciate it 🙏

However, to my understanding, assertions throw an Error, not an Exception, for a reason, and they shouldn't be caught

This was indeed a very unfortunate decision IMO, and continues to cause endless hassles.

I'm definitely not keen on rethrowing all errors, since that'll include the type of assertion failures that will be part of normal operation (certainly in Clojure, at least).

Will give this some thought next time I'm on Carmine. May be a bit messy, but one solution would be conditionally rethrowing OOMs, etc.

Is there any urgency from your side for a fix in Carmine?

ptaoussanis commented 6 months ago

Oh, I should add- sorry for the trouble! This was definitely an oversight.

rap1ds commented 6 months ago

@ptaoussanis Thanks for super duper fast response 😄

I'm definitely not keen on rethrowing all errors May be a bit messy, but one solution would be conditionally rethrowing OOMs, etc.

Yeah, I totally understand. Rethrowing all errors would be a big breaking change for the existing behavior. Letting the user to configure if errors are rethrown was one possible solution that I had in my mind. But also, conditionally rethrowing some errors like OutOfMemory is also a good solution candidate (or the opposite, conditionally not rethrowing some errors like AssertionErrors).

Is there any urgency from your side for a fix in Carmine?

Definitely not, but thanks for asking, really appreciate it 🙏

I'm anyway trying to hunt down the root cause for the memory leak, so that will be the ultimate fix, hopefully, if I find it 😄

In addition, I was thinking as a workaround to wrap our handler function with a try-catch and catch Throwables there, and shutdown the process. I didn't fully test it yet but the idea is to do something like this:

;; handler function:
(try
  (do-the-work ,,,)
  (catch Exception e
    (do-some-logging e)
    )
  (catch Throwable t
    (do-some-logging t)

    ;; shutdown hook will handle stopping the system
    ;; use future to avoid dead-lock
    (future (.exit (Runtime/getRuntime) 0))

    ;; rethrow
    (throw t)))
ptaoussanis commented 6 months ago

(or the opposite, conditionally not rethrowing some errors like AssertionErrors)

That's a good idea 👍 I've just added support to Encore's experimental master for easily catching Exception + AssertionError.

Something like that should hopefully be reasonable. Will look closer at switching over when I'm next doing batched work on Carmine.

In the meantime feedback still very welcome if anyone wants to propose alternatives. I'm currently focused on Telemere - so it'll be a few more weeks before I switch context back to Carmine.

ptaoussanis commented 4 months ago

@rap1ds Hi Mikko! Just to update-

I've given some more thought to rethrowing critical handler errors, and am a little hesitant for a few reasons:

  1. While many users might want an OOM to trigger automatic handler shutdown, I'm not completely confident that all of them would want that behaviour.
  2. Even if a user does want an OOM to trigger handler shutdown, I believe there's a good chance that they'd want to do that in a more controlled way than just letting the worker thread throw. For example they may want to control whether the job will be retried later vs marked as done, and they may want to activate a controlled graceful shutdown procedure/sequence that's relevant to their app.

What I propose instead is just documenting the current behaviour, and recommending that users include appropriate error catching in their handler fn (much like you have in your "workaround" above).

That's a little more verbose, but I think would help maximize control and visibility of semantics.

That'd also help resolve the question of exactly which errors to consider critical, since it'd be up to the user to decide what's relevant in their case.

Does that seem reasonable to you?

rap1ds commented 4 months ago

@ptaoussanis Thanks for coming back to the issue

Yes, that sounds reasonable. I think it's a good idea to let the user decide what to do.

I did some investigation and thinking about what the options are to handle OOM on the user's side, and here's what I found:

1. Use JVM flag -XX:+ExitOnOutOfMemoryError

When this flag is set, JVM will abort on the first occurrence of OOM. In my local testing, the abort will happen even if the OOM Error is caught and not rethrown. The problem with this flag is that it just aborts immediately, meaning that it doesn't run any shutdown hooks and doesn't let the user do any cleanup.

2. Catch the error on the handler and exit the system

This option is the "workaround" mentioned in my earlier comment:

;; handler function:
(try
  (do-the-work ,,,)
  (catch Exception e
    (do-some-logging e)
    )
  (catch Throwable t
    (do-some-logging t)

    ;; shutdown hook will handle stopping the system
    ;; use future to avoid dead-lock
    (future (.exit (Runtime/getRuntime) 0))

    ;; rethrow
    (throw t)))

I see one problem with this: It's kinda ad-hoc and handles the OOM only on this piece of code. Other parts of the code base need to implement their own OOM handling. In that sense it would be nicer to let the OOM error bubble up and let the (Thread/setDefaultUncaughtExceptionHandler ,,,) handler decide what to do. Also, in case the OOM happens inside Carmine's code that is outside of this try-catch block but still inside Carmine's own try-catch (which catches Throwables), the OOM error is caught but not bubbling up, meaning that the user doesn't have a chance to initiate shutdown.

3. Catch and rethrow the error on the handler and let it bubble to default uncaught exception handler

In this option, we set a (Thread/setDefaultUncaughtExceptionHandler ,,,) handler that decides what to do with OOM error.

In the handler, we catch and rethrow the error keeping in mind that we need to do it in another thread to escape the Carmine's try-catch block. In my first attempt I used (future (throw t)) but that failed since exceptions inside future are always caught by the Future. TIL. So I changed the future to a plain Thread.

;; handler function:
(try
  (do-the-work ,,,)
  (catch Exception e
    (do-some-logging e)
    )
  (catch Throwable t
    (do-some-logging t)

    ;; use a new Thread to escape Carmine's try-catch. 
    ;; Let the default uncaught exception handler decide what to do.
    (.start (Thread. (fn [] (throw t)))))

    ;; rethrow
    (throw t)))

The nice thing about this option is that OOM handling can now happen in one single place for the whole codebase, inside the default uncaught handler. However, this option still suffers from the fact that if the OOM happens inside Carmine's code and Carmine catches the OOM, user doesn't have chance to initiate shutdown, same as with the option 2.

4. Add error callback to Carmine

The last option I have in my mind requires a new error callback in Carmine.

Carmine catches the Throwables and if I read the source code correctly, it rethrows it a couple of times but the bubbling of the exception ends here. One thing that could be useful is that instead of just logging the error and retrying after a backoff time, Carmine could call a user-defined error callback with the Throwable as the parameter. The user could then decide what to do with the error. And since the callback isn't inside the try-catch anymore, the Throwable could be rethrown and it would bubble up.

So all in all it would look something like this:

;; shutdown hook
(defn- shutdown []
  (com.stuartsierra.component/stop-system system))

(.addShutdownHook (Runtime/getRuntime)
                  (Thread. shutdown))

;; uncaught exception handler
(defn ex-handler []
  (reify Thread$UncaughtExceptionHandler
    (uncaughtException [_ thread ex]
      (do-some-logging ex)
      (when (instance? OutOfMemoryError ex)
        ;; This triggers shutdown hook
        (.exit (Runtime/getRuntime) 137)))))

(Thread/setDefaultUncaughtExceptionHandler (ex-handler))

;; start worker
(let [worker-conf (-> {:lock-ms ,,,
                       :nthreads ,,,
                       :eoq-backoff-ms ,,,
                       :throttle-ms ,,,
                       :handler job-handler
                       ;; new callback
                       :on-exception (fn [t]
                                       ;; throw and let bubble up
                                       (throw t))})]

  (car-mq/worker redis-conn queue-name worker-conf))

With this option the OOM handling can be completely removed from the handler function. No need for a new thread "trick" to escape the try-catch block. Also, if the OOM happens outside of the handler but in Carmine's code, inside the try/catch, Carmine will still notify the user about it, and let the user initiate shutdown if they decide so.

Conclusion

I think I'm going with option 3 for now, but if you think adding an error callback would be useful and decide to add it, then I'll switch to option 4.

ptaoussanis commented 4 months ago

Thanks for the thoughtful and clear feedback Mikko 🙏

I'm not keen on Option 4 (O4) since to my understanding that wouldn't provide much benefit over O2 (which I believe is also simpler and more idiomatic).

BTW I'm inclined to prefer O2 over O3 since O3 is coupled to the surrounding catching behaviour, which isn't within your control. O2 seems a more direct capture of your intentions- if this thing OOMs, do a system shutdown.

If you're concerned about duplicating code here and in your UncaughtExceptionHandler, you could just pull that code out into a function and call it from both.

So it seems your remaining discomfort comes from here:

Other parts of the code base need to implement their own OOM handling. [...] in case the OOM happens inside Carmine's code that is outside of this try-catch block but still inside Carmine's own try-catch [...]

Is there some specific place you have in mind? Most interactions with the Carmine API will be synchronous. So the only places I can think of off-hand that might be relevant to silent OOMs would be background threads from the message queue worker or Pub/Sub listener.

rap1ds commented 4 months ago

BTW I'm inclined to prefer O2 over O3 since O3 is coupled to the surrounding catching behaviour, which isn't within your control. O2 seems a more direct capture of your intentions- if this thing OOMs, do a system shutdown.

If you're concerned about duplicating code here and in your UncaughtExceptionHandler, you could just pull that code out into a function and call it from both.

That's a good point. Indeed, that would probably be more straightforward and easier to reason about.

Is there some specific place you have in mind? Most interactions with the Carmine API will be synchronous. So the only places I can think of off-hand that might be relevant to silent OOMs would be background threads from the message queue worker or Pub/Sub listener.

No, I don't have any specific place in my mind. And my discomfort on this one is really minor :) I guess if the thing OOMs it's most likely going to happen in the handler since that's where the heaviest work is going to happen.