ruby-shoryuken / shoryuken

A super efficient Amazon SQS thread based message processor for Ruby
Other
2.05k stars 279 forks source link

Recommended ActiveJob error handling is no longer relevant as of Rails 7.1.3 #775

Open jonny-ras opened 1 month ago

jonny-ras commented 1 month ago

After we upgraded our app to Rails 7.1.3, we noticed that errors were no longer going to our configured DLQs.

We had our set up following the recommended setup, from this comment: https://github.com/ruby-shoryuken/shoryuken/issues/553#issuecomment-465813111

In our ApplicationJob class we have the following, but we were never seeing any raised exceptions go to the DLQ.

retry_on ::StandardError, wait: 30.seconds, attempts: 3 do |job, exception|
   capture_exception(job, exception) unless ::Rails.env.development?
   job.log_error(message: "Error performing job", error: exception, p: job.arguments.first) if job.class.instance_variable_get(:@on_failure)
end

After a lot of digging into ActiveJob, I found this change: https://github.com/rails/rails/pull/48010/files

The relevant part of it is here: https://github.com/rails/rails/pull/48010/files#diff-c017408d8f8e9b0a914da28fdcebc291c5fba624174fb9ca5bec3b4fc1c61d51

According to the Rails docs, this after_discard block is meant to re-raise the exception, but from my testing it doesn't. If the after_discard block doesn't raise the exception, the exception is not automatically re-raised, and so Shoryuken thinks everything is OK, and the message doesn't DLQ.

I've fixed this by hooking into the ActiveSupport::Notification active_job.retry_stopped event, and re-raising the exception there.

# config/initializers/event_subscribers.rb
::ActiveSupport::Notifications.subscribe("retry_stopped.active_job", ::EventSubscribers::ActiveJobSubscriber.new)
# frozen_string_literal: true

# This class exists to circumvent ActiveJob not re-raising exceptions in the retry_on block.
# When those exceptions are not re-raised, the error is not bubbled up to Shoryuken, which means the message does not get DLQ'd.
#

module EventSubscribers
  class ActiveJobSubscriber
    def call(*args)
      event = ActiveSupport::Notifications::Event.new(*args)

      raise event.payload[:error] if event.name == "retry_stopped.active_job"
    end
  end
end

This issue is to mainly ask if this is indeed correct, should we be re-raising to Shoryuken for it to DLQ, and if not, how should I handle this?

I also assume the Rails code is broken, as it is not automatically re-raising the error for me, which is obviously not something to be handled by this gem, but it seems to be the current state of affairs (having checked Rails main, no changes there yet, I'll probably open an issue).