Report errors to the Rails error reporter

npezza93 commented 1 month ago

This was kinda gnarly to track down so apologies in advance for the crazy repro steps.

When running solid queue i noticed errors were not getting reported to the error reporter. I initially thought it was a rails problem but digging in more i noticed they are successfully reported on other adapters(like sidekiq) just not SQ. So i created this repo to help show off the issue: https://github.com/npezza93/job_error_test

gh repo clone https://github.com/npezza93/job_error_test
cd job_error_test
bundle
bin/rails db:prepare
bin/rails s

bin/rails c
ErrorJob.perform_later

Once you enqueue the job, go to the server and you should hit a bunch of irb bindings. The first is inside the ActiveSupport reloader. Nothing really to note here but in case you wanted to look around. The next is inside the ExecutionWrapper. This is where the issue lies. Inside the wrap method is where errors are rescued and then reported. But if you call active? it is true here and so it exits and doesnt rescue any errors. This execution wrapper gets setup inside activejob/lib/active_job/execution.rb and the execute callback is defined in activejob/lib/active_job/railtie.rb:78. (heres a full diff of changes i made to rails: https://github.com/rails/rails/compare/main...npezza93:rails:queue-test)

So from what i can tell wrapping the thread_execution inside Pool with wrap_in_app_executor basically wraps active job twice and so the ExecutionWrapper thinks it's already active.

In that test repo i have my branch on solid queue commented out in the Gemfile. If you uncomment and bundle you can see that active? is now false if you rerun everything.

Im honestly not sure if this wrapping was serving some other purpose that i've now broken. So if it is still needed i think we will need to wrap everything before and everything after ActiveJob::Base.execute(job.arguments) to avoid a double wrapping but let me know.

The errors were getting swallowed which is why the outer executor never saw them. Now i reraise the error when the job fails and then rerescue them on the outside of the executor so the thread doesnt error out. As Jean pointed out "ErrorHandle interface handles repeated reporting correctly" so having the on_thread_error rereport the error is fine as duplicate errors wont be reported.

npezza93 commented 1 month ago

Ill also note the app_executors are the same between the ActiveJob one and the SQ one. So im not entirely sure why the SQ executor never picks up the rescue if the AJ one thinks it's active. I would have expected the error to bubble up but it never does. @byroot if youre in here, maybe you have an idea?

byroot commented 1 month ago

Isn't it because the call you removed was in a background thread? Hence it's missing lots of context state?

Just a guess, I'd need to dig more as I'm not familiar at all with the SQ codebase.

npezza93 commented 1 month ago

The errors were getting swallowed which is why the outer executor never saw them. Now i reraise the error when the job fails and then rerescue them on the outside of the executor so the thread doesnt error out.

byroot commented 1 month ago

Ah I see. Perhaps we could evolve Executor#wrap that that when used recursively, the nested call still shortcut, but also still report errors. Especially now that the ErrorHandle interface handles repeated reporting correctly.

rosa commented 1 week ago

@npezza93 @byroot, thanks a lot for looking into this, and sorry for the delay! Good catch, I think this makes sense. I think I got confused in the past by looking at what Sentry, Rollbar and other error reporting tools did to report errors with Active Job 🤔 Or perhaps it was something else, but there was a reason not to re-raise there... but this was from the very beginning, so whatever it was, it might not be applicable now. Let me do some tests and review this carefully.

npezza93 commented 1 week ago

No problem @rosa! I've been testing in prod with this branch and so far so good. Let me know wif you find anything on your end!

rosa commented 6 days ago

Published version 1.0.2 with this fix 😊 Thanks again!

rails / solid_queue

Report errors to the Rails error reporter #373