que-rb / que

A Ruby job queue that uses PostgreSQL's advisory locks for speed and reliability.
MIT License
2.31k stars 188 forks source link

Que jobs running concurrently have interference on Rails 5. #166

Closed bramski closed 7 years ago

bramski commented 8 years ago

Hello,

We're having a consistently repeatable problem over here of what we think is a fundamental problem of the MVCC advisory lock model in use here.

I see this code, and it doesn't seem to be effective: https://github.com/chanks/que/blob/master/lib/que/job.rb#L88

Screenshot of our logs: as_a_user_i_want_lucy_to_deal_with_facebook_sending_the_same_message_twice_so_that_i_don_t_have_a_confused_conversation__-_lucy_-_confluence

Our Job Runner Looks like:

require 'que'

module ActiveJob
  module QueueAdapters
    class ImhrJobRunner < ::Que::Job
      # Places the que job in job_data["que_ref"]
      # which can be picked up by ApplicationJob
      def run(job_data)
        @job_data = job_data

        if que_job_done? # we only need to check cause Que picks up destroyed jobs.
          log_rerun_issue_info
        else
          job_data["que_ref"] = self
          ApplicationJob.execute job_data
        end

        mark_que_job_as_done # if you reach this point it's done
      end

      public :destroy

      private

      attr_reader :job_data

      def log_rerun_issue_info
        Rails.logger.warn "#{self.class.name}: Que tried to rerun #{job_data["job_class"]}: " \
                          "#{job_data["job_id"]}"
      end

      def que_job_done?
        Rails.cache.exist?(que_job_cache_key)
      end

      def mark_que_job_as_done
        Rails.cache.write(que_job_cache_key, true, expires_in: 10.seconds)
      end

      def que_job_cache_key
        ["que-job-done", job_data["job_id"]].join("-")
      end
    end
  end
end

We've added Rails cache which uses redis to ensure that the jobs aren't being run twice.

j15e commented 7 years ago

Hello, I have the same issue with my Rails 5 setup.

I get the WARNING: you don't own a lock type of ExclusiveLock too. Running ruby 2.3.1p112, rails 5.0.0.1 and psql (PostgreSQL) 9.4.4. Almost a vanilla project, I doubt anything is conflicting.

Really unfortunate issue because the gem seems great otherwise.

You could use the apparaisal gem to test que specs against multiple rails versions.

m-o-e commented 7 years ago

I'm seeing the same WARNING here. Vanilla Rails5, single que worker.

This issue should be re-opened. The que README should contain a clear warning that que is not compatible with Rails5.

chanks commented 7 years ago

Alright, issue reopened. I'm happy to accept a patch that isn't too invasive to Que's connection management.

bgentry commented 7 years ago

Since nothing changed in Que, it seems like this must be from some kind of change on the Rails/ActiveRecord side.

Has anybody been able to confirm that this issue is not present in a blank Rails 4.x app but is present in a blank Rails 5.x app?

bgentry commented 7 years ago

hey folks, I'd love to dig into this with some of my spare time these next couple weeks. Looking back through the issue, though, I don't see anything linking to a basic Rails app that reproduces this problem.

Can somebody who has encountered this issue please just create a brand new Rails app that has this problem and link to the Github repo with an explanation of how you trigger the errors? Ideally it would have as few changes as necessary from the initial generated app.

nbibler commented 7 years ago

@bgentry: I created a minimum viable Rails 5 + Queue 0.12 app which shows the issue very apparently:

https://github.com/nbibler/rails-5-que-exclusivelocking

You can see the commit history on that repo contains nothing unusual other than forcing Que to async mode in tests and avoiding the Rails ActiveJob::TestCase inheritance to allow Que to be used in test mode.

thedarkone commented 7 years ago

@nbibler thank you!

So what happens is that since Rails 5 it is expected that app servers/job queues are using an executor (Rails.application.executor) or a reloader (Rails.application.reloader) when entering "Rails application space" (my wording, not an official moniker). This is required so that Rails is able to maintain caches/connections/code reloading etc. ActiveJob does this automatically for que here: https://github.com/rails/rails/blob/93c9534/activejob/lib/active_job/queue_adapters/que_adapter.rb#L32 (via ActiveJob::Base.execute).

The problem is that que is entering Rails application space a bit earlier than expected. The solution is to declare the intent like this:

diff --git a/lib/que/adapters/active_record.rb b/lib/que/adapters/active_record.rb
index 083e1e6..6feeada 100644
--- a/lib/que/adapters/active_record.rb
+++ b/lib/que/adapters/active_record.rb
@@ -59,7 +59,9 @@ module Que
       private

       def checkout_activerecord_adapter(&block)
-        ::ActiveRecord::Base.connection_pool.with_connection(&block)
+        ::Rails.application.executor.wrap do
+          ::ActiveRecord::Base.connection_pool.with_connection(&block)
+        end
       end
     end
   end

Que::Adapters::ActiveRecord#cleanup!'s call into ::ActiveRecord::Base.clear_active_connections! thus becomes probably unnecessary as Rails does that automatically upon leaving an outer most executor block (executor is reentrant).

For more info there is a WIP guide rails/rails#27494 in the works.

chanks commented 7 years ago

Thanks thedarkone, very interesting. Tomorrow is my open source day at work so I should be able to review all that and probably make a bugfix release.

On Thu, Jan 19, 2017 at 9:21 PM thedarkone notifications@github.com wrote:

@nbibler https://github.com/nbibler thank you!

So what happens is that since Rails 5 it is expected that app servers/job queues are using an executor (Rails.application.executor) or a reloader ( Rails.application.reloader) when entering "Rails application space" (my wording, not an official moniker). This is required so that Rails is able to maintain caches/connections/code reloading etc. ActiveJob does this automatically for que here: https://github.com/rails/rails/blob/93c9534/activejob/lib/active_job/queue_adapters/que_adapter.rb#L32 (via ActiveJob::Base.execute).

The problem is that que is entering Rails application space a bit earlier than expected. The solution is to declare the intent like this:

diff --git a/lib/que/adapters/active_record.rb b/lib/que/adapters/active_record.rb

index 083e1e6..6feeada 100644

--- a/lib/que/adapters/active_record.rb

+++ b/lib/que/adapters/active_record.rb

@@ -59,7 +59,9 @@ module Que

   private

   def checkout_activerecord_adapter(&block)
  • ::ActiveRecord::Base.connection_pool.with_connection(&block)

  • ::Rails.application.executor.wrap do

  • ::ActiveRecord::Base.connection_pool.with_connection(&block)

  • end

    end

    end

    end

Que::Adapters::ActiveRecord#cleanup!'s call into ::ActiveRecord::Base.clear_active_connections! thus becomes probably unnecessary as Rails does that automatically https://github.com/rails/rails/blob/3c785bd/activerecord/lib/active_record/query_cache.rb#L39 upon leaving an outer most executor block (executor is reentrant).

For more info there is a WIP guide rails/rails#27494 https://github.com/rails/rails/pull/27494 in the works.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/chanks/que/issues/166#issuecomment-273959427, or mute the thread https://github.com/notifications/unsubscribe-auth/AASUGtMjw6KwlR12jSCHsRAzWsilRtkCks5rUBoagaJpZM4JUVge .

bgentry commented 7 years ago

@nbibler thank you for being a good OSS citizen and providing a reproduction!

@thedarkone thanks for so quickly finding the issue afterwards and suggesting a fix :clap:

@chanks lmk if I can help, but it looks like you've got a handle on this one now 👍

chanks commented 7 years ago

I've just spent some time investigating this with @nbibler's reproduction (thanks @nbibler!).

Wrapping the connection checkout in ::Rails.application.executor.wrap{} as @thedarkone suggests does eliminate the warnings, but it seems to be a suboptimal solution to me, as it assumes that a Rails application is present, which it won't necessarily be (this solution won't work for someone who's using Que with ActiveRecord but not the rest of Rails, for example).

I tried to figure out how ActiveRecord in isolation manages its execution context, because I'd like to know more about what's going on under the covers here, but the only place that I can find in ActiveRecord master that interacts with the executor is the query cache. The query cache seemed like a plausible source for this problem (it could try to cache the job lock query, which results in the same job being worked twice and Que thinking the lock has been taken when it hasn't) but I tried disabling it and the warnings continued.

Rails.application.executor refers to an anonymous subclass of ActiveSupport::Executor, while the query cache adds its hooks to the ActiveSupport::Executor class itself. It occurred to me that the following should solve the problem more cleanly (without a dependency on Rails.application), but it doesn't have any effect, and I'm not sure why yet:

if defined?(ActiveSupport::Executor)
  ActiveSupport::Executor.wrap do
    ::ActiveRecord::Base.connection_pool.with_connection(&block)
  end
else
  ::ActiveRecord::Base.connection_pool.with_connection(&block)
end

I'll try to dig into this some more later, but if anyone that knows Rails' internals can point me in the right direction, it'd be much appreciated.

thedarkone commented 7 years ago

@chanks yes, the problem stems from QueryCache (see here), because as things stand today - it does "after request" connection clean up, where "after request" is the executor's outermost block. In terms of que's code base, right now the executor's outermost block ends right after this line (upon the return from run(*attrs[:args]) the current thread's connection was "cleaned up" == returned to the pool).

Up until rails/rails@d3c9d808e3e conn clean up was a responsibility of a separate rails middleware, with that commit this functionality was merged into QueryCache (yes, this is wrong separation of concerns, but things would still be broken even if it was kept separate, as surely that code would still be injected into executor life-cycle and que would still run into trouble 😞). Disabling QueryCache doesn't help, because the code in question still runs (yes, this might be a bug, but the right fix is to move the connection release into its on executor callback, but that still wouldn't help que).

As to how properly handle running que with vanilla active_record: QueryCache's executor callback is installed via a railtie https://github.com/rails/rails/blob/5473e39/activerecord/lib/active_record/railtie.rb#L161, so if run outside of full Rails stack, the code that breaks que is not used. Furthermore ActiveJob hooks into Rails executor via a railtie as well https://github.com/rails/rails/blob/master/activejob/lib/active_job/railtie.rb#L25. In other words, outside of Rails (1) executor doesn't have the clean up code and (2) active job does not hook up into executor.

The correct way to handle this is probably via:

if defined?(Rails)
  Rails.application.executor.wrap do
    ::ActiveRecord::Base.connection_pool.with_connection(&block)
  end
else
  ::ActiveRecord::Base.connection_pool.with_connection(&block)
end

Or que's own railtie could set a "running in Rails, better use an executor" flag (since if que's railtie is run, QueryCache railtie will also run etc).

nbibler commented 7 years ago

Presuming we'd like to keep Rails 4 compatibility in the library, then perhaps doing something like:

if defined?(Rails) && defined?(Rails.application.executor)
  Rails.application.executor.wrap do
    ::ActiveRecord::Base.connection_pool.with_connection(&block)
  end
else
  ::ActiveRecord::Base.connection_pool.with_connection(&block)
end

Would allow existing behavior to continue in Rails 4 and ActiveRecord-only environments. But, if Rails.application.executor exists, it'll instead jump to the Rails 5-compatibility mode and use it.

chanks commented 7 years ago

Alright, I just applied that fix and pushed 0.12.1. Thanks everyone for your help!

nbibler commented 7 years ago

Thanks everyone! The 0.12.1 release appears to fix the issue in the test repo.

wtfiwtz commented 7 years ago

I'm getting a follow-up issue to this with Que 0.13.1 and Que.mode = :async - seems to be related to Rails 5.1.3 connection management in a recently upgraded app (4.2.6 => 5.1.3). Basically rails server, rails console and rspec can often deadlock, and getting these strange errors sometimes too: https://www.postgresql.org/message-id/7205.1262205377%40sss.pgh.pa.us

The solution seems to be switch to :sync mode in the short-term.

I'm looking into this a bit further when I have a chance...

chanks commented 7 years ago

Hmm, I wonder if this Rails.application.executor API changed in 5.1...

On Sun, Aug 6, 2017 at 7:02 AM, Nigel Sheridan-Smith < notifications@github.com> wrote:

I'm getting a follow-up issue to this with Que 0.13.1 and Que.mode = :async - seems to be related to Rails 5.1.3 connection management in a recently upgraded app (4.2.6 => 5.1.3). Basically rails server, rails console and rspec can often deadlock, and getting these strange errors sometimes too: https://www.postgresql.org/message-id/7205.1262205377%40sss.pgh.pa.us

The solution seems to be switch to :sync mode in the short-term.

I'm looking into this a bit further when I have a chance...

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/chanks/que/issues/166#issuecomment-320499970, or mute the thread https://github.com/notifications/unsubscribe-auth/AASUGjXK8YM5uUthYOcYzfaDxFFt1xW3ks5sVZ1VgaJpZM4JUVge .

chanks commented 7 years ago

I updated @nbibler's example app (which replicates the problem in a test case) and I'm able to confirm that the behavior didn't exist in 5.0.5 but does in 5.1.0. While looking through the 5.1 release notes this note about transactional tests stood out to me as an issue, so I tried setting self.use_transactional_tests = false in the test class and that seems to fix it in the test app.

So I guess the lesson from that is don't use :async inside Rails' transactional tests, but that's a place where you should be using :sync anyway. I don't see any reason yet why it would fail in rails server or rails console...?

chanks commented 7 years ago

Alright, I was able to reproduce the issue in a console session and I'm pretty sure the issue is in our hooking into AR's transaction callbacks to wake a worker after a transaction inserting a job is committed. @wtfiwtz Would you mind pointing your Gemfile to commit ba76522 and see if that fixes the issue for you?

wtfiwtz commented 7 years ago

Hi @chanks thanks for looking at this so quickly! Greatly appreciated!

It initially looked like your change did fix it, but I've run rspec quite a few times and the issue is still there. I've tried pg version 0.19 and 0.20, rails versions 5.0.5 and 5.1.3 on commit ba76522 and they still have the issue.

It definitely seems to be related to the rspec transactional tests, and definitely more frequently with rails 5.1.3 than 5.0.5... I often have to restart Postgres to reset the database. I think the rails console and server deadlocks are just due to the postgres connections getting corrupted.

image

In my most recent testing, rspec tests would succeed up till a point. The last test that succeeded gives this error: PG::UnableToSend: insufficient data in "T" message (in the after test hooks?) and all subsequent tests give PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK

rspec-log.txt

This is with PostgreSQL 9.6.2.

Technically I don't need to use :async mode anymore (we've moved off Heroku onto my own Digital Ocean box now) but it would be good to figure out why it's now an issue.

I've installed sigdump now so hopefully I'll be able to dump a stack trace when it deadlocks again.

Maybe hold off on any further testing until I can isolate this a bit further ;)... should be able to spend some time on it later tonight...

chanks commented 7 years ago

My understanding from my research today is that Rails changed its behavior inside transactional tests, which is one reason why my reproduction today was failing. But I don't think it should be an issue for users in general, because :async mode isn't really intended to be used inside a transactional test anyway. In testing you should be using :off mode (if you want jobs to simply be stored in the DB, so you can assert on their storage state) or the :sync mode (so jobs are run immediately in the same thread, and you can assert on their effects).

Or am I misunderstanding what you're trying to do?

On Sun, Aug 6, 2017 at 6:00 PM Nigel Sheridan-Smith < notifications@github.com> wrote:

Hi @chanks https://github.com/chanks thanks for looking at this so quickly! Greatly appreciated!

It initially looked like your change did fix it, but I've run rspec quite a few times and the issue is still there. I've tried pg version 0.19 and 0.20, rails versions 5.0.5 and 5.1.3 on commit ba76522 and they still have the issue.

It definitely seems to be related to the rspec transactional tests, and definitely more frequently with rails 5.1.3 than 5.0.5... I often have to restart Postgres to reset the database. I think the rails console and server deadlocks are just due to the postgres connections getting corrupted.

[image: image] https://user-images.githubusercontent.com/1445450/29007373-435a0c20-7b45-11e7-8db5-5da28ae86783.png

In my most recent testing, rspec tests would succeed up till a point. The last test that succeeded gives this error: PG::UnableToSend: insufficient data in "T" message (in the after test hooks?) and all subsequent tests give PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK

rspec-log.txt https://github.com/chanks/que/files/1203475/rspec-log.txt

This is with PostgreSQL 9.6.2.

Technically I don't need to use :async mode anymore (we've moved off Heroku onto my own Digital Ocean box now) but it would be good to figure out why it's now an issue.

I've installed sigdump now so hopefully I'll be able to dump a stack trace when it deadlocks again.

Maybe hold off on any further testing until I can isolate this a bit further ;)... should be able to spend some time on it later tonight...

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/chanks/que/issues/166#issuecomment-320535681, or mute the thread https://github.com/notifications/unsubscribe-auth/AASUGmw5_ClpWKVEzDa-9EG2gykAB7Uyks5sVjeOgaJpZM4JUVge .

wtfiwtz commented 7 years ago

Hi @chanks,

I think you are right - I've reviewed the changes for transactional tests and that is definitely a result of the threads sharing the DB connection during rollbacks with some sort of race condition.

https://github.com/rails/rails/pull/28083

Also some related discussions here:

https://github.com/rails/rails/issues/28466 https://github.com/rails/rails/issues/29164 https://github.com/rosenfeld/rspec_nested_transactions/issues/2

So the solution is to disable Que's :async mode, or disable config.use_transactional_fixtures = false in Rspec, or self.use_transactional_tests = false inside individual tests... there isn't any way to disable the new behaviour in Rails 5.1.x otherwise from what I can see. As you have said above, the :sync mode is more appropriate for tests anyway, so it tests your job's behaviour at the same time.

Not yet sure why it deadlocked on Rails 5.0.5 but maybe that was a one-off... 🙁

Thanks for your help! Something to be aware of in multi-threaded Rails code from now on.

chanks commented 7 years ago

No problem!

So, I fixed the issue temporarily by just commenting out the AR transaction hooks to wake a local worker after the transaction commits. I'm not sure that it's super-necessary to keep that code, since most people don't run the workers in their web processes anymore, I think, and LISTEN/NOTIFY is the better long-term solution anyway.

@joevandyk you wrote that code originally (and I think you have a better understanding of the AR transaction hooks than I do), is it still important for you and do you want to take a stab at updating it for Rails 5.1? If not I'm inclined to just remove it.

chanks commented 7 years ago

@wtfiwtz I just released 0.14.0 without the AR transaction hook. Thanks for bringing this to my attention!

joevandyk commented 7 years ago

I'd remove it.

On Tue, Aug 8, 2017 at 7:11 AM Chris Hanks notifications@github.com wrote:

Yeah.

So, I fixed the issue temporarily by just commenting out the AR transaction hooks to wake a local worker after the transaction commits. I'm not sure that it's super-necessary to keep that code, since most people don't run the workers in their web processes anymore, I think, and LISTEN/NOTIFY is the better long-term solution anyway.

@joevandyk https://github.com/joevandyk you wrote that code originally (and I think you have a better understanding of the AR transaction hooks than I do), is it still important for you and do you want to take a stab at updating it for Rails 5.1? If not I'm inclined to just remove it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/chanks/que/issues/166#issuecomment-320968064, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAEmxdNocXusN3ZoOcJqnMXdjjkFEf-ks5sWGycgaJpZM4JUVge .

wtfiwtz commented 7 years ago

Thanks @chanks! I appreciate your efforts to jump on this so quickly!