Closed bramski closed 7 years ago
Hello, I have the same issue with my Rails 5 setup.
I get the WARNING: you don't own a lock type of ExclusiveLock
too. Running ruby 2.3.1p112
, rails 5.0.0.1
and psql (PostgreSQL) 9.4.4
. Almost a vanilla project, I doubt anything is conflicting.
Really unfortunate issue because the gem seems great otherwise.
You could use the apparaisal gem to test que specs against multiple rails versions.
I'm seeing the same WARNING here. Vanilla Rails5, single que worker.
This issue should be re-opened. The que README should contain a clear warning that que is not compatible with Rails5.
Alright, issue reopened. I'm happy to accept a patch that isn't too invasive to Que's connection management.
Since nothing changed in Que, it seems like this must be from some kind of change on the Rails/ActiveRecord side.
Has anybody been able to confirm that this issue is not present in a blank Rails 4.x app but is present in a blank Rails 5.x app?
hey folks, I'd love to dig into this with some of my spare time these next couple weeks. Looking back through the issue, though, I don't see anything linking to a basic Rails app that reproduces this problem.
Can somebody who has encountered this issue please just create a brand new Rails app that has this problem and link to the Github repo with an explanation of how you trigger the errors? Ideally it would have as few changes as necessary from the initial generated app.
@bgentry: I created a minimum viable Rails 5 + Queue 0.12 app which shows the issue very apparently:
https://github.com/nbibler/rails-5-que-exclusivelocking
You can see the commit history on that repo contains nothing unusual other than forcing Que to async mode in tests and avoiding the Rails ActiveJob::TestCase
inheritance to allow Que to be used in test mode.
@nbibler thank you!
So what happens is that since Rails 5 it is expected that app servers/job queues are using an executor (Rails.application.executor
) or a reloader (Rails.application.reloader
) when entering "Rails application space" (my wording, not an official moniker). This is required so that Rails is able to maintain caches/connections/code reloading etc. ActiveJob
does this automatically for que
here: https://github.com/rails/rails/blob/93c9534/activejob/lib/active_job/queue_adapters/que_adapter.rb#L32 (via ActiveJob::Base.execute
).
The problem is that que
is entering Rails application space a bit earlier than expected. The solution is to declare the intent like this:
diff --git a/lib/que/adapters/active_record.rb b/lib/que/adapters/active_record.rb
index 083e1e6..6feeada 100644
--- a/lib/que/adapters/active_record.rb
+++ b/lib/que/adapters/active_record.rb
@@ -59,7 +59,9 @@ module Que
private
def checkout_activerecord_adapter(&block)
- ::ActiveRecord::Base.connection_pool.with_connection(&block)
+ ::Rails.application.executor.wrap do
+ ::ActiveRecord::Base.connection_pool.with_connection(&block)
+ end
end
end
end
Que::Adapters::ActiveRecord#cleanup!
's call into ::ActiveRecord::Base.clear_active_connections!
thus becomes probably unnecessary as Rails does that automatically upon leaving an outer most executor block (executor is reentrant).
For more info there is a WIP guide rails/rails#27494 in the works.
Thanks thedarkone, very interesting. Tomorrow is my open source day at work so I should be able to review all that and probably make a bugfix release.
On Thu, Jan 19, 2017 at 9:21 PM thedarkone notifications@github.com wrote:
@nbibler https://github.com/nbibler thank you!
So what happens is that since Rails 5 it is expected that app servers/job queues are using an executor (Rails.application.executor) or a reloader ( Rails.application.reloader) when entering "Rails application space" (my wording, not an official moniker). This is required so that Rails is able to maintain caches/connections/code reloading etc. ActiveJob does this automatically for que here: https://github.com/rails/rails/blob/93c9534/activejob/lib/active_job/queue_adapters/que_adapter.rb#L32 (via ActiveJob::Base.execute).
The problem is that que is entering Rails application space a bit earlier than expected. The solution is to declare the intent like this:
diff --git a/lib/que/adapters/active_record.rb b/lib/que/adapters/active_record.rb
index 083e1e6..6feeada 100644
--- a/lib/que/adapters/active_record.rb
+++ b/lib/que/adapters/active_record.rb
@@ -59,7 +59,9 @@ module Que
private def checkout_activerecord_adapter(&block)
::ActiveRecord::Base.connection_pool.with_connection(&block)
::Rails.application.executor.wrap do
::ActiveRecord::Base.connection_pool.with_connection(&block)
end
end
end
end
Que::Adapters::ActiveRecord#cleanup!'s call into ::ActiveRecord::Base.clear_active_connections! thus becomes probably unnecessary as Rails does that automatically https://github.com/rails/rails/blob/3c785bd/activerecord/lib/active_record/query_cache.rb#L39 upon leaving an outer most executor block (executor is reentrant).
For more info there is a WIP guide rails/rails#27494 https://github.com/rails/rails/pull/27494 in the works.
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/chanks/que/issues/166#issuecomment-273959427, or mute the thread https://github.com/notifications/unsubscribe-auth/AASUGtMjw6KwlR12jSCHsRAzWsilRtkCks5rUBoagaJpZM4JUVge .
@nbibler thank you for being a good OSS citizen and providing a reproduction!
@thedarkone thanks for so quickly finding the issue afterwards and suggesting a fix :clap:
@chanks lmk if I can help, but it looks like you've got a handle on this one now 👍
I've just spent some time investigating this with @nbibler's reproduction (thanks @nbibler!).
Wrapping the connection checkout in ::Rails.application.executor.wrap{}
as @thedarkone suggests does eliminate the warnings, but it seems to be a suboptimal solution to me, as it assumes that a Rails application is present, which it won't necessarily be (this solution won't work for someone who's using Que with ActiveRecord but not the rest of Rails, for example).
I tried to figure out how ActiveRecord in isolation manages its execution context, because I'd like to know more about what's going on under the covers here, but the only place that I can find in ActiveRecord master that interacts with the executor is the query cache. The query cache seemed like a plausible source for this problem (it could try to cache the job lock query, which results in the same job being worked twice and Que thinking the lock has been taken when it hasn't) but I tried disabling it and the warnings continued.
Rails.application.executor
refers to an anonymous subclass of ActiveSupport::Executor, while the query cache adds its hooks to the ActiveSupport::Executor class itself. It occurred to me that the following should solve the problem more cleanly (without a dependency on Rails.application
), but it doesn't have any effect, and I'm not sure why yet:
if defined?(ActiveSupport::Executor)
ActiveSupport::Executor.wrap do
::ActiveRecord::Base.connection_pool.with_connection(&block)
end
else
::ActiveRecord::Base.connection_pool.with_connection(&block)
end
I'll try to dig into this some more later, but if anyone that knows Rails' internals can point me in the right direction, it'd be much appreciated.
@chanks yes, the problem stems from QueryCache
(see here), because as things stand today - it does "after request" connection clean up, where "after request" is the executor's outermost block. In terms of que
's code base, right now the executor's outermost block ends right after this line (upon the return from run(*attrs[:args])
the current thread's connection was "cleaned up" == returned to the pool).
Up until rails/rails@d3c9d808e3e conn clean up was a responsibility of a separate rails middleware, with that commit this functionality was merged into QueryCache
(yes, this is wrong separation of concerns, but things would still be broken even if it was kept separate, as surely that code would still be injected into executor life-cycle and que
would still run into trouble 😞). Disabling QueryCache
doesn't help, because the code in question still runs (yes, this might be a bug, but the right fix is to move the connection release into its on executor callback, but that still wouldn't help que
).
As to how properly handle running que
with vanilla active_record
: QueryCache
's executor callback is installed via a railtie https://github.com/rails/rails/blob/5473e39/activerecord/lib/active_record/railtie.rb#L161, so if run outside of full Rails stack, the code that breaks que
is not used. Furthermore ActiveJob
hooks into Rails executor via a railtie as well https://github.com/rails/rails/blob/master/activejob/lib/active_job/railtie.rb#L25. In other words, outside of Rails (1) executor doesn't have the clean up code and (2) active job does not hook up into executor.
The correct way to handle this is probably via:
if defined?(Rails)
Rails.application.executor.wrap do
::ActiveRecord::Base.connection_pool.with_connection(&block)
end
else
::ActiveRecord::Base.connection_pool.with_connection(&block)
end
Or que
's own railtie could set a "running in Rails, better use an executor" flag (since if que
's railtie is run, QueryCache
railtie will also run etc).
Presuming we'd like to keep Rails 4 compatibility in the library, then perhaps doing something like:
if defined?(Rails) && defined?(Rails.application.executor)
Rails.application.executor.wrap do
::ActiveRecord::Base.connection_pool.with_connection(&block)
end
else
::ActiveRecord::Base.connection_pool.with_connection(&block)
end
Would allow existing behavior to continue in Rails 4 and ActiveRecord-only environments. But, if Rails.application.executor
exists, it'll instead jump to the Rails 5-compatibility mode and use it.
Alright, I just applied that fix and pushed 0.12.1. Thanks everyone for your help!
Thanks everyone! The 0.12.1 release appears to fix the issue in the test repo.
I'm getting a follow-up issue to this with Que 0.13.1 and Que.mode = :async
- seems to be related to Rails 5.1.3 connection management in a recently upgraded app (4.2.6 => 5.1.3). Basically rails server
, rails console
and rspec
can often deadlock, and getting these strange errors sometimes too:
https://www.postgresql.org/message-id/7205.1262205377%40sss.pgh.pa.us
The solution seems to be switch to :sync
mode in the short-term.
I'm looking into this a bit further when I have a chance...
Hmm, I wonder if this Rails.application.executor API changed in 5.1...
On Sun, Aug 6, 2017 at 7:02 AM, Nigel Sheridan-Smith < notifications@github.com> wrote:
I'm getting a follow-up issue to this with Que 0.13.1 and Que.mode = :async - seems to be related to Rails 5.1.3 connection management in a recently upgraded app (4.2.6 => 5.1.3). Basically rails server, rails console and rspec can often deadlock, and getting these strange errors sometimes too: https://www.postgresql.org/message-id/7205.1262205377%40sss.pgh.pa.us
The solution seems to be switch to :sync mode in the short-term.
I'm looking into this a bit further when I have a chance...
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/chanks/que/issues/166#issuecomment-320499970, or mute the thread https://github.com/notifications/unsubscribe-auth/AASUGjXK8YM5uUthYOcYzfaDxFFt1xW3ks5sVZ1VgaJpZM4JUVge .
I updated @nbibler's example app (which replicates the problem in a test case) and I'm able to confirm that the behavior didn't exist in 5.0.5 but does in 5.1.0. While looking through the 5.1 release notes this note about transactional tests stood out to me as an issue, so I tried setting self.use_transactional_tests = false
in the test class and that seems to fix it in the test app.
So I guess the lesson from that is don't use :async
inside Rails' transactional tests, but that's a place where you should be using :sync
anyway. I don't see any reason yet why it would fail in rails server
or rails console
...?
Alright, I was able to reproduce the issue in a console session and I'm pretty sure the issue is in our hooking into AR's transaction callbacks to wake a worker after a transaction inserting a job is committed. @wtfiwtz Would you mind pointing your Gemfile to commit ba76522
and see if that fixes the issue for you?
Hi @chanks thanks for looking at this so quickly! Greatly appreciated!
It initially looked like your change did fix it, but I've run rspec
quite a few times and the issue is still there. I've tried pg
version 0.19 and 0.20, rails
versions 5.0.5 and 5.1.3 on commit ba76522
and they still have the issue.
It definitely seems to be related to the rspec transactional tests, and definitely more frequently with rails 5.1.3 than 5.0.5... I often have to restart Postgres to reset the database. I think the rails console and server deadlocks are just due to the postgres connections getting corrupted.
In my most recent testing, rspec tests would succeed up till a point. The last test that succeeded gives this error: PG::UnableToSend: insufficient data in "T" message
(in the after test hooks?) and all subsequent tests give PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK
This is with PostgreSQL 9.6.2.
Technically I don't need to use :async
mode anymore (we've moved off Heroku onto my own Digital Ocean box now) but it would be good to figure out why it's now an issue.
I've installed sigdump
now so hopefully I'll be able to dump a stack trace when it deadlocks again.
Maybe hold off on any further testing until I can isolate this a bit further ;)... should be able to spend some time on it later tonight...
My understanding from my research today is that Rails changed its behavior inside transactional tests, which is one reason why my reproduction today was failing. But I don't think it should be an issue for users in general, because :async mode isn't really intended to be used inside a transactional test anyway. In testing you should be using :off mode (if you want jobs to simply be stored in the DB, so you can assert on their storage state) or the :sync mode (so jobs are run immediately in the same thread, and you can assert on their effects).
Or am I misunderstanding what you're trying to do?
On Sun, Aug 6, 2017 at 6:00 PM Nigel Sheridan-Smith < notifications@github.com> wrote:
Hi @chanks https://github.com/chanks thanks for looking at this so quickly! Greatly appreciated!
It initially looked like your change did fix it, but I've run rspec quite a few times and the issue is still there. I've tried pg version 0.19 and 0.20, rails versions 5.0.5 and 5.1.3 on commit ba76522 and they still have the issue.
It definitely seems to be related to the rspec transactional tests, and definitely more frequently with rails 5.1.3 than 5.0.5... I often have to restart Postgres to reset the database. I think the rails console and server deadlocks are just due to the postgres connections getting corrupted.
[image: image] https://user-images.githubusercontent.com/1445450/29007373-435a0c20-7b45-11e7-8db5-5da28ae86783.png
In my most recent testing, rspec tests would succeed up till a point. The last test that succeeded gives this error: PG::UnableToSend: insufficient data in "T" message (in the after test hooks?) and all subsequent tests give PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK
rspec-log.txt https://github.com/chanks/que/files/1203475/rspec-log.txt
This is with PostgreSQL 9.6.2.
Technically I don't need to use :async mode anymore (we've moved off Heroku onto my own Digital Ocean box now) but it would be good to figure out why it's now an issue.
I've installed sigdump now so hopefully I'll be able to dump a stack trace when it deadlocks again.
Maybe hold off on any further testing until I can isolate this a bit further ;)... should be able to spend some time on it later tonight...
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/chanks/que/issues/166#issuecomment-320535681, or mute the thread https://github.com/notifications/unsubscribe-auth/AASUGmw5_ClpWKVEzDa-9EG2gykAB7Uyks5sVjeOgaJpZM4JUVge .
Hi @chanks,
I think you are right - I've reviewed the changes for transactional tests and that is definitely a result of the threads sharing the DB connection during rollbacks with some sort of race condition.
https://github.com/rails/rails/pull/28083
Also some related discussions here:
https://github.com/rails/rails/issues/28466 https://github.com/rails/rails/issues/29164 https://github.com/rosenfeld/rspec_nested_transactions/issues/2
So the solution is to disable Que's :async
mode, or disable config.use_transactional_fixtures = false
in Rspec, or self.use_transactional_tests = false
inside individual tests... there isn't any way to disable the new behaviour in Rails 5.1.x otherwise from what I can see. As you have said above, the :sync
mode is more appropriate for tests anyway, so it tests your job's behaviour at the same time.
Not yet sure why it deadlocked on Rails 5.0.5 but maybe that was a one-off... 🙁
Thanks for your help! Something to be aware of in multi-threaded Rails code from now on.
No problem!
So, I fixed the issue temporarily by just commenting out the AR transaction hooks to wake a local worker after the transaction commits. I'm not sure that it's super-necessary to keep that code, since most people don't run the workers in their web processes anymore, I think, and LISTEN/NOTIFY is the better long-term solution anyway.
@joevandyk you wrote that code originally (and I think you have a better understanding of the AR transaction hooks than I do), is it still important for you and do you want to take a stab at updating it for Rails 5.1? If not I'm inclined to just remove it.
@wtfiwtz I just released 0.14.0 without the AR transaction hook. Thanks for bringing this to my attention!
I'd remove it.
On Tue, Aug 8, 2017 at 7:11 AM Chris Hanks notifications@github.com wrote:
Yeah.
So, I fixed the issue temporarily by just commenting out the AR transaction hooks to wake a local worker after the transaction commits. I'm not sure that it's super-necessary to keep that code, since most people don't run the workers in their web processes anymore, I think, and LISTEN/NOTIFY is the better long-term solution anyway.
@joevandyk https://github.com/joevandyk you wrote that code originally (and I think you have a better understanding of the AR transaction hooks than I do), is it still important for you and do you want to take a stab at updating it for Rails 5.1? If not I'm inclined to just remove it.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/chanks/que/issues/166#issuecomment-320968064, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAEmxdNocXusN3ZoOcJqnMXdjjkFEf-ks5sWGycgaJpZM4JUVge .
Thanks @chanks! I appreciate your efforts to jump on this so quickly!
Hello,
We're having a consistently repeatable problem over here of what we think is a fundamental problem of the MVCC advisory lock model in use here.
I see this code, and it doesn't seem to be effective: https://github.com/chanks/que/blob/master/lib/que/job.rb#L88
Screenshot of our logs:
Our Job Runner Looks like:
We've added Rails cache which uses redis to ensure that the jobs aren't being run twice.