Closed: bramski closed this issue 7 years ago.
Does the WARNING: you don't own a lock of type ExclusiveLock have something to do with this?
That warning message would come up if you try to release an advisory lock that hasn't been taken by that connection (it might also come up in other cases). Are you using pgbouncer, or some other connection pool outside the app? That may cause that message. Or, are you doing anything unusual with your PG connections?
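For illustration, here is a minimal sketch (not from this thread) of how that warning is produced: releasing an advisory lock on a connection whose backend never acquired it returns false and logs the warning. It assumes a pool size of at least 2, so the two checkouts return distinct connections.

conn_a = ActiveRecord::Base.connection_pool.checkout
conn_b = ActiveRecord::Base.connection_pool.checkout

conn_a.execute("SELECT pg_try_advisory_lock(42)")  # lock is held by connection A's backend
conn_b.execute("SELECT pg_advisory_unlock(42)")    # connection B never took it, so Postgres logs:
                                                   #   WARNING: you don't own a lock of type ExclusiveLock
                                                   # and the unlock returns false

ActiveRecord::Base.connection_pool.checkin(conn_a)
ActiveRecord::Base.connection_pool.checkin(conn_b)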
Are you sure the job is being destroyed after it is run? Que tries to destroy jobs on its own if you haven't explicitly called destroy on them, but it uses a @destroyed instance variable in the job instance itself to determine whether that is necessary, and your logic may be writing to that variable. Or possibly you're calling destroy in a savepoint that's getting rolled back while the main transaction continues to completion? The fact that you've marked destroy as public is a little suspicious to me, but I don't know what your actual job logic looks like.
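For reference, the mechanism in question looks roughly like this (paraphrased from memory of Que's job.rb around that version, not an exact copy): the job destroys itself after run unless it has already marked itself destroyed.

class Que::Job
  def _run
    run(*attrs[:args])
    destroy unless @destroyed   # skipped only if the job already destroyed itself during run
  end

  def destroy
    # private in Que by default; the poster made it public in their subclass
    Que.execute :destroy_job, attrs.values_at(:queue, :priority, :run_at, :job_id)
    @destroyed = true
  end
end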
It's just Que & AR. No pgbouncer. Destroy was made public so that we can put it inside the transaction explicitly and expose it to the AJ job. The transaction isn't cancelling or getting rolled back. In fact in the log you're seeing there is no transaction on the job which is repeating. Adding a transaction to that job made it less consistently repeatable, but still repeatable.
The job is definitely destroyed. We added an AR model to check it's existence when being run and it's typically not there. About 80%, but it's still possible that the AR connection thinks the job is there. I'm guessing due to the same MVCC problem that the code states. Redis-style locking manages to be consistent here.
Is this running in Rails production mode?
Seeing it in staging and production.
We've tuned the Que worker to be more rapidly responsive. We were seeing this rather intermittently, then tuned the worker wake interval to 0.01 and started seeing it much more consistently.
That's not a ridiculous wake interval. I still suspect that there's something unusual about your setup, or perhaps your Postgres configuration? Have you changed the default transaction isolation, or are you somehow running the workers themselves inside of a transaction?
I doubt that there's anything fundamentally unexpected about advisory locks and how they interact with MVCC, or it would have come up sometime in the last three years. Of course, if you had a self-contained reproduction that demonstrated the issue, that would be helpful.
Also, I don't know what's causing those "you don't own a lock of type ExclusiveLock" warnings; those aren't a normal thing and are a sign that something is wrong.
Postgres is vanilla Heroku.
Version:
PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
We haven't done anything to the transaction isolation. Procfile:
web: bundle exec puma -C config/puma.rb
worker: bundle exec que --worker-count $WORKER_QUE_WORKERS --wake-interval $WORKER_QUE_WAKE_INTERVAL ./config/environment.rb
Postgres settings:
postgres_settings.txt
(It's actually a CSV, so rename it and open it; GitHub doesn't seem to like CSV files.)
Looks like read committed is the default for the Heroku PG instances.
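As an aside (a quick check, not something from the thread), the effective isolation level can be confirmed from the app console:

ActiveRecord::Base.connection.select_value("SHOW default_transaction_isolation")
# => "read committed"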
Is there anything in your code/job that could be starting new database connections? Forking processes for example? Or stopping processes?
That warning indicates that the advisory lock is being dropped and so it would make sense that Que would re-run the job.
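One way to check that (a sketch, not something posted in this thread) is to watch pg_locks from a separate session while a job is running; if the advisory-lock row disappears before the job finishes, the lock was dropped:

rows = ActiveRecord::Base.connection.select_all(<<-SQL).to_a
  SELECT pid, classid, objid, granted
  FROM pg_locks
  WHERE locktype = 'advisory'
SQL
# Que derives the advisory-lock key from the job_id, so an entry vanishing
# mid-run means the locking backend released it (or its connection went away).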
You're looking at all the code pre-AJ, @joevandyk. Nothing fancy after that, just business logic.
We haven't changed the Que.mode, but we are running a separate Que process.
Correction: Que.mode = :off in our staging environment.
I can repro this consistently locally on my own Postgres. With the concurrent worker count set to 1, it doesn't reproduce, so I suspect this is because the unlock is happening on a different connection. I'm on Rails 5.0.0 and using the AR adapter.
Hm, I haven't tried Rails 5 yet. Wonder if 5 is doing something different?
WRT the ActiveRecord adapter, it probably is.
It appears that work checks out a connection... but every internal call to execute also checks out a new connection.
Sounds like our interface to the connection pool broke somehow under Rails 5.
Rails 5 has taken a more predictable approach than previous comments made me think. Every time you check out, it gives you a different connection.
The pool checks out new connections for different threads, but then gives whichever connection is available on your thread's next checkout. So the more threads you have, the less likely the next checkout will give you the same connection.
I'll do my best to refactor and issue a PR that makes sense.
I'm on my phone so I can't check, but the AR adapter uses with_connection, right? Does that not do the same thing anymore? Or is it no longer reentrant?
It appears to no longer be reentrant.
I've attempted it with ActiveRecord::Base.connection and that doesn't seem to yield the proper result either. The connection pool reaper will recycle that connection if your worker does blocking IO.
Ensuring that the worker loop uses the same connection seems paramount to actually holding the advisory locks. Why not refactor the code in a way that ensures that behavior?
Might also be worth setting up a concurrent integration test somehow.
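Something along these lines is what I mean (a hypothetical sketch, not Que's actual worker code): pin one connection for the whole loop, so the lock, the job's queries, and the unlock all hit the same Postgres backend. lock_next_job is a hypothetical helper.

ActiveRecord::Base.connection_pool.with_connection do |conn|
  loop do
    job = lock_next_job(conn)   # hypothetical: SELECT next job + pg_try_advisory_lock, all via conn
    break if job.nil?
    job.run
    conn.execute("SELECT pg_advisory_unlock(#{job.id})")   # same backend that took the lock
  end
end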
https://github.com/chanks/que/pull/167 should fix this. It's not my favorite fix, but it avoids a major refactor that would prevent all of the re-entrant checkouts currently done on the adapter.
There are other problems in master that prevent that fix from working well: internal errors cause the connections not to be cleaned up, and then the pool is stuck without a way to check out a new connection. Please let me know if you will have time to fix these problems, or else I will have to take up this work myself.
I've long since started moving apps off of ActiveRecord, and I'm not familiar with the changes that have gone into it for 5.0 (I've been relying on pull requests for Rails-specific functionality for a while now). I'm also not really willing to completely refactor the worker system to support it, when a reentrant method that yields a connection for the current thread is a much simpler API to rely on (and one that 1.0 relies even more heavily on, btw).
That said, there must be some way to get this behavior from ActiveRecord 5.0 without too much code.
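For context, the adapter's checkout is essentially the following (paraphrased, not an exact copy of the gem): it leans entirely on with_connection being reentrant and yielding the same raw PG connection for the duration of the block.

module Que
  module Adapters
    class ActiveRecord < Base
      def checkout
        ::ActiveRecord::Base.connection_pool.with_connection do |conn|
          yield conn.raw_connection   # same underlying PG connection for the whole block
        end
      end
    end
  end
end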
This issue just ran my Heroku PG instance out of memory, at about 3am. Not awesome. Any update, @chanks?
Haven't had time to look into this. Input from someone who knows ActiveRecord 5 internals would be appreciated.
I'm seeing this problem as well. I really enjoy Que + AR and hope this can be resolved. @bramski Have you had any luck?
Zero. We had to write a Redis workaround. We will switch to RabbitMQ. No solutions presented here for Rails 5.
Damn... well this is really sad. I was really looking forward to using this on a really important project.
@bramski so your PR doesn't actually help the issue then?
Nope. We moved to Sidekiq.
Is there a way to do super reliable exactly-once processing with Sidekiq? Everything I read makes it sound amazing for performance and whatnot, but I have stuff that's absolutely critical to be run on time, reliably (we're dealing with payment processing). That's why I liked the ACID guarantees of Que (and Delayed Job is too slow).
Not of importance to us.
Sidekiq Pro offers this guarantee. I've never dug that far into its implementation, but we use it in an app at my day job and there haven't been problems with it that I recall.
If you're interested in a job queue with a larger community that can offer better and easier Rails/ActiveRecord integration, and ACID guarantees aren't important to you, it's a great choice!
The Rails docs still suggest that the with_connection behavior is unchanged. The only changes to that code I see are from @thedarkone, but I don't see any obvious problems that would have broken this behavior. https://github.com/rails/rails/commit/603fe20c0b8c05bc1cca8d01caadbd7060518141
Since I've been pinged: with_connection is reentrant.
ActiveRecord::Base.connection_pool.with_connection do |conn_outer|
  ActiveRecord::Base.connection_pool.with_connection do |conn_inner|
    conn_outer == conn_inner # => true

    ActiveRecord::Base.connection_pool.with_connection do |another_conn_inner|
      conn_outer == another_conn_inner # => true
      conn_outer == conn_inner         # => true
      conn_inner == another_conn_inner # => true
    end
  end

  ActiveRecord::Base.connection_pool.with_connection do |repeated_conn_inner|
    conn_outer == repeated_conn_inner # => true
  end
end
Leaving the outer-most with_connection doesn't guarantee that the pool will provide the same connection next time (obviously):
ActiveRecord::Base.connection_pool.with_connection do |conn|
  @leaked_conn = conn
end

ActiveRecord::Base.connection_pool.with_connection do |conn|
  @leaked_conn == conn # => might be true or false, depends on the conn pool state/other threads etc.
end
Note that manually returning a conn to the pool breaks the with_connection contract:
ActiveRecord::Base.connection_pool.with_connection do |conn|
  # breaking the with_connection contract:
  ActiveRecord::Base.connection_pool.checkin conn
  # see also pool.release_connection, pool.remove(conn), or ActiveRecord::Base.clear_active_connections!

  ActiveRecord::Base.connection_pool.with_connection do |conn_inner|
    conn_inner == conn # => might be true or false, depends on the conn pool state/other threads etc.
  end
end
My best guess at what is happening: third-party or app code is manually releasing/returning the connection to the pool.
Hmm, that's interesting. Thanks for the reassurance, @thedarkone.
This doesn't seem to be a Que issue, so I'm going to close it. If anyone has any ideas on how to (cleanly) prevent this kind of occurrence, I'm open to hearing them, but I think it's reasonable for us to assume that with_connection is going to be both reentrant and hold the connection for the length of the block, since it's designed that way. Anyone else's code that breaks that contract is probably behaving irresponsibly, I would say.
This is reproducible with vanilla Rails 5 and Que. Why would you close it? The issue is not fixed.
Has anyone put together a reproduction with vanilla Rails 5?
The original issue this was linked to was vanilla AR with Que.
And no other gems? It looks like there was something in your setup that was breaking the with_connection contract.
I know nothing about Que. @chanks, can you test it? Or, if I wanted to poke at this, all I need is a Postgres-based vanilla Rails 5 app, and then I just follow the README.md's Installation/Usage sections, right?
@thedarkone yes, it should be pretty straightforward, though I haven't actually used Rails 5 myself, so it's possible something's missing.
Hello,
We're having a consistently repeatable problem over here with what we think is a fundamental problem in the MVCC advisory-lock model in use here.
I see this code, and it doesn't seem to be effective: https://github.com/chanks/que/blob/master/lib/que/job.rb#L88
Screenshot of our logs:
Our job runner looks like:
We've added Rails cache which uses redis to ensure that the jobs aren't being run twice.
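The guard is roughly this kind of thing (a hypothetical sketch of the approach described above, not the exact code; note that the truthiness of Rails.cache.write with unless_exist depends on the cache store being Redis-backed):

class IdempotentJob < Que::Job   # hypothetical example class
  def run(*args)
    # With a Redis-backed Rails.cache, unless_exist maps to SETNX, so only the
    # first writer for this job_id gets a truthy result back.
    first_run = Rails.cache.write("que-job-#{attrs[:job_id]}", true,
                                  unless_exist: true, expires_in: 1.hour)
    return unless first_run

    # ... business logic ...

    destroy
  end
end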