Hi @tenderlove, your change to #active? in the mentioned commit (https://github.com/rails/rails/commit/34c7e73c1def1312e59ef1f334586ff2f668246e) 8 months ago has been causing us some major pain with recovering from dead database connections. What sort of trouble did a select 1 query cause during the connection pool's reaping?
Reaping happens in a background thread. We can't send any data over the socket because there are no locks around the file descriptor (that I know of).
Awesome bug report @aetherknight. I'm also experiencing a variant of this problem in my application - the StatementPool#connection_active? method is also broken in a similar way.
Here's the related error, which can occur when calling establish_connection:
PG::UnableToSend: SSL SYSCALL error: Connection timed out
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:517:in `exec`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:517:in `dealloc`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:500:in `block in clear`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:499:in `each_value`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:499:in `clear`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:563:in `clear_cache!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract_adapter.rb:339:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:588:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:310:in `block (2 levels) in disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:308:in `each`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:308:in `block in disconnect!`
/usr/local/rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/monitor.rb:211:in `mon_synchronize`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:306:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:564:in `remove_connection`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_handling.rb:88:in `remove_connection`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_handling.rb:45:in `establish_connection`
So I think there may be a few possible approaches to fixing this:
1. Make ActiveRecord recover from a dead connection at the point of use: rescue the failure, reconnect, and retry. This would make the #active? and #connection_active? methods not particularly meaningful anymore (since you discover a dead connection by attempting to do work on it), and reaping stale connections in the ConnectionManager would need to be based only on connections that weren't checked back in (the current logic only reaps a connection that is in use, is stale, and is no longer active).
2. Add locking around uses of the raw connection. #active? and #connection_active? would then probably need to handle timeouts on the lock, or their caller would need to handle that case.
For the latter approach, #active? and #connection_active? would then do work on the connection (eg, go back to select 1) within the synchronize/monitor/lock code (a rough sketch appears below).
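To make the second approach concrete, here is a hypothetical sketch (not actual Rails code; the class and method names are mine) of what locking around the raw connection could look like, so the reaper's liveness probe and regular queries can't interleave on the socket:

# Hypothetical sketch of approach 2: every use of the raw PG connection,
# including the reaper's liveness probe, goes through one monitor so two
# threads never write to the socket at the same time.
require 'monitor'
require 'pg'

class LockedConnection
  def initialize(raw_conn)
    @raw_conn = raw_conn
    @lock = Monitor.new
  end

  def execute(sql)
    @lock.synchronize { @raw_conn.exec(sql) }
  end

  # Safe to call from the reaper thread: the probe holds the same lock
  # as regular queries instead of racing them on the file descriptor.
  def active?
    @lock.synchronize { @raw_conn.exec('SELECT 1') }
    true
  rescue PG::Error
    false
  end
end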
Yep. The former definitely makes more sense to me.
@aetherknight I agree, we should go with solution #1. It seems the mysql adapter has a setting to do this automatically, but the pg adapter does not. We'll have to rescue exceptions wherever we call methods on the connection, then call PGConn#reset. Though we should probably do a loop with PGConn#reset_start + PGConn#reset_poll so we don't block the process.
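A rough sketch of that non-blocking reset loop, assuming the pg gem's reset_start / reset_poll / socket API (illustrative only, not a proposed patch):

# Drive PGConn#reset_start / #reset_poll with IO.select instead of the
# blocking PGConn#reset, so other threads in the process keep running.
require 'pg'

def reset_without_blocking(conn)
  conn.reset_start
  io = IO.for_fd(conn.socket)
  io.autoclose = false
  status = PG::PGRES_POLLING_WRITING
  until status == PG::PGRES_POLLING_OK || status == PG::PGRES_POLLING_FAILED
    case status
    when PG::PGRES_POLLING_READING then IO.select([io], nil, nil)
    when PG::PGRES_POLLING_WRITING then IO.select(nil, [io], nil)
    end
    status = conn.reset_poll
  end
  raise PG::ConnectionBad, conn.error_message if status == PG::PGRES_POLLING_FAILED
end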
Will another stacktrace help out here? I'm getting intermittent Ruby segfaults in the same place with a Capybara test suite: https://gist.github.com/bradleypriest/7660331
Same issue here, using Sidekiq, though I'm not sure that is the main cause of the problem. Is any fix already pending? My app totally broke and raises ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within ... after about 1 or 2 page reloads.
Is there a temporary fix for this issue? It's preventing my app from running at all, locally or in production. I tried increasing the pg pool size and the timeout; it all fails, and this problem comes back every time. The connection pool is now at 100, which still doesn't help. There are only 10 active pg connections. How is the rest of the world running their Rails apps on PG while this issue persists?
Stacktrace: https://gist.github.com/rubytastic/7786482
@rubytastic You could determine why your db connections are dying so quickly. I don't have any pgsql experience but I know MySQL has a timeout setting where the server will kill any clients that haven't used the socket in that time period.
For the short term, we monkeypatched the PostgreSQLAdapter to use the older version of #active? that performed a select 1 query, since we aren't using Ruby threading to handle multiple requests, nor are we using the reaper in its own thread.
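For anyone wanting to do something similar, the monkeypatch essentially just restores the old query-based check; a paraphrased sketch (check your Rails version's adapter for the exact method it replaces):

# Initializer sketch: revert PostgreSQLAdapter#active? to a Rails
# 3.2-style "SELECT 1" probe. Only reasonable if you aren't relying on
# the reaper running in a separate thread.
require 'active_record/connection_adapters/postgresql_adapter'

module ActiveRecord
  module ConnectionAdapters
    class PostgreSQLAdapter
      def active?
        @connection.query 'SELECT 1'
        true
      rescue PGError
        false
      end
    end
  end
end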
@aetherknight Unfortunately that's not an option for Sidekiq users.
@mperham Ah, good point.
@aetherknight Total PG newbie here, just switched from MySQL. Am I understanding correctly that there is no solid solution, nor will one be implemented in the short term? Then this is a serious issue for the PG + Sidekiq(?) combination, and I wonder why so few people have this issue.
So would switching back to MySQL be the best option for now, or is there a monkey patch that could be applied to fix the problem temporarily?
@tenderlove by my reading, solution 2 is necessary.
If you can't tell whether a connection is #active?, you either never reap it, or you regularly reap (stale) still-active connections -- and so the server could end up with many more connections than your pool limit.
But then, maybe I'm just missing something, because as far as I can see, neither of the mysql adapters is any safer from a reaper in a different thread than the SELECTive postgres version.
@matthewd I am beginning to think that both solutions are ultimately necessary:
#active? is fundamentally a bad way of determining whether any network connection is still alive, because there is still a race condition between when #active? is called and when the connection is then used (this does not apply to the reaper's use of #active?, but it does to other uses such as in #verify!). This can only be solved by solution 1.
Also, if I understand the current reaper code correctly, it will only reap a connection if it is currently in use, it has been longer than dead_connection_timeout since it was checked out, and #active? reports that the connection is dead. That means that it will not reap (close and remove from the connection pool) a dangling connection from a dead thread until the OS, the network, or the remote server kills the connection --- which can take a minute or more of inactivity on the connection. However, this may be the only way to determine whether the connection is dangling without pulling a connection out from under a long-lived thread (as may be the case with sidekiq jobs). Which sounds like we also need solution 2, in order to allow the reaper to actually detect dead connections.
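Paraphrasing that reaping rule as code (a hypothetical helper, not the actual connection_pool.rb internals):

# A connection is only reaped when all three conditions hold: it is
# still checked out, it went stale dead_connection_timeout seconds ago,
# and #active? says it is dead.
def connections_to_reap(connections, dead_connection_timeout)
  stale_cutoff = Time.now - dead_connection_timeout
  connections.select do |conn|
    conn.in_use? && conn.last_use < stale_cutoff && !conn.active?
  end
end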
All this stems from the fact that the connection is not self-repairing, yes? Dalli and Redis client connections are pooled with my connection_pool gem all the time. connection_pool documents that all connections should be self-repairing so that the pool doesn't need to worry about reaping or repairing.
@mperham not just self repairing. If a thread checks out a connection, then dies before checking the connection back in, what happens? This is why the reaper checks on "long lived" connections.
@tenderlove If a thread somehow does not check in connections, they'd quickly exhaust. What's wrong with that? connection_pool only documents block usage, with_connection {|conn| ... }, to ensure it's automatically checked back in. Which AR API can leak connections? You talking about conn = Model.connection?
@mperham yes, that will leak a connection if done inside a Thread.new { } (and you don't specifically check it back in)
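For illustration, a minimal sketch of the leaky pattern versus the block form (assuming a typical AR model class named Model):

# Leaks: the implicit checkout in Model.connection is tied to this
# thread, and nothing checks the connection back in when the thread dies.
Thread.new do
  conn = Model.connection
  conn.execute('SELECT 1')
end

# Doesn't leak: the block form checks the connection back in when the
# block exits, even if the thread dies afterwards.
Thread.new do
  ActiveRecord::Base.connection_pool.with_connection do |conn|
    conn.execute('SELECT 1')
  end
end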
Yep, that form of API should be deprecated in preference for with_connection in Rails 5 IMO. I know that doesn't help today though. :-/
@aetherknight Yeah, I could definitely get behind an idea of arranging automatic reconnection iff you're performing a SELECT (or BEGIN), and there isn't [already] supposed to be a transaction open.
If you're trying to send an INSERT, say, you'd want to be very sure the exception meant "driver failed to send the query", and not "driver lost connection while waiting to hear what happened", before silently retrying.
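To make that caveat concrete, here's a hypothetical sketch (the names and the safety test are mine, not anything in Rails): only reconnect and retry when the caller knows no transaction is supposed to be open and the statement is a plain SELECT or BEGIN:

# Retry only when it's provably safe: a SELECT/BEGIN with no transaction
# supposed to be open. Anything else (e.g. an INSERT) re-raises, because
# we can't tell "failed to send" apart from "lost connection mid-flight".
require 'pg'

def exec_with_cautious_retry(conn, sql, in_transaction: false)
  conn.exec(sql)
rescue PG::UnableToSend, PG::ConnectionBad
  raise if in_transaction || sql !~ /\A\s*(SELECT|BEGIN)\b/i
  conn.reset   # blocking reset for brevity; a real fix would poll
  conn.exec(sql)
end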
@tenderlove can/does the pool keep track of which thread checked out the connection? We can determine pretty easily that the thread's gone away... though whether you trust that it's safe to return the connection to the pool, vs just tell it to close, is another matter.
Or is there a worry about code that goes out of its way to check out a connection in one thread, hand it to another, then have the first thread die?
@matthewd The pool does keep track of which thread checks out the connection, but you shouldn't rely on it (it uses the object id of the thread). We used to loop through every thread in the system, check its status, then check the pool, but I'd prefer a solution that doesn't depend on doing that. We could probably stuff the thread in a hash or something when it checks out, then just have the reaper check connections for those threads.
We shouldn't worry about one thread passing the connection to another. I don't think we support that (I think it's documented too).
Also, we should probably split this into two issues: one for self-repairing connections, and one for dealing with long-lived and possibly dead threads. I tried adding code for self-repairing connections, but TBH I can't repro this issue and I don't want to rescue from just any old PG exception.
@tenderlove I did a thing: https://github.com/matthewd/rails/compare/master...pg_verify
Does that look at all sane?
As far as I can see, those two tests both need to pass for the reaper and checkout_and_verify to both function correctly... and no less-invasive solution comes to mind.
It looks sane, but I think you can make the locks more granular, and possibly remove some locks. For example, the locks should probably be inside the "log" calls.
The connection shouldn't be shared among threads (except the reaper). Does the reaper impact the statement cache? I'm not sure we need that lock.
Shouldn't there be a lock in "active?"
Looks good!
I wrapped the logging so we weren't claiming to be running a query when we were actually waiting, but that's easy to swap if you prefer.
For the connection management methods, I made the locks wider out of principle -- it's implausible to ever affect a SELECT 1, but for purity, nothing ought to see a half-configured connection.
Deleting from the statement cache sends a DEALLOCATE to the raw connection (as can writing, because eviction).
Yes, I managed to lose the most important lock while splitting the commits. How about I fix that? :blush:
Anyway, if you're happy this is a reasonable option, I'll turn it into a pull request shortly, and we can continue discussion over there.
@tenderlove how have you attempted to repro this? I'd try setting up something that closes the connection on the DB side some time after a thread has established the connection. We're using PGBouncer between Rails 4 and PG, and are getting the PG::UnableToSend exceptions regularly in threaded workers that sit inactive for longer than PGBouncer's timeout. I'm sure the OP's HAProxy setup provides a similar situation.
Actually, there's an even easier way to reproduce in Rails 4.0.2:
2.0.0-p353> ActiveRecord::Base.connection.execute 'SELECT 1'
=> #<PG::Result ...
# kill and restart Postgres
2.0.0-p353> ActiveRecord::Base.connection.active?
=> true
2.0.0-p353> ActiveRecord::Base.connection.execute 'SELECT 1'
PG::UnableToSend: : SELECT 1
I'm kind of surprised at the lack of attention (from users) this issue has gotten. It seems like this could be a serious issue for a lot of users, no?
No, most people don't put a proxy between their app server and db server which kills idle connections.
But, if I understand correctly, this issue is present when you restart your database server, for example. Right now with Rails 3.2 you don't need to restart your application afterwards (which is expensive); with this issue in Rails 4 you cannot do the same. This, to me, seems like the most obvious way that most people would run into it.
@mperham most people needing to scale ActiveRecord + Postgres do exactly that. Unlike MySQL, Postgres can't handle more than a few hundred simultaneous connections (process per connection). If you have any sort of traffic, a proxy to multiplex these connections is crucial and recommended by Postgres admins everywhere (including Heroku).
I imagine the reason this issue hasn't been complained about is because most of the teams using ActiveRecord + Postgres at scale haven't moved to 4.0 yet.
Ok, we run MySQL and haven't restarted our database in the last year so I'm obviously pretty far from this issue. Thanks for the info, very informative.
@tanelsuurhans Indeed; error fallout from a DB restart (for a minor version update) is precisely what brought me here.
Technically, you don't need to restart your application... you'll only get one error per newly-dead connection, so depending on your server setup, you could contrive a single destined-to-fail request to each worker process, or something.
Similarly, for most of my apps (largely because the DB restart was in the middle of the night), the errors were actually all 'absorbed' by our normal site monitoring, and everything had fixed itself before a real user came along.
@kainosnoema For a number of (fairly common) scaling profiles, a connection proxy is absolutely The Right Thing. But I'm not so sure that most people do the "which kills idle [client] connections" part.
Having the same issue with mysql and postgres since upgrading to rails 4.x (tested 4.0.1 and 4.0.2); it always times out with:
could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)
Just plain OS X + rails + mysql or postgres, no fancy server setup. In development mode using Puma. No complex queries; reloading any page a few times makes this occur.
@rubytastic It sounds like you need to increase your database connection pool size. Does not sound like it's related to this issue.
@tanelsuurhans I have the pool set to 16, and it makes no difference - 100 or 16 or 5, it always times out after a few reloads.
@rubytastic did you ever find a solution to your problem? I am having a very similar issue with Rails 4 and Postgresql
@BobaFaux No. Although I haven't looked at this in the past month, the problem seems to persist, with MySQL and Postgres. I find it strange that no one has brought this issue to attention or had similar issues.
We are experiencing this issue frequently (multiple times per day, in a low throughput staging environment) without any sort of proxy between the two. Using a dedicated postgresql server which services multiple ruby applications, and a separate application server running multiple ruby applications connecting to it. We only get exceptions for this on rails 4 (currently 4.0.2), our rails 3 applications do not appear to be affected. The affected app (and other apps too), run under passenger (which forks), and also have background workers under sidekiq (multi-thread).
The connection pool size being more highly contested due to more applications may be exacerbating this issue for us, but it's a problem that rails 3 is able to handle this situation gracefully, but rails 4 is not.
Sadly I don't have any sort of solution, but I wanted to chime in that I'm also having this issue when using Sidekiq workers.
I'm having a similar problem with rufus-scheduler on Passenger + Rails 4.0.2.
Trying to use reaper_frequency: 10, but at the moment it only delays the problem.
This should seriously be fixed / looked at by one of the core members. Can we raise some attention to this issue? @aetherknight @tenderlove
@pyro2927 I believe this issue is not caused by using sidekiq in your product. It happens every time
@rubytastic, I realize it's not caused by Sidekiq, but switching to using Resque worked around the issue.
I'm running into this problem as well on rails 4.0.2 and pg 9.3
I have a long running request (rarely hit) that I think is causing other requests to queue, and eventually time out in some cases. I'm assuming those abrupt timeouts are leaving postgres connections open - and eventually leading to the issue.
I'd like to get a quick fix into my app while I take a stab at a real solution.
Sorry if this is a naive question, @mperham, but if monkey patching the old behavior isn't sufficient for sidekiq users, how did sidekiq not experience an issue in versions of rails 3 where this problem didn't exist? I was hoping to do that in the meantime. I only ask because it would be good information to have when trying to come up with a fix.
Thanks for all the excellent info on this issue :)
TL;DR: Rails 4's PostgreSQLAdapter does not properly check whether a connection has died, because connect_poll does not detect that the underlying TCP socket has closed.
Rails 4.0.0 and 4.0.1 do not properly detect whether a PostgreSQL connection is still alive before using it. A previously good connection that has been disconnected by the remote server or by an intermediate proxy will not be detected by Rails, which will result in queries failing. An example error that we ran into:
This error began occurring when we began using Rails 4.0.0 and 4.0.1 instead of Rails 3.2.15. Our database connections were being terminated by an HAProxy in our setup that terminates idle TCP connections, which Rails 3.2 had no issue with --- it would detect the dead connection and establish a new connection without issue. Rails 4 on the other hand does not. A subsequent request cycle caused a new connection to be created, which would then work as long as that connection was not killed as well.
It turns out, Rails already has a unit test to verify that a dead connection is properly detected, but it is "skipped" by default because it requires a human to manually restart the PostgreSQL database (although there are ways this could be automated). When ActiveRecord::PostgresqlConnectionTest#test_reconnection_after_actual_disconnection_with_verify is run in Rails 4.0.1, it fails, along with the tests performed after it.
The above error, as well as the failed test, is caused by PostgreSQLAdapter not properly discovering when the underlying TCP connection closes: libpq ran into an EOF, which occurs when the underlying TCP socket has closed.
When the connection pool determines whether a given database connection is still valid, it calls PostgreSQLAdapter#active?:
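Paraphrased from memory (check postgresql_adapter.rb in your Rails version for the exact code), the Rails 4.0.x check boils down to asking connect_poll about the driver's last known connection state:

# Rails 4.0.x-style check (paraphrased): no query is sent; connect_poll
# just reports the driver's last known state of the connection.
def active?
  @connection.connect_poll != PG::PGRES_POLLING_FAILED
rescue PGError
  false
end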
Ruby-pg's PGConn#connect_poll is a thin wrapper around PQconnectPoll() from libpq. However, PQconnectPoll() is meant for setting up a Postgres client connection in a non-blocking manner after the connection's socket is ready for reading/writing (as determined by select() or poll()) --- it does not actually perform any health checks on the connection. In fact, if the last known state of the connection is good, it immediately returns (see src/interfaces/libpq/fe-connect.c in the postgresql source).
This means that connect_poll is not sufficient for detecting an inactive connection that has been closed by the other end. The Rails 3.2.15 version of active? successfully detects the lost TCP connection by actually performing a database query, which exercises the connection and would fail if the connection were terminated. This code change was introduced in https://github.com/rails/rails/commit/34c7e73c1def1312e59ef1f334586ff2f668246e
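For comparison, the Rails 3.2-style check was essentially a trivial query (paraphrased; see the 3-2-stable adapter for the exact method):

# Rails 3.2-style check (paraphrased): actually exercise the connection,
# so a closed socket surfaces as a PGError and the method returns false.
def active?
  @connection.query 'SELECT 1'
  true
rescue PGError
  false
end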
To Fix
There is no easy way to determine if a TCP socket is dead other than by trying to use the connection (as was done in Rails 3.2.15 and earlier, and is also still used by the mysql adapters).
Traditionally, a remotely closed TCP connection can be detected by calling recv on the socket. If recv returns a string of length 0, then the remote side has closed the connection. If the socket is non-blocking but has not been closed, it will return an EAGAIN error. However, this approach does not work if there is data buffered on the TCP stream. I attempted to come up with a fix that uses recv_nonblock to peek ahead in the stream (a sketch of the idea is below). However, if the server closes the connection, as it does during test_reconnection_after_actual_disconnection_with_verify, it sends a final message to the client that won't have been consumed yet when socket_alive? runs, so the check still sees buffered data and reports the connection as alive.
There is actually a deeper problem here involving a time-of-check-to-time-of-use (TOCTTOU) weakness, where the connection might die after checking the connection but before trying to use it. A better solution would be to detect the dead connection while trying to use it, and to then attempt to reestablish the connection when it dies. ActiveRecord should only let the error bubble up if the reconnect attempt fails.
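For reference, the recv_nonblock peek I was experimenting with looked roughly like this (a hypothetical sketch; the plumbing to get at the PG connection's underlying IO is omitted):

# Peek one byte with MSG_PEEK so buffered data stays in the stream.
# An empty string means the peer closed the socket; "would block" means
# the socket is quiet but still open. The flaw described above: a final
# buffered message from the server makes a dead connection look alive.
require 'socket'

def socket_alive?(io)
  !io.recv_nonblock(1, Socket::MSG_PEEK).empty?
rescue IO::WaitReadable, Errno::EAGAIN, Errno::EWOULDBLOCK
  true
rescue Errno::ECONNRESET, Errno::EPIPE
  false
end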