Hi @tenderlove, your change to #active? in the mentioned commit (https://github.com/rails/rails/commit/34c7e73c1def1312e59ef1f334586ff2f668246e) 8 months ago has been causing us some major pain with recovering from dead database connections. What sort of trouble did a select 1 query cause during the connection pool's reaping?
Reaping happens in a background thread. We can't send any data over the socket because there are no locks around the file descriptor (that I know of).
Awesome bug report @aetherknight. I'm also experiencing a variant of this problem in my application - the StatementPool#connection_active? method is also broken in a similar way.
Here's the related error, which can occur when calling establish_connection:
PG::UnableToSend: SSL SYSCALL error: Connection timed out
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:517:in `exec`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:517:in `dealloc`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:500:in `block in clear`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:499:in `each_value`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:499:in `clear`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:563:in `clear_cache!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract_adapter.rb:339:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:588:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:310:in `block (2 levels) in disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:308:in `each`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:308:in `block in disconnect!`
/usr/local/rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/monitor.rb:211:in `mon_synchronize`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:306:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:564:in `remove_connection`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_handling.rb:88:in `remove_connection`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_handling.rb:45:in `establish_connection`
So I think there may be a few possible approaches to fixing this:
1. Make ActiveRecord recover from a dead connection at the point of use: rescue the failure, reconnect, and retry. This would make the #active? and #connection_active? methods not particularly meaningful anymore (since you discover a dead connection by attempting to do work on it), and reaping stale connections in the ConnectionManager would need to be based only on connections that weren't checked back in (the current logic only reaps a connection that is in use, is stale, and is no longer active).
2. Add locking around uses of the raw connection. #active? and #connection_active? would then probably need to handle timeouts on the lock, or their caller would need to handle that case.
For the latter approach, #active? and #connection_active? would then do work on the connection (eg, go back to select 1) within the synchronize/monitor/lock code (a rough sketch appears below).
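To make the second approach concrete, here is a hypothetical sketch (not actual Rails code; the class and method names are mine) of what locking around the raw connection could look like, so the reaper's liveness probe and regular queries can't interleave on the socket:

# Hypothetical sketch of approach 2: every use of the raw PG connection,
# including the reaper's liveness probe, goes through one monitor so two
# threads never write to the socket at the same time.
require 'monitor'
require 'pg'

class LockedConnection
  def initialize(raw_conn)
    @raw_conn = raw_conn
    @lock = Monitor.new
  end

  def execute(sql)
    @lock.synchronize { @raw_conn.exec(sql) }
  end

  # Safe to call from the reaper thread: the probe holds the same lock
  # as regular queries instead of racing them on the file descriptor.
  def active?
    @lock.synchronize { @raw_conn.exec('SELECT 1') }
    true
  rescue PG::Error
    false
  end
end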
Yep. The former definitely makes more sense to me.
@aetherknight I agree, we should go with solution #1. It seems the mysql adapter has a setting to do this automatically, but the pg adapter does not. We'll have to rescue exceptions wherever we call methods on the connection, then call PGConn#reset. Though we should probably do a loop with PGConn#reset_start + PGConn#reset_poll so we don't block the process.
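A rough sketch of that non-blocking reset loop, assuming the pg gem's reset_start / reset_poll / socket API (illustrative only, not a proposed patch):

# Drive PGConn#reset_start / #reset_poll with IO.select instead of the
# blocking PGConn#reset, so other threads in the process keep running.
require 'pg'

def reset_without_blocking(conn)
  conn.reset_start
  io = IO.for_fd(conn.socket)
  io.autoclose = false
  status = PG::PGRES_POLLING_WRITING
  until status == PG::PGRES_POLLING_OK || status == PG::PGRES_POLLING_FAILED
    case status
    when PG::PGRES_POLLING_READING then IO.select([io], nil, nil)
    when PG::PGRES_POLLING_WRITING then IO.select(nil, [io], nil)
    end
    status = conn.reset_poll
  end
  raise PG::ConnectionBad, conn.error_message if status == PG::PGRES_POLLING_FAILED
end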
Will another stacktrace help out here? I'm getting intermittent Ruby segfaults in the same place with a Capybara test suite: https://gist.github.com/bradleypriest/7660331
Same issue here, using Sidekiq, though I'm not sure that is the main cause of the problem. Is any fix already pending? My app totally broke and raises ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within ... after about 1 or 2 page reloads.
Is there a temporary fix for this issue? It's preventing my app from running at all, locally or in production. I tried increasing the pg pool size and the timeout; it all fails, and this problem comes back every time. The connection pool is now at 100, which still doesn't help. There are only 10 active pg connections. How is the rest of the world running their Rails apps on PG while this issue persists?
Stacktrace: https://gist.github.com/rubytastic/7786482
@rubytastic You could determine why your db connections are dying so quickly. I don't have any pgsql experience but I know MySQL has a timeout setting where the server will kill any clients that haven't used the socket in that time period.
For the short term, we monkeypatched the PostgreSQLAdapter to use the older version of #active? that performed a select 1 query, since we aren't using Ruby threading to handle multiple requests, nor are we using the reaper in its own thread.
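For anyone wanting to do something similar, the monkeypatch essentially just restores the old query-based check; a paraphrased sketch (check your Rails version's adapter for the exact method it replaces):

# Initializer sketch: revert PostgreSQLAdapter#active? to a Rails
# 3.2-style "SELECT 1" probe. Only reasonable if you aren't relying on
# the reaper running in a separate thread.
require 'active_record/connection_adapters/postgresql_adapter'

module ActiveRecord
  module ConnectionAdapters
    class PostgreSQLAdapter
      def active?
        @connection.query 'SELECT 1'
        true
      rescue PGError
        false
      end
    end
  end
end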
@aetherknight Unfortunately that's not an option for Sidekiq users.
@mperham Ah, good point.
@aetherknight Total PG newbie here, just switched from MySQL. Am I understanding correctly that there is no solid solution, nor will one be implemented in the short term? Then this is a serious issue for the PG + Sidekiq(?) combination, and I wonder why so few people have this issue.
So would switching back to MySQL be the best option for now, or is there a monkey patch that could be applied to fix the problem temporarily?
@tenderlove by my reading, solution 2 is necessary.
If you can't tell whether a connection is #active?, you either never reap it, or you regularly reap (stale) still-active connections -- and so the server could end up with many more connections than your pool limit.
But then, maybe I'm just missing something, because as far as I can see, neither of the mysql adapters is any safer from a reaper in a different thread than the SELECTive postgres version.
@matthewd I am beginning to think that both solutions are ultimately necessary:
#active? is fundamentally a bad way of determining whether any network connection is still alive, because there is still a race condition between when #active? is called and when the connection is then used (this does not apply to the reaper's use of #active?, but it does to other uses such as in #verify!). This can only be solved by solution 1.
Also, if I understand the current reaper code correctly, it will only reap a connection if it is currently in use, it has been longer than dead_connection_timeout since it was checked out, and #active? reports that the connection is dead. That means that it will not reap (close and remove from the connection pool) a dangling connection from a dead thread until the OS, the network, or the remote server kills the connection --- which can take a minute or more of inactivity on the connection. However, this may be the only way to determine whether the connection is dangling without pulling a connection out from under a long-lived thread (as may be the case with sidekiq jobs). Which sounds like we also need solution 2, in order to allow the reaper to actually detect dead connections.
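Paraphrasing that reaping rule as code (a hypothetical helper, not the actual connection_pool.rb internals):

# A connection is only reaped when all three conditions hold: it is
# still checked out, it went stale dead_connection_timeout seconds ago,
# and #active? says it is dead.
def connections_to_reap(connections, dead_connection_timeout)
  stale_cutoff = Time.now - dead_connection_timeout
  connections.select do |conn|
    conn.in_use? && conn.last_use < stale_cutoff && !conn.active?
  end
end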
All this stems from the fact that the connection is not self-repairing, yes? Dalli and Redis client connections are pooled with my connection_pool gem all the time. connection_pool documents that all connections should be self-repairing so that the pool doesn't need to worry about reaping or repairing.
@mperham not just self repairing. If a thread checks out a connection, then dies before checking the connection back in, what happens? This is why the reaper checks on "long lived" connections.
@tenderlove If a thread somehow does not check in connections, they'd quickly exhaust. What's wrong with that? connection_pool only documents block usage, with_connection {|conn| ... }, to ensure it's automatically checked back in. Which AR API can leak connections? You talking about conn = Model.connection?
@mperham yes, that will leak a connection if done inside a Thread.new { } (and you don't specifically check it back in)
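For illustration, a minimal sketch of the leaky pattern versus the block form (assuming a typical AR model class named Model):

# Leaks: the implicit checkout in Model.connection is tied to this
# thread, and nothing checks the connection back in when the thread dies.
Thread.new do
  conn = Model.connection
  conn.execute('SELECT 1')
end

# Doesn't leak: the block form checks the connection back in when the
# block exits, even if the thread dies afterwards.
Thread.new do
  ActiveRecord::Base.connection_pool.with_connection do |conn|
    conn.execute('SELECT 1')
  end
end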
Yep, that form of API should be deprecated in preference for with_connection in Rails 5 IMO. I know that doesn't help today though. :-/
@aetherknight Yeah, I could definitely get behind an idea of arranging automatic reconnection iff you're performing a SELECT (or BEGIN), and there isn't [already] supposed to be a transaction open.
If you're trying to send an INSERT, say, you'd want to be very sure the exception meant "driver failed to send the query", and not "driver lost connection while waiting to hear what happened", before silently retrying.
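To make that caveat concrete, here's a hypothetical sketch (the names and the safety test are mine, not anything in Rails): only reconnect and retry when the caller knows no transaction is supposed to be open and the statement is a plain SELECT or BEGIN:

# Retry only when it's provably safe: a SELECT/BEGIN with no transaction
# supposed to be open. Anything else (e.g. an INSERT) re-raises, because
# we can't tell "failed to send" apart from "lost connection mid-flight".
require 'pg'

def exec_with_cautious_retry(conn, sql, in_transaction: false)
  conn.exec(sql)
rescue PG::UnableToSend, PG::ConnectionBad
  raise if in_transaction || sql !~ /\A\s*(SELECT|BEGIN)\b/i
  conn.reset   # blocking reset for brevity; a real fix would poll
  conn.exec(sql)
end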
@tenderlove can/does the pool keep track of which thread checked out the connection? We can determine pretty easily that the thread's gone away... though whether you trust that it's safe to return the connection to the pool, vs just tell it to close, is another matter.
Or is there a worry about code that goes out of its way to check out a connection in one thread, hand it to another, then have the first thread die?
@matthewd The pool does keep track of which thread checks out the connection, but you shouldn't rely on it (it uses the object id of the thread). We used to loop through every thread in the system, check its status, then check the pool, but I'd prefer a solution that doesn't depend on doing that. We could probably stuff the thread in a hash or something when it checks out, then just have the reaper check connections for those threads.
We shouldn't worry about one thread passing the connection to another. I don't think we support that (I think it's documented too).
Also, we should probably split this into two issues: one for self-repairing connections, and one for dealing with long-lived and possibly dead threads. I tried adding code for self-repairing connections, but TBH I can't repro this issue and I don't want to rescue from just any old PG exception.
@tenderlove I did a thing: https://github.com/matthewd/rails/compare/master...pg_verify
Does that look at all sane?
As far as I can see, those two tests both need to pass for the reaper and checkout_and_verify to both function correctly... and no less-invasive solution comes to mind.
It looks sane, but I think you can make the locks more granular, and possibly remove some locks. For example, the locks should probably be inside the "log" calls.
The connection shouldn't be shared among threads (except the reaper). Does the reaper impact the statement cache? I'm not sure we need that lock.
Shouldn't there be a lock in "active?"
Looks good!
I wrapped the logging so we weren't claiming to be running a query when we were actually waiting, but that's easy to swap if you prefer.
For the connection management methods, I made the locks wider out of principle -- it's implausible to ever affect a SELECT 1, but for purity, nothing ought to see a half-configured connection.
Deleting from the statement cache sends a DEALLOCATE to the raw connection (as can writing, because eviction).
Yes, I managed to lose the most important lock while splitting the commits. How about I fix that? :blush:
Anyway, if you're happy this is a reasonable option, I'll turn it into a pull request shortly, and we can continue discussion over there.
@tenderlove how have you attempted to repro this? I'd try setting up something that closes the connection on the DB side some time after a thread has established the connection. We're using PGBouncer between Rails 4 and PG, and are getting the PG::UnableToSend exceptions regularly in threaded workers that sit inactive for longer than PGBouncer's timeout. I'm sure the OP's HAProxy setup provides a similar situation.
Actually, there's an even easier way to reproduce in Rails 4.0.2:
2.0.0-p353> ActiveRecord::Base.connection.execute 'SELECT 1'
=> #<PG::Result ...
# kill and restart Postgres
2.0.0-p353> ActiveRecord::Base.connection.active?
=> true
2.0.0-p353> ActiveRecord::Base.connection.execute 'SELECT 1'
PG::UnableToSend: : SELECT 1
I'm kind of surprised at the lack of attention (from users) this issue has gotten. It seems like this could be a serious issue for a lot of users, no?
No, most people don't put a proxy between their app server and db server which kills idle connections.
But, if I understand correctly, this issue is present when you restart your database server, for example. Right now with Rails 3.2 you don't need to restart your application afterwards (which is expensive); with this issue in Rails 4 you cannot do the same. This, to me, seems like the most obvious way that most people would run into it.
@mperham most people needing to scale ActiveRecord + Postgres do exactly that. Unlike MySQL, Postgres can't handle more than a few hundred simultaneous connections (process per connection). If you have any sort of traffic, a proxy to multiplex these connections is crucial and recommended by Postgres admins everywhere (including Heroku).
I imagine the reason this issue hasn't been complained about is because most of the teams using ActiveRecord + Postgres at scale haven't moved to 4.0 yet.
Ok, we run MySQL and haven't restarted our database in the last year so I'm obviously pretty far from this issue. Thanks for the info, very informative.
@tanelsuurhans Indeed; error fallout from a DB restart (for a minor version update) is precisely what brought me here.
Technically, you don't need to restart your application... you'll only get one error per newly-dead connection, so depending on your server setup, you could contrive a single destined-to-fail request to each worker process, or something.
Similarly, for most of my apps (largely because the DB restart was in the middle of the night), the errors were actually all 'absorbed' by our normal site monitoring, and everything had fixed itself before a real user came along.
@kainosnoema For a number of (fairly common) scaling profiles, a connection proxy is absolutely The Right Thing. But I'm not so sure that most people do the "which kills idle [client] connections" part.
Having the same issue with mysql and postgres since upgrading to rails 4.x (tested 4.0.1 and 4.0.2); it always times out with:
could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)
Just plain OS X + rails + mysql or postgres, no fancy server setup. In development mode using Puma. No complex queries; reloading any page a few times makes this occur.
@rubytastic It sounds like you need to increase your database connection pool size. Does not sound like it's related to this issue.
@tanelsuurhans I have the pool set to 16, and it makes no difference - 100 or 16 or 5, it always times out after a few reloads.
@rubytastic did you ever find a solution to your problem? I am having a very similar issue with Rails 4 and Postgresql
@BobaFaux No. Although I haven't looked at this in the past month, the problem seems to persist, with MySQL and Postgres. I find it strange that no one has brought this issue to attention or had similar issues.
We are experiencing this issue frequently (multiple times per day, in a low throughput staging environment) without any sort of proxy between the two. Using a dedicated postgresql server which services multiple ruby applications, and a separate application server running multiple ruby applications connecting to it. We only get exceptions for this on rails 4 (currently 4.0.2), our rails 3 applications do not appear to be affected. The affected app (and other apps too), run under passenger (which forks), and also have background workers under sidekiq (multi-thread).
The connection pool size being more highly contested due to more applications may be exacerbating this issue for us, but it's a problem that rails 3 is able to handle this situation gracefully, but rails 4 is not.
Sadly I don't have any sort of solution, but I wanted to chime in that I'm also having this issue when using Sidekiq workers.
I'm having a similar problem with rufus-scheduler on Passenger + Rails 4.0.2.
Trying to use reaper_frequency: 10, but at the moment it only delays the problem.
This should seriously be fixed / looked at by one of the core members. Can we raise some attention to this issue? @aetherknight @tenderlove
@pyro2927 I believe this issue is not caused by using sidekiq in your product. It happens every time
@rubytastic, I realize it's not caused by Sidekiq, but switching to using Resque worked around the issue.
I'm running into this problem as well on rails 4.0.2 and pg 9.3
I have a long running request (rarely hit) that I think is causing other requests to queue, and eventually time out in some cases. I'm assuming those abrupt timeouts are leaving postgres connections open - and eventually leading to the issue.
I'd like to get a quick fix into my app while I take a stab at a real solution.
Sorry if this is a naive question, @mperham, but if monkey patching the old behavior isn't sufficient for sidekiq users, how did sidekiq not experience an issue in versions of rails 3 where this problem didn't exist? I was hoping to do that in the meantime. I only ask because it would be good information to have when trying to come up with a fix.
Thanks for all the excellent info on this issue :)
TL;DR: Rails 4's PostgreSQLAdapter does not properly check whether a connection has died, because connect_poll does not detect that the underlying TCP socket has closed.
Rails 4.0.0 and 4.0.1 do not properly detect whether a PostgreSQL connection is still alive before using it. A previously good connection that has been disconnected by the remote server or by an intermediate proxy will not be detected by Rails, which will result in queries failing. An example error that we ran into:
This error began occurring when we began using Rails 4.0.0 and 4.0.1 instead of Rails 3.2.15. Our database connections were being terminated by an HAProxy in our setup that terminates idle TCP connections, which Rails 3.2 had no issue with --- it would detect the dead connection and establish a new connection without issue. Rails 4 on the other hand does not. A subsequent request cycle caused a new connection to be created, which would then work as long as that connection was not killed as well.
It turns out, Rails already has a unit test to verify that a dead connection is properly detected, but it is "skipped" by default because it requires a human to manually restart the PostgreSQL database (although there are ways this could be automated). When ActiveRecord::PostgresqlConnectionTest#test_reconnection_after_actual_disconnection_with_verify is run in Rails 4.0.1, it fails, along with the tests performed after it.
The above error, as well as the failed test, is caused by PostgreSQLAdapter not properly discovering when the underlying TCP connection closes: libpq ran into an EOF, which occurs when the underlying TCP socket has closed.
When the connection pool determines whether a given database connection is still valid, it calls PostgreSQLAdapter#active?:
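Paraphrased from memory (check postgresql_adapter.rb in your Rails version for the exact code), the Rails 4.0.x check boils down to asking connect_poll about the driver's last known connection state:

# Rails 4.0.x-style check (paraphrased): no query is sent; connect_poll
# just reports the driver's last known state of the connection.
def active?
  @connection.connect_poll != PG::PGRES_POLLING_FAILED
rescue PGError
  false
end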
Ruby-pg's PGConn#connect_poll is a thin wrapper around PQconnectPoll() from libpq. However, PQconnectPoll() is meant for setting up a Postgres client connection in a non-blocking manner after the connection's socket is ready for reading/writing (as determined by select() or poll()) --- it does not actually perform any health checks on the connection. In fact, if the last known state of the connection is good, it immediately returns (see src/interfaces/libpq/fe-connect.c in the postgresql source).
This means that connect_poll is not sufficient for detecting an inactive connection that has been closed by the other end. The Rails 3.2.15 version of active? successfully detects the lost TCP connection by actually performing a database query, which exercises the connection and would fail if the connection were terminated. This code change was introduced in https://github.com/rails/rails/commit/34c7e73c1def1312e59ef1f334586ff2f668246e
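For comparison, the Rails 3.2-style check was essentially a trivial query (paraphrased; see the 3-2-stable adapter for the exact method):

# Rails 3.2-style check (paraphrased): actually exercise the connection,
# so a closed socket surfaces as a PGError and the method returns false.
def active?
  @connection.query 'SELECT 1'
  true
rescue PGError
  false
end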
To Fix
There is no easy way to determine if a TCP socket is dead other than by trying to use the connection (as was done in Rails 3.2.15 and earlier, and is also still used by the mysql adapters).
Traditionally, a remotely closed TCP connection can be detected by calling recv on the socket. If recv returns a string of length 0, then the remote side has closed the connection. If the socket is non-blocking but has not been closed, it will return an EAGAIN error. However, this approach does not work if there is data buffered on the TCP stream. I attempted to come up with a fix that uses recv_nonblock to peek ahead in the stream (a sketch of the idea is below). However, if the server closes the connection, as it does during test_reconnection_after_actual_disconnection_with_verify, it sends a final message to the client that won't have been consumed yet when socket_alive? runs, so the check still sees buffered data and reports the connection as alive.
There is actually a deeper problem here involving a time-of-check-to-time-of-use (TOCTTOU) weakness, where the connection might die after checking the connection but before trying to use it. A better solution would be to detect the dead connection while trying to use it, and to then attempt to reestablish the connection when it dies. ActiveRecord should only let the error bubble up if the reconnect attempt fails.
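For reference, the recv_nonblock peek I was experimenting with looked roughly like this (a hypothetical sketch; the plumbing to get at the PG connection's underlying IO is omitted):

# Peek one byte with MSG_PEEK so buffered data stays in the stream.
# An empty string means the peer closed the socket; "would block" means
# the socket is quiet but still open. The flaw described above: a final
# buffered message from the server makes a dead connection look alive.
require 'socket'

def socket_alive?(io)
  !io.recv_nonblock(1, Socket::MSG_PEEK).empty?
rescue IO::WaitReadable, Errno::EAGAIN, Errno::EWOULDBLOCK
  true
rescue Errno::ECONNRESET, Errno::EPIPE
  false
end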