myndzi / pool2


Pool destroyed permanently after temporary network issue #28

Closed: jdleesmiller closed this issue 8 years ago

jdleesmiller commented 8 years ago

This issue sort of sits between knex and pool2, but I think pool2 could do some things to make it easier for knex (and other callers) to handle this situation, so I figured I'd open the issue here. It's also similar to #12, but it's not random, so I don't think it necessarily fits there.

Our setup: node 4.4.1, pool2 1.3.4, knex 0.10.0.

We had a failure in production today after the following series of events:

  1. A web server had a temporary network issue. Attempts to open new database connections started hanging.
  2. pool2 was running in its default configuration (via knex) with bailAfter == 0, so it destroyed itself after the first acquire timed out.
  3. Knex logged the error but didn't crash the process (this was on 0.10, but it looks like this behaviour is still the same on knex master).
  4. The network came back. The server tried to process requests, but its database connection pool had been destroyed, so all requests failed with "Pool is destroyed" errors.

I think two possible solutions are:

  1. Set bailAfter to Infinity by default in knex or (as a workaround) in the knexfile's pool config; see the sketch after this list. That way, those requests would have failed to acquire connections, but the pool would not have been destroyed, and eventually it would have been able to acquire connections again.
  2. Make knex crash on pool2 errors. Had knex crashed instead of just logging the error and carrying on with a destroyed pool, the process would have been restarted, and eventually it would have come up in a working state when the network came back.

    At present, pool2 emits quite a few different types of errors, and not all of them require a crash, so knex would need some way to detect that the pool had destroyed itself and that the process either had to crash, or the pool had to be reinitialized. I don't see anything in knex that does this in 0.10.0 or on master.
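
For reference, here's a minimal sketch of the workaround in (1). It assumes knex passes the pool config straight through to pool2; the client and connection values are placeholders for whatever your app actually uses.

// knexfile.js -- sketch only; client/connection are placeholders
module.exports = {
  client: 'pg',
  connection: process.env.DATABASE_URL,
  pool: {
    min: 2,
    max: 10,
    bailAfter: Infinity // don't destroy the pool when startup acquires fail
  }
}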

Should pool2 make 'pool destroyed' a special type of event? Or should knex or pool2 set bailAfter to Infinity by default? Or maybe something else? I would be interested to get your views.

Example

To test the idea that setting bailAfter to Infinity avoids destroying the pool, I wrote the following test script. It simulates a situation in which the first acquire times out, but subsequent acquires finish in a timely manner. If bailAfter == 0, we get the problem we had today; if bailAfter == Infinity, it eventually recovers.

pool2_test.js:

var debug = require('debug')('test')
var Pool2 = require('pool2')

// We will make the first `acquire` take 2s, which is over the 1s timeout below.
// After that, it will return the resource quickly.
var firstAcquire = true
function slowFastAcquire (cb) {
  var delay = (firstAcquire ? 2 : 0) * 1000
  debug('acquiring with ' + delay + 'ms delay')
  setTimeout(function () {
    cb(null, 'resource')
  }, delay)
  firstAcquire = false
}

var pool = new Pool2({
  acquire: slowFastAcquire,
  acquireTimeout: 1 * 1000,
  dispose: function (resource, cb) { cb() },
  bailAfter: 0 // or Infinity
})

// Log the error but don't crash.
pool.on('error', function (err) {
  debug('pool error: ' + err)
})

// Make first request.
// If bailAfter == 0, this will time out, and the pool will be destroyed.
// If bailAfter == Infinity, this will fail, retry the acquire, and then succeed.
pool.acquire(function (err, resource) {
  if (err) {
    debug('1: failed ' + err)
    return
  }
  debug('1: acquired ' + resource)
})

// Make second request a bit later.
// If bailAfter == 0, the pool will have already been destroyed.
// If bailAfter == Infinity, this will succeed.
setTimeout(function () {
  pool.acquire(function (err, resource) {
    if (err) {
      debug('2: failed ' + err)
      return
    }
    debug('2: acquired ' + resource)
  })
}, 3 * 1000)

If I run it with bailAfter == 0 (the default), the pool is destroyed:

$ DEBUG=* node pool2_test.js
  pool2 New ResourceRequest (id=0, timeout=Infinity) +0ms
  pool2 Growing pool: no resource to serve request (p=1, tba=0, tbt=0, max=10) +8ms
  pool2 Attempting to acquire resource (cur=0, ac=0) +6ms
  test acquiring with 2000ms delay +2ms
  pool2 Timed out acquiring resource +1s
  pool2 Couldn't allocate new resource: Timed out acquiring resource +1ms
  pool2 Destroying pool: unable to aquire a resource within 0s +0ms
  pool2 ResourceRequest: reject (id=0) +2ms
  pool2 ResourceRequest: fulfilling with error: Pool was destroyed (id=0) +0ms
  pool2 ResourceRequest: clearing timeout (id=0) +1ms
  test 1: failed Error: Pool was destroyed +1ms
  test pool error: Error: Timed out acquiring resource +0ms
  pool2 Attempting to gracefully clean up late-arrived resource (id=undefined) +997ms
  pool2 Attempting to gracefully remove resource (id=undefined) +1ms
  test 2: failed Error: Pool is destroyed +987ms

With bailAfter changed to Infinity, the pool eventually recovers:

$ DEBUG=* node pool2_test.js
  pool2 New ResourceRequest (id=0, timeout=Infinity) +0ms
  pool2 Growing pool: no resource to serve request (p=1, tba=0, tbt=0, max=10) +6ms
  pool2 Attempting to acquire resource (cur=0, ac=0) +3ms
  test acquiring with 2000ms delay +2ms
  pool2 Timed out acquiring resource +1s
  pool2 Couldn't allocate new resource: Timed out acquiring resource +1ms
  pool2 Attempting to gracefully clean up late-arrived resource (id=undefined) +1s
  pool2 Attempting to gracefully remove resource (id=undefined) +1ms
  pool2 New ResourceRequest (id=1, timeout=Infinity) +991ms
  pool2 Growing pool: no resource to serve request (p=2, tba=0, tbt=0, max=10) +3ms
  pool2 Attempting to acquire resource (cur=0, ac=0) +2ms
  test acquiring with 0ms delay +2ms
  pool2 Successfully allocated new resource (cur=0, ac=0, id=undefined) +3ms
  pool2 Reserving request for resource (id=undefined, req=0) +2ms
  pool2 Pinging resource (id=undefined) +2ms
  pool2 Allocating resource to request (id=undefined, req=0); waited 3.021s +1ms
  pool2 ResourceRequest: resolve (id=0) +2ms
  pool2 ResourceRequest: fulfilling with resource (id=0) +1ms
  pool2 ResourceRequest: clearing timeout (id=0) +1ms
  test 1: acquired resource +2ms
  pool2 reap (cur=1, av=0) +7s
  pool2 Growing pool: no resource to serve request (p=1, tba=0, tbt=1, max=10) +1ms
  pool2 Attempting to acquire resource (cur=1, ac=0) +0ms
  test acquiring with 0ms delay +0ms
  pool2 Successfully allocated new resource (cur=1, ac=0, id=undefined) +1ms
  pool2 Reserving request for resource (id=undefined, req=1) +1ms
  pool2 Pinging resource (id=undefined) +0ms
  pool2 Allocating resource to request (id=undefined, req=1); waited 7.003s +0ms
  pool2 ResourceRequest: resolve (id=1) +0ms
  pool2 ResourceRequest: fulfilling with resource (id=1) +0ms
  pool2 ResourceRequest: clearing timeout (id=1) +0ms
  test 2: acquired resource +0ms
  pool2 reap (cur=1, av=0) +10s
  pool2 reap (cur=1, av=0) +10s

Finally, thanks for all your work on making and maintaining this great library.

myndzi commented 8 years ago

Hi, thanks for your feedback.

The bailAfter default is actually a backwards-compatibility constraint: changing it now would potentially be a breaking change, unfortunately. Its original intent was to notify of configuration failures instead of silently retrying for eternity (silent errors are the worst errors).

Note that it only errors on the FIRST failure, not any failure. It sounds like perhaps your application was restarted, though, which would be problematic for a temporary outage. I do agree knex should probably crash on a pool-destroyed error. I believe an error is emitted and maybe not caught, but I could certainly make it easier to recognize a fatal error so it can be dealt with accordingly. The best solution would probably be a specific error type or code that can be checked, as opposed to a special event.
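
To illustrate what a checkable code could look like on the consumer side, here's a hypothetical sketch; pool2 does not currently set any such code, and 'EPOOLDESTROYED' is invented for illustration:

pool.on('error', function (err) {
  if (err.code === 'EPOOLDESTROYED') { // hypothetical code, not a real pool2 API
    // fatal: the pool is gone, so exit and let a supervisor restart the process
    process.exit(1)
  } else {
    // other errors can just be logged
    console.error('pool error: ' + err)
  }
})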

I'm not entirely certain whether there's an easy way for me to address the problem without changes in knex, and I'm not sure knex is actively releasing new code at the moment. Knex does pass the 'pool' config data through, though, so you should be able to specify a bailAfter option in your knex config to work around the problem.

I think the problem lies in catching the error event and then not reacting to it; an unhandled 'error' event crashes a Node process. I don't think it would be right for pool2 to call process.exit() itself, and while I could provide a callback or something to let a user do this, they would still have to wire it up manually. That is slightly better than digging into knex internals to find the pool instance and bind event handlers to it, though. Perhaps also open an issue on knex and link to this, and let's see if there are any plans for a new knex release any time soon?
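
For completeness, a sketch of that manual workaround. It assumes knex 0.10 exposes the pool2 instance at knex.client.pool, which is an internal, undocumented path that may change between versions:

var knex = require('knex')(require('./knexfile'))

// knex.client.pool is internal and undocumented; this may break across versions
if (knex.client && knex.client.pool) {
  knex.client.pool.on('error', function (err) {
    // treat any pool error as fatal; a process supervisor should restart us
    console.error('fatal pool error: ' + err)
    process.exit(1)
  })
}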

jdleesmiller commented 8 years ago

Thanks for getting back to me so quickly!

Note that it only errors on the FIRST failure, not any failure. It sounds like perhaps your application was restarted, though, which would be problematic for a temporary outage.

Good shout. I've rechecked the logs, and you are right: the process did crash due to another issue related to the network problem, then it was restarted while the network was still down, and then the acquire timeout happened before the new process's pool got to the live state.

Its original intent was to notify of configuration failures instead of silently retrying for eternity (silent errors are the worst errors).

Yes, that's a good point. I wonder whether there would be other ways of getting that error information out? Maybe something like: if we have failed to acquire any resources to satisfy a request, store the last error from the pool's acquire function and give it to the requester? It seems like that would require some pretty significant changes (and it may not be a good idea for other reasons; I'm not very familiar with pool2).
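
As a rough sketch of what I mean, expressed as a wrapper around the user-supplied acquire function rather than a change to pool2's internals (note that errors generated inside pool2 itself, such as acquire timeouts, would not pass through this wrapper):

var lastAcquireError = null

// wrap an acquire function so the most recent error it reports is remembered
function rememberingAcquire (realAcquire) {
  return function (cb) {
    realAcquire(function (err, resource) {
      if (err) lastAcquireError = err
      cb(err, resource)
    })
  }
}

// when a request ultimately fails, surface the remembered underlying cause
pool.acquire(function (err, resource) {
  if (err && lastAcquireError) {
    err.message += ' (last acquire error: ' + lastAcquireError.message + ')'
  }
  // ...
})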

Perhaps also open an issue on knex and link to this, and let's see if there are any plans for a new knex release any time soon?

OK, I have opened https://github.com/tgriesser/knex/issues/1634

myndzi commented 8 years ago

A network issue would cause repeated failures, so there's no need to store the error: the same error would occur on a retry. I haven't really had much feedback either way on this 'feature', so it might be worth changing in a future version, e.g. a 2.0. As you mentioned, most production applications should have something that restarts the program if it crashes, and that is how this is meant to interact: it should fail loudly when the database can't connect, not run silently.

myndzi commented 8 years ago

So, it looks like knex is removing support for pool2 in favor of another library. As a result, I probably won't be doing much more with this library. Hopefully the new setup works around your problem; I don't believe node-pool behaves the same way as pool2 here, so the problem shouldn't exist there. If I had been aware of a strong need to change this behavior, I probably would have just published a 2.0 and done away with it, but this came as a bit of a surprise to me. Either way, I'm closing this issue, since there's no easy path forward from here that doesn't break semver. Feel free to reopen if you think there's something further I can address for you, though!

jdleesmiller commented 8 years ago

Thanks for letting me know, @myndzi . Onwards and upwards!

tgriesser commented 8 years ago

@myndzi hey - sorry I wasn't more responsive on the issues. Honestly, the change came down to the fact that I haven't had time to dig into pool2's internals as much as I wanted, to see what all the reported issues were about and how best to address them.

In knex 1.0, I'm likely going to end up using the built-in pools that come bundled with the different database libraries, and then show an example of how a user could define a pool implementation like pool2 themselves and make good use of its more advanced features, like clustering and capabilities.

myndzi commented 8 years ago

Yeah, it's no worries. I don't expect you to debug pool2 any more than I can feasibly dig into knex at the moment :) I just hope the problem goes away (and not in the sense that it still exists but is silent). There's been enough noise that it seems that something (besides user error) is a problem.

In retrospect, the idea of "error when I've never established a connection" may have been poorly chosen, but I hope it helped more people avoid the frustration of "it's running fine but not working, wtf is wrong?" than it caused.

tgriesser commented 8 years ago

I just hope the problem goes away (and not in the sense that it still exists but is silent). There's been enough noise that it seems that something (besides user error) is a problem.

Yeah, I think the main issue was on my end, in that knex was intercepting any error events and not doing anything to recover from them. There seemed to be quite a few errors that could potentially be emitted, and I didn't dig in enough to see what they were or where they should be intercepted. I think I'll probably try to add something to knex that addresses the original "error when I've never established a connection" concept, to help people who have trouble with their connection config.