Hi, thanks for your feedback.
The `bailAfter` configuration is actually a backwards-compatibility issue: changing its default would be a potentially breaking change to push, unfortunately. Its original intent was to notify of configuration failures instead of silently retrying for eternity (silent errors are the worst errors).
Note that it only errors on the FIRST failure, not any failure. It sounds like perhaps your application was restarted, though, which would be problematic for a temporary outage. I do agree knex should probably crash on a pool-destroyed error. I believe an error is emitted and maybe not caught, but I could certainly make it easier to recognize a fatal error so it can be dealt with accordingly. The best solution would probably be a specific error type or code that can be checked, as opposed to a special event.
I'm not entirely certain whether there's an easy way for me to address the problem without changes in knex, and I'm not sure that knex is actively releasing new code at the moment. Knex does pass the 'pool' config data through, though, so you should be able to specify a `bailAfter` option in your Knex config to help deal with your problem.
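For what it's worth, a minimal sketch of what that could look like; the client and connection values below are placeholders, and the relevant part is just the `pool` object, which knex 0.10 hands through to pool2:

```js
// knexfile.js -- sketch only; client/connection values are placeholders.
// knex 0.10 passes the `pool` object through to pool2, so a bailAfter
// value set here should reach the pool.
module.exports = {
  client: 'pg',
  connection: {
    host: '127.0.0.1',
    database: 'mydb'
  },
  pool: {
    min: 2,
    max: 10,
    bailAfter: Infinity // don't destroy the pool if the first acquire fails
  }
};
```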
I think the problem lies in catching an 'error' event and then not reacting to it (an unhandled 'error' event is a crashable offense in Node). I don't think it would be right to manually process.exit(), and while I could provide a callback or something that lets a user do this themselves, they would still have to do it manually. That is slightly better than digging into knex internals to find the pool instance and bind event handlers to it, though. Perhaps also open an issue on knex and link to this, and let's see if there are any plans for a new knex release any time soon?
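For reference, the "dig into knex internals" workaround mentioned above would look roughly like this from application code. This is only a sketch: it assumes the pool2 instance is reachable at `knex.client.pool`, which is an undocumented internal detail of knex 0.10 and could change.

```js
// Sketch of the workaround: bind an error handler to the pool2 instance
// that knex creates, and crash on pool errors so a process supervisor
// can restart the app in a clean state.
// ASSUMPTION: knex 0.10 keeps its pool2 instance at knex.client.pool.
var knex = require('knex')(require('./knexfile'));

var pool = knex.client.pool;
if (pool && typeof pool.on === 'function') {
  pool.on('error', function (err) {
    console.error('Fatal pool error:', err);
    process.exit(1); // let pm2/systemd/etc. restart the process
  });
}
```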
Thanks for getting back to me so quickly!
> Note that it only errors on the FIRST failure, not any failure. It sounds like perhaps your application was restarted, though, which would be problematic for a temporary outage.
Good shout. I've rechecked the logs, and you are right: the process did crash due to another issue related to the network problem, then it was restarted while the network was still down, and then the acquire timeout happened before the new process's pool got to the `live` state.
> Its original intent was to notify of configuration failures instead of silently retrying for eternity (silent errors are the worst errors).
Yes, that's a good point. I wonder whether there would be other ways of getting that error information out? Maybe something like: if we have failed to acquire any resources to satisfy a request, store the last error from the pool's `acquire` function and give it to the requester? It seems like it would require some pretty significant changes (and it may not be a good idea for other reasons; I'm not very familiar with pool2).
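Just to make that suggestion concrete, something along these lines inside the pool; this is purely a sketch of the idea, not pool2's actual internals, and names like `lastAcquireError` are invented:

```js
// Sketch of the suggestion only -- not pool2's real internals.
// Remember the most recent failure from the user-supplied acquire function...
function onAcquireFailed(pool, err) {
  pool.lastAcquireError = err; // invented field
}

// ...and attach it when a waiting request times out, so the requester can
// see *why* no resource was available instead of just "timed out".
function onRequestTimeout(pool, request) {
  var err = new Error('Timed out waiting for a pool resource');
  err.cause = pool.lastAcquireError; // invented propagation
  request.callback(err);
}
```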
> Perhaps also open an issue on knex and link to this, and let's see if there are any plans for a new knex release any time soon?
OK, I have opened https://github.com/tgriesser/knex/issues/1634
A network issue would cause a failure repeatedly, so there's no need to store the error -- the same error would occur on a retry. I haven't really had a lot of feedback either way on this 'feature' so it might be worth changing in a future version, e.g. 2.0. As you mentioned, most prod applications should have something that restarts the program if it crashes, and that is how this is meant to interact. It should fail noisily and loudly when the database can't connect, not run silently.
So, it looks like knex is removing support for pool2 in favor of another library. As a result, I probably won't be doing much more with this library. Hopefully the new setup works around your problem. I don't believe node-pool behaves the same way as pool2 here, so your problem shouldn't exist. If I had been aware of a strong need to change this behavior, I probably would have just published a 2.0 and done away with it, but this came as a bit of a surprise to me... either way, closing this issue as there's no easy forward path from here that doesn't break semver. Feel free to reopen if you think there's something further I can address for you, though!
Thanks for letting me know, @myndzi. Onwards and upwards!
@myndzi hey - sorry I wasn't more responsive on the issues. Honestly the change just came down to the fact that I haven't had time to debug the internals of pool2 as much as I've wanted, to see what all the issues coming up were about and how best to address them.
In knex 1.0, I'm likely going to end up utilizing the built-in pools that come bundled with different database libraries, and then show an example of how a user could choose to define a pool implementation like pool2 themselves and make good use of the more advanced features like clustering & capabilities.
Yeah, it's no worries. I don't expect you to debug pool2 any more than I can feasibly dig into knex at the moment :) I just hope the problem goes away (and not in the sense that it still exists but is silent). There's been enough noise that it seems that something (besides user error) is a problem.
In retrospect the idea of "error when I've never established a connection" may have been poorly chosen, but I hope it helped more people avoid the frustration of "it's running fine but not working, wtf is wrong?" than it caused frustration of its own.
> I just hope the problem goes away (and not in the sense that it still exists but is silent). There's been enough noise that it seems that something (besides user error) is a problem.
Yeah, I think the main issue was on my end, in that knex was intercepting any error events and not doing anything to recover from them, but it seemed there were quite a few errors that could potentially be emitted, and I didn't dig in enough to see what they are / where they should be intercepted. I think I'll probably try to add something that addresses the original "error when I've never established a connection" concept into knex, to help people who have trouble with their connection config.
This issue sort of sits between knex and pool2, but I think pool2 could do some things to make it easier for knex (and other callers) to handle this situation, so I figured I'd open the issue here. It's also similar to #12, but it's not random, so I don't think it necessarily fits there.
Our setup: node 4.4.1, pool2 1.3.4, knex 0.10.0.
We had a failure in production today after the following series of events:
- The pool had `bailAfter == 0`, so it destroyed itself after the first `acquire` timed out.

I think two possible solutions are:

1. Set `bailAfter` to `Infinity` by default in knex or (as a workaround) in the `knexfile`'s pool config. That way, those requests would have failed to acquire connections, but the pool would not have been destroyed, and eventually it would have been able to acquire connections again.
2. Make knex crash on pool2 errors. Had knex crashed instead of just logging the error and carrying on with a destroyed pool, the process would have been restarted, and eventually it would have come up in a working state when the network came back.
At present, pool2 emits quite a few different types of errors, and not all of them require a crash, so knex would need some way to detect that the pool had destroyed itself and that the process either had to crash, or the pool had to be reinitialized. I don't see anything in knex that does this in 0.10.0 or on master.
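For illustration only, the kind of check a caller could make if pool2 tagged its fatal errors; the `POOL_DESTROYED` code below is invented, which is essentially what would need to exist:

```js
// Hypothetical: if pool2 marked the "pool destroyed itself" error with a
// distinguishable code, knex (or any caller) could react appropriately.
pool.on('error', function (err) {
  if (err.code === 'POOL_DESTROYED') { // invented code, for illustration
    // The pool is unusable: crash (and let a supervisor restart us) or
    // tear the pool down and build a new one.
    console.error('Connection pool destroyed:', err);
    process.exit(1);
  } else {
    // Other errors (e.g. a single failed acquire) may be transient.
    console.error('Pool error:', err);
  }
});
```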
Should pool2 make 'pool destroyed' a special type of event? Or should knex or pool2 set `bailAfter` to `Infinity` by default? Or maybe something else? I would be interested to get your views.

Example

To test the idea that setting `bailAfter` to `Infinity` avoids destroying the pool, I wrote the following test script. It simulates a situation in which the first `acquire` times out, but subsequent `acquire`s finish in a timely manner. If `bailAfter == 0`, we get the problem we had today; if `bailAfter == Infinity`, it eventually recovers. `pool2_test.js`:
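The original script is not reproduced here, but a sketch of the behaviour it exercised might look like the following. Treat the option names and exact pool2 semantics as approximate; the intent is just to make the first `acquire` attempt time out while later attempts succeed quickly.

```js
// pool2_test.js -- sketch reconstructing the scenario described above.
// The first acquire never calls back, so it hits the acquire timeout;
// every later acquire succeeds quickly. With bailAfter: 0 the pool
// destroys itself after that first failure; with bailAfter: Infinity it
// keeps retrying and eventually recovers.
var Pool = require('pool2');

var BAIL_AFTER = 0; // change to Infinity to see the pool recover
var firstAttempt = true;

var pool = new Pool({
  acquire: function (cb) {
    if (firstAttempt) {
      firstAttempt = false;
      return; // simulate a network outage: never answer, let it time out
    }
    setTimeout(function () { cb(null, { id: Date.now() }); }, 50);
  },
  dispose: function (resource, cb) { cb(); },
  min: 1,
  max: 2,
  acquireTimeout: 1000,
  bailAfter: BAIL_AFTER
});

pool.on('error', function (err) {
  console.error('pool error:', err.message);
});

// Ask for a resource every couple of seconds and report what happens.
setInterval(function () {
  pool.acquire(function (err, resource) {
    if (err) { return console.error('acquire failed:', err.message); }
    console.log('acquired resource', resource.id);
    pool.release(resource);
  });
}, 2000);
```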
If I run it with `bailAfter == 0` (the default), the pool is destroyed. Changing `bailAfter` to be `Infinity`, the pool eventually recovers.

Finally, thanks for all your work on making and maintaining this great library.