mysqljs / mysql

A pure node.js JavaScript Client implementing the MySQL protocol.
MIT License
18.27k stars 2.52k forks source link

PoolCluster: Cannot enqueue Quit after fatal error #879

Closed benthor closed 10 years ago

benthor commented 10 years ago

I have a PoolCluster of three nodes, all selected into one wildcard Round Robin pool.

canRetry is enabled and removeNodeErrorCount is set to 1

Every 30 seconds, I get a connection from the pool, check for any error and in case everything is fine, I run a 'SELECT 1' on the pool.

This works fine when all 3 nodes are reachable. However, when I take one of them offline (i.e., blackhole the route to that node) and the round robin reaches that node, I often get:

Error: Cannot enqueue Quit after fatal error.
at Protocol._validateEnqueue (node_modules/mysql/lib/protocol/Protocol.js:192:16)
at Protocol._enqueue (node_modules/mysql/lib/protocol/Protocol.js:129:13)
at Protocol.quit (node_modules/mysql/lib/protocol/Protocol.js:87:36)
at PoolConnection.end [as _realEnd] (node_modules/mysql/lib/Connection.js:214:18)
at Pool.end (node_modules/mysql/lib/Pool.js:158:16)
at PoolCluster._increaseErrorCount (node_modules/mysql/lib/PoolCluster.js:135:17)
at node_modules/mysql/lib/PoolCluster.js:153:12
at Pool.<anonymous> (node_modules/mysql/lib/Pool.js:47:16)
at Handshake.Sequence.end (node_modules/mysql/lib/protocol/sequences/Sequence.js:96:24)
at node_modules/mysql/lib/protocol/Protocol.js:367:16

My debugging shows that when this happens, it happens during the getConnection call. Code preceding that is executed but the callback is never called, the function body simply isn't reached before I get the above stacktrace.

Sometimes everything goes fine and the remove event is triggered without issue. Interesting case of a heisenbug.

Note: I am running node-mysql version 2.4.1 and node version 0.10.26

dougwilson commented 10 years ago

Every 30 seconds, I get a connection from the pool, check for error and in case everything is fine, I run a 'SELECT 1' on the pool.

Would it be possible for you to put together a little simple script that does that: makes a PoolCluster with three nodes (just using default connection config for the post) and put a 30 second endless loop that does the SELECT 1? Then see if that gives you the error, and if so, please post it. This will be extremely helpful in tracking down the issue :)

dougwilson commented 10 years ago

Sorry, I spoke too soon. The stack trace actually tells me all I need to know :) I can fix it without feedback. But it would be very interesting if you could edit your local copy and put a console.log(err.stack) here: https://github.com/felixge/node-mysql/blob/master/lib/PoolCluster.js#L153 and post what that displays right before that error you got above.

benthor commented 10 years ago
var mysql = require('mysql');
var fs = require('fs');

var config = JSON.parse(fs.readFileSync('db.json'));

var cluster = new mysql.createPoolCluster({
    canRetry: true,
    removeNodeErrorCount: 1
});

cluster.on('remove', function(id) {
    console.log('[DB] Lost connection to database host', id);
    setTimeout(function() {
        console.log('[DB] Re-adding host', id);
        cluster.add(id, hosts[id]);
    }, 20000);
});

var hosts = {};

config.hosts.forEach(function(host) {
    console.log('[DB] Adding host to cluster:', host);
    var cfg = {
        host: host,
        user: config.user,
        password: config.password,
        database: config.database
    };
    hosts[host] = cfg;
    cluster.add(host, cfg);
});

var pool = cluster.of('*', 'RR');

// periodically test the connection cluster
setInterval(function() {
    console.log('trying to get connection');
    pool.getConnection(function(err, conn) {
        if (err) {
            console.log(err);
            return;
        }
        console.log(conn.config.host);
        conn.query('SELECT 1', function(err, rows) {
            if (err) {
                console.log(err);
                return;
            }
            console.log('all is well');
        });
    });
}, 30000);

triggers the problem reproducibly. Here is the error stack that is generated:

Error: Handshake inactivity timeout
    at Handshake.sequence.on.on.on.on.on.self._connection._startTLS.err.code (mysqltest/node_modules/mysql/lib/protocol/Protocol.js:153:17)
    at Handshake.EventEmitter.emit (events.js:92:17)
    at Handshake._onTimeout (mysqltest/node_modules/mysql/lib/protocol/sequences/Sequence.js:116:8)
    at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)
    --------------------
    at Protocol._enqueue (mysqltest/node_modules/mysql/lib/protocol/Protocol.js:135:48)
    at Protocol.handshake (mysqltest/node_modules/mysql/lib/protocol/Protocol.js:52:41)
    at PoolConnection.connect (mysqltest/node_modules/mysql/lib/Connection.js:108:18)
    at Pool.getConnection (mysqltest/node_modules/mysql/lib/Pool.js:42:23)
    at PoolCluster._getConnection (mysqltest/node_modules/mysql/lib/PoolCluster.js:151:13)
    at PoolNamespace.getConnection (mysqltest/node_modules/mysql/lib/PoolCluster.js:191:11)
    at null.<anonymous> (mysqltest/mysqlbug.js:39:10)
    at wrapper [as _onTimeout] (timers.js:252:14)
    at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)

Error: Cannot enqueue Quit after fatal error.
    at Protocol._validateEnqueue (mysqltest/node_modules/mysql/lib/protocol/Protocol.js:192:16)
    at Protocol._enqueue (mysqltest/node_modules/mysql/lib/protocol/Protocol.js:129:13)
    at Protocol.quit (mysqltest/node_modules/mysql/lib/protocol/Protocol.js:87:36)
    at PoolConnection.end [as _realEnd] (mysqltest/node_modules/mysql/lib/Connection.js:214:18)
    at Pool.end (mysqltest/node_modules/mysql/lib/Pool.js:158:16)
    at PoolCluster._increaseErrorCount (mysqltest/node_modules/mysql/lib/PoolCluster.js:135:17)
    at mysqltest/node_modules/mysql/lib/PoolCluster.js:154:12
    at Pool.<anonymous> (mysqltest/node_modules/mysql/lib/Pool.js:47:16)
    at Handshake.Sequence.end (mysqltest/node_modules/mysql/lib/protocol/sequences/Sequence.js:96:24)
    at mysqltest/node_modules/mysql/lib/protocol/Protocol.js:367:16
dougwilson commented 10 years ago

Awesome! I'll try to get this fixed ASAP, since the issue is that basically any fatal error in the PoolCluster on connection will cause this issue when trying to remove the pool. It will still result in an error after the fix, but the error you'll actually get back from getConnection will be the Handshake inactivity timeout error.

dougwilson commented 10 years ago

However, when I take one of them offline (i.e., blackhole the route to that node) and the round robin reaches that node

I also need to look into this behavior, because from what you're saying it sounds like it should really be heading to an up node instead of erroring out to you.

benthor commented 10 years ago

Awesome! (I was about to propose as much :))

(Otherwise there is not much of a point to clustering, is there?)

(Could you maybe quickly comment on how to idiomatically handle reconnects? I.e., re-adding removed nodes that are reachable again to the cluster? I feel my solution is a bit messy...)

dougwilson commented 10 years ago

Could you maybe quickly comment on how to idiomatically handle reconnects?

The PoolCluster was a contribution and I don't know a lot of the details about it still. Are you saying that once a node errors out, it is permanently removed from the list of nodes, never to return?

benthor commented 10 years ago

At least in v2.0.x, the 'remove' event got called each time a scheduled reconnect in the above style failed anew. That was convenient for trying to re-add a 'timed-out' node periodically. Now it only appears to get called once, although due to the issues mentioned I can not (yet) back this up with any tests.

Maybe a configuration option like 'reAddAfterErrorSeconds' (or something less bulky) could be added that does this automatically? I am only a pedestrian-level JavaScript hacker, otherwise I would dive into this problem myself.

(Edit: reAddNodeAfterTimeout may sound better)

dougwilson commented 10 years ago

Maybe a configuration option like 'reAddAfterErrorSeconds' (or something less bulky) could be added that does this automatically?

Right. I didn't know they didn't get re-added by themselves. Not having that feature is pretty laughable. I will get that added as well, it seems.

benthor commented 10 years ago

Awesome. A+, would bugreport again :+1: ;)

fragtom commented 10 years ago

I found that issue by searching "Error: Handshake inactivity timeout"... I've just updated to node 0.10.29 and node-mysql 2.41 and cannot establish connection to database.. The #880 would not solve the problem for me, because I've problems by startup.. Could anybody give an hint to fix it?

dougwilson commented 10 years ago

@fragtom please do not hijack a thread for something unrelated. The answer you are looking for is here: https://github.com/felixge/node-mysql/issues/883#issuecomment-50146160

Just give a larger acquireTimeout to the pool that works for your needs.

maserg commented 10 years ago

I'm so sorry guys, but I think benthor forgot about connection.release(); here

    conn.query('SELECT 1', function(err, rows) {
        if (err) {
            console.log(err);
            return;
        }
        console.log('all is well');
    });

is not it?

ratneshdeepak commented 9 years ago

Getting Quit inactivity timeout again with 2.8.0

Error: Quit inactivity timeout 
 at Quit.sequence.on.on.on.on.on.self._connection._startTLS.err.code (/var/task/node_modules/mysql/lib/protocol/Protocol.js:154:17) 
 at Quit.emit (events.js:92:17) 
 at Quit._onTimeout (/var/task/node_modules/mysql/lib/protocol/sequences/Sequence.js:116:8) 
 at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
dougwilson commented 9 years ago

Hi @ratneshdeepak , this issue is only regarding the "PoolCluster: Cannot enqueue Quit after fatal error" error, which is different from what you posted.

The error "Quit inactivity timeout" occurs only when your have a misconfigured firewall somewhere on your network dropping FIN-related packets, because we waited 30 seconds for a FIN to close down the TCP connection but ever got it and gave up.

If you need help in addition to this explanation, please feel free to open a new issue :)