gosquared / redis-clustr

Redis Cluster client for Node.js
MIT License

Strange issues with latest redis-parser #10

Closed · jbt closed this issue 8 years ago

jbt commented 8 years ago

Since node-redis update to 2.6 brought redis-parser 2.0.0, we've been seeing a lot of errors like this:

{ [AbortError: Fatal error encountert. Command aborted. It might have been processed.]
  code: 'NR_FATAL',
  command: 'QUIT',
  origin: [ReplyError: Protocol error, got "8" as reply type byte. Please report this.] }

where the "8" varies, showing up as a whole load of different values.

For the time being I'm going to lock down the redis dependency here to ~2.5.0 until we have time to investigate further and figure out what exactly is going on with the newer redis parser.
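
A minimal sketch of that pin, assuming the dependency sits in redis-clustr's package.json as usual (surrounding fields omitted):

{
  "dependencies": {
    "redis": "~2.5.0"
  }
}

The ~ range keeps installs on 2.5.x patch releases without pulling in 2.6 and its new parser.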

BridgeAR commented 8 years ago

@jbt what commands and what options do you use? And please activate the debug mode to see what happens inside node_redis.

jbt commented 8 years ago

I've been trying to narrow this down to a minimal test case to figure out whether the bug is here in redis-clustr, in node_redis, or in redis-parser. It's proving a little tricky to reduce, but as soon as I've got a better idea of exactly what's causing the problems I'll open issues in the relevant places.

jbt commented 8 years ago

Slowly beginning to track this down. It looks like the error is coming most often from a cluster slots command. More information as and when I can find it out.
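
(For context: CLUSTER SLOTS returns a fairly large nested reply, which gives the parser plenty of multi-chunk work to do. Per the Redis docs it comes back through node_redis roughly in this shape; the logging here is just illustrative:)

cli.cluster('slots', function(err, slots) {
  // slots is an array of slot ranges, each shaped like:
  // [ startSlot, endSlot, [masterHost, masterPort], [replicaHost, replicaPort], ... ]
  console.log(slots);
});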

jbt commented 8 years ago

Putting here mostly for my own reference - I'll pick this up tomorrow and open some proper issues:

It seems not to be an issue with redis-clustr. The following test case seems to throw the same error with redis@2.6.1 but not redis@2.5.3, although only very sporadically:

var redis = require('redis');

[
  // big long list of { host: blah, port: blah }
].forEach(function(c) {
  var cli = redis.createClient(c.port, c.host);

  cli.on('error', function(err) {
    console.error(err);
    console.error(err.stack);
  });

  cli.on('ready', function() {
    cli.cluster('slots', function() {});
  });
});

With only one host in the list it fails roughly 10% of the time, erroring with a [ReplyError: Protocol error, got "3" as reply type byte. Please report this.] (where the "3" can be "5" or "\r" or many other things).

Occasionally I'll also see this:

/home/jt/Development/testfoo/node_modules/redis/index.js:710
    if (typeof command_obj.callback === 'function') {
                          ^

TypeError: Cannot read property 'callback' of undefined
    at normal_reply (/home/jt/Development/testfoo/node_modules/redis/index.js:710:27)
    at RedisClient.return_reply (/home/jt/Development/testfoo/node_modules/redis/index.js:816:9)
    at JavascriptRedisParser.Parser.returnReply (/home/jt/Development/testfoo/node_modules/redis/index.js:188:18)
    at JavascriptRedisParser.execute (/home/jt/Development/testfoo/node_modules/redis/node_modules/redis-parser/lib/parser.js:392:12)
    at Socket.<anonymous> (/home/jt/Development/testfoo/node_modules/redis/index.js:267:27)
    at emitOne (events.js:77:13)
    at Socket.emit (events.js:169:7)
    at readableAddChunk (_stream_readable.js:153:18)
    at Socket.Readable.push (_stream_readable.js:111:10)
    at TCP.onread (net.js:531:20)

I'll check up on what redis versions are doing this (pretty sure there're some servers in my test list running old versions) to make sure it's not a silly redis bug, and then open some issues tomorrow when I haven't been staring at it for hours.

BridgeAR commented 8 years ago

The error is very likely not an error on your side. The parser error can pretty much only be returned if there's a mistake in the parser code (or if Redis returned faulty data, which is highly unlikely). That's why the error message explicitly asks for it to be reported ;-).

But currently I don't really know where to search for the error. Please comment out line 330 of the parser, increase the stack trace limit in the ReplyError (line 7) to a high value (e.g. 20), and activate the debug mode (NODE_DEBUG=redis). That output will hopefully help a lot in finding the bug.
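
For anyone following along: the stack trace limit tweak just means raising V8's Error.stackTraceLimit around the point where the ReplyError captures its trace. A rough sketch, assuming redis-parser keeps the limit low for performance (the exact surrounding code at the file/line mentioned above may differ):

// Inside the ReplyError constructor, temporarily raise the limit
// so the captured trace shows the full call path.
var limit = Error.stackTraceLimit;
Error.stackTraceLimit = 20; // normally kept small for speed
Error.captureStackTrace(this, this.constructor);
Error.stackTraceLimit = limit;

The repro script can then be run with NODE_DEBUG=redis set in the environment to get the debug output.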

Ping @Salakar

jbt commented 8 years ago

Ok I tried commenting out the decreaseBufferPool interval and I'm still seeing both these weird errors.

I said yesterday that I could reproduce it with a single redis instance but I think I was wrong - it looks like it only happens with more than one client open. So it looks to me like it could be some sort of shared buffer being used by multiple clients.

Debugging output from two such cases here (for the Cannot read property 'callback' of undefined error) and here (for the reply type byte error).

In both cases all the redis servers in question were running redis 3.0.5, with the test script from my earlier comment. The error doesn't happen reliably but I sometimes have to run the script a few times before seeing the error.

Would you like me to open an issue / multiple issues on redis-parser for this? Happy to provide any other information that'll help track this down.

jbt commented 8 years ago

I also tried uncommenting the buffer_from_socket.toString() log in node_redis line 266 and that produced this output which might help. In that case I'm using two hosts (10.0.105.126:6380 and 10.0.105.254:6380)

BridgeAR commented 8 years ago

I found the issue. I forgot that Buffer.slice() does not copy the buffer but instead returns a reference to the same memory. So if multiple parsers receive data concurrently, and one parser still holds data from a former chunk that was sliced out of the bufferPool while another parser writes into the bufferPool as well, the first parser ends up with faulty data in its buffer.
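
A minimal sketch of that aliasing (the bufferPool and parser names here are illustrative, not the actual redis-parser internals):

// Buffer.slice() returns a view onto the same memory, not a copy.
var bufferPool = Buffer.alloc(16);

// Parser A stashes the leftover bytes of a partial reply by slicing the pool.
bufferPool.write('PARTIAL-REPLY-A');
var leftoverForA = bufferPool.slice(0, 15);

// Parser B then reuses the same pool for its own chunk...
bufferPool.write('DATA-FOR-B-ONLY');

// ...and parser A's stashed bytes have silently changed underneath it.
console.log(leftoverForA.toString()); // 'DATA-FOR-B-ONLY'

Copying the leftover bytes out of the pool, rather than keeping a slice, avoids the aliasing.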

jbt commented 8 years ago

Glad I could help with tracking this down :) Would you like me to open an issue on node-redis-parser or shall I leave it to you to take it from here?

jbt commented 8 years ago

redis-parser@2.0.1 seems to resolve these issues, thanks @BridgeAR!

I'll do some more checking that everything's good between redis-clustr and redis@2.6, then push an update to npm.

jbt commented 8 years ago

Oops, I forgot to close this when I published v1.5.0. All is good now :+1: