uber-archive / multitransport-jsonrpc

JSON-RPC Client (Node.js & Browser) and Server (Node.js) aim at "natural looking" server and client code.

TCP client transport stops buffering after reconnection #59

Closed. yelworc closed this issue 9 years ago.

yelworc commented 9 years ago

I am working on a node cluster application where the worker processes are communicating with each other via multitransport-jsonrpc (over TCP). To conveniently sidestep timing issues during startup, I am relying on the built-in reconnection mechanism: each process starts up its own server, and starts up clients for all other processes. Most of the time this works, but occasionally I'm seeing Error: Connection Unavailable messages. When this happens, the logs during the connection setup phase contain something like the following from the multitransport-jsonrpc client:

onClose.reconnect - old con is: undefined
new.con.created - con: 0.6775611275807023
con.error - 0.6775611275807023
con.close - 0.6775611275807023
onClose null
onClose if (retries) - old con is: null
call onClose.reconnect for retry === 1 - old con: null
onClose.reconnect - old con is: null
new.con.created - con: 0.9961759466677904
net.connect.callback - new con: 0.9961759466677904. old con: null
Stopping the buffering of requests on 0.9961759466677904

After comparing this with the client TCP transport code, it looks to me like stopBuffering is scheduled (and called) right after the first reconnection attempt (unless I'm missing something, which is not unlikely). I'm not specifying the stopBufferingAfter option explicitly, so it should default to 0; by my reading of the docs, that should mean requests keep being buffered indefinitely, but the current behavior seems to be the opposite of that.
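
For reference, a minimal sketch of a TCP client using this option; the module exports and the ClientTcp constructor arguments shown here are assumptions on my part and may differ from the actual API in your version:

```js
// Sketch only: exports and constructor shapes are assumed, not copied from the docs.
var jsonrpc = require('multitransport-jsonrpc');
var Client = jsonrpc.client;
var ClientTcp = jsonrpc.transports.client.tcp;

// stopBufferingAfter: 0 (the default) is supposed to mean "keep buffering
// requests across reconnect attempts indefinitely"; the behavior reported in
// this issue is that buffering stops right after the first reconnect instead.
var transport = new ClientTcp('localhost', 8001, { stopBufferingAfter: 0 });
var client = new Client(transport);
```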

dfellis commented 9 years ago

Hi @yelworc, I can see how this particular corner case could be hit. It looks like a refactor of the reconnect logic over a year ago introduced this bug. (We never noticed because our usage of the TCP transport expects network failures, mostly when one of the servers in the cluster it's talking to goes down, and we don't use that default of 0, which was designed for a smaller single-server setup or one-off scripts.)

I have a branch up that I think should fix this, but I'm struggling to figure out a unit test to reproduce this case and prevent regressions. (The branch name is "fix-retry-logic-in-buffering-case".)

Once I've done that, I'll put up a PR for it and link it to this issue. In the meantime, could you try out that branch and see whether it solves the intermittent reconnect issues you've seen, so I know this is the actual fix? (Or if you have an example file that demonstrates the issue, that would be wonderful.)

yelworc commented 9 years ago

Incidentally, your fix is very similar to the fix I hacked into my copy earlier today :) So far it is looking good. I'll let it run for a few days and report back here.

yelworc commented 9 years ago

So, since applying your fix a few days ago, I haven't seen that problem again. Thinking about the issue more broadly (and reading between the lines of your reply above), I'm coming to the conclusion that it is probably a bad idea to use that 0 value anyway, right? If the client just keeps buffering requests without ever managing to reconnect, something else is bound to break eventually. I should probably set that parameter to an appropriate limit instead, and escalate (i.e. emit alerting log messages, shut the worker down, or similar) if it can't reconnect within that time.
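
A rough sketch of that "bounded buffering plus escalation" idea; the unit of stopBufferingAfter, the callback signature, and the exact error text are assumptions and should be checked against the version of the library in use (host, port, and method name are placeholders):

```js
// Sketch only: escalate when a peer stays unreachable past the buffering window.
var jsonrpc = require('multitransport-jsonrpc');
var Client = jsonrpc.client;
var ClientTcp = jsonrpc.transports.client.tcp;

var peerHost = 'localhost'; // placeholder peer address
var peerPort = 8001;

var transport = new ClientTcp(peerHost, peerPort, {
  stopBufferingAfter: 30000 // stop queueing requests after ~30s without a connection (unit assumed)
});

new Client(transport, {}, function(client) {
  // 'someWorkerMethod' stands in for whatever method the peer actually exposes.
  client.someWorkerMethod({ task: 'ping' }, function(err, result) {
    if (err && /Connection Unavailable/.test(String(err.message || err))) {
      // The peer has been unreachable past the buffering window; escalate
      // loudly instead of letting requests queue up forever.
      console.error('peer unreachable, giving up on', peerHost + ':' + peerPort);
      process.exit(1);
    }
    // ... normal result handling ...
  });
});
```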

dfellis commented 9 years ago

Yeah, I'd agree with that. :)

I do wonder, though: why not also register a done callback on the clients and block your server initialization from completing until they've all run? Then you don't even run the risk of requests failing with Connection Unavailable unless one of the worker processes crashes.
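
A minimal sketch of that approach, assuming the Client constructor's third argument is a "connected and methods registered" callback that receives the ready client (peer addresses are placeholders):

```js
// Sketch: block worker readiness until every peer client has finished its
// initial connect/registration. The constructor callback semantics here are
// an assumption based on this thread.
var jsonrpc = require('multitransport-jsonrpc');
var Client = jsonrpc.client;
var ClientTcp = jsonrpc.transports.client.tcp;

var peers = [
  { host: 'localhost', port: 8001 }, // placeholder peer addresses
  { host: 'localhost', port: 8002 }
];
var clients = {};
var pending = peers.length;

peers.forEach(function(peer) {
  new Client(new ClientTcp(peer.host, peer.port), {}, function(client) {
    clients[peer.host + ':' + peer.port] = client;
    if (--pending === 0) onAllPeersReady();
  });
});

function onAllPeersReady() {
  // Only now start accepting work that may fan out requests to the peers.
  console.log('all peer clients connected; worker is ready');
}
```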

I'm going to merge that branch into mainline and publish a patch version.

dfellis commented 9 years ago

Now published as 0.7.2

yelworc commented 9 years ago

I'm already using the done callback as you suggested; I guess it was the rpc.methodList function that triggered the error. And yes, I need it to survive a crash/restart of a single worker process, too. Anyway, many thanks for the patch, and for the library in general!