Closed yelworc closed 9 years ago
Hi @yelworc, I can see how this particular corner case could be hit. It looks like a refactor of the reconnect logic over a year ago introduced this bug. (We never noticed because our usage of the TCP transport expects network failures, mostly when one of the servers in the cluster it's talking to goes down, and we don't use that default of 0, which was designed for a smaller single-server setup or one-off scripts.)
I have a branch up that I think should fix this, but I'm struggling to figure out a unit test to reproduce this case and prevent regressions. (The branch name is `fix-retry-logic-in-buffering-case`.)
Once I've done that, I'll put up a PR for it and link it to this issue. In the meantime, could you try out that branch and see if it solves the intermittent reconnect issues you've seen, so I know this is the actual fix? (Or if you have an example file that demonstrates the issue, that would be wonderful.)
Incidentally, your fix is very similar to the fix I hacked into my copy earlier today :) So far it is looking good. I'll let it run for a few days and report back here.
So, since applying your fix a few days ago, I haven't seen that problem again. Thinking about the issue more broadly (and reading between the lines in your reply above), I'm coming to the conclusion that it's probably a bad idea to use that 0 parameter anyway, right? If the client just keeps buffering incoming requests without being able to reconnect, something else is bound to break eventually. I should probably set that parameter to an appropriate limit instead, and escalate the issue (i.e. send alarming log messages, or shut down, or something) if it can't reconnect within that time.
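To make that escalation idea concrete, here's a minimal, self-contained sketch of the pattern, assuming a bounded buffering window. This is not the library's actual API; `makeBufferedSender`, `startBuffering`, and `onEscalate` are hypothetical names used only to illustrate the "buffer for a limited time, then alarm" behavior:

```javascript
// Sketch: buffer requests for a bounded window while reconnecting,
// then escalate instead of buffering forever. All names are hypothetical.
function makeBufferedSender(stopBufferingAfterMs, onEscalate) {
  let buffering = false;
  let timer = null;

  function startBuffering() {
    buffering = true;
    if (stopBufferingAfterMs > 0) {
      // Give reconnection a bounded window; escalate if it runs out.
      timer = setTimeout(() => {
        buffering = false;
        onEscalate(new Error('Connection Unavailable'));
      }, stopBufferingAfterMs);
    }
    // A limit of 0 would mean "buffer forever", which can hide a dead
    // connection indefinitely -- the failure mode discussed above.
  }

  function reconnected() {
    if (timer) clearTimeout(timer);
    timer = null;
    buffering = false;
  }

  return { startBuffering, reconnected, isBuffering: () => buffering };
}
```

The point of the sketch is just that a positive limit turns a silent, unbounded queue into a detectable failure.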
Yeah, I'd agree with that. :)
I do wonder, though, why not simply register a `done` callback on the clients and block your server initialization from completing until they've all run? Then you don't even run the risk of sending messages that could get the `Connection Unavailable` error, unless one of the worker processes crashes.
I'm going to merge that branch into mainline and publish a patch version.
I'm already using the `done` callback as you suggested; I guess it was the `rpc.methodList` function that triggered the error. And yes, I need it to survive a crash/restart of a single worker process, too.
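For reference, the startup-gating idea from this exchange can be sketched as follows. This is an illustration, not the library's API: `connectClient` is a hypothetical stand-in for creating a transport client and waiting for its `done` callback, and the gating itself is just `Promise.all`:

```javascript
// Sketch: block server initialization until every peer client has
// finished connecting. `connectClient` is a hypothetical stand-in for
// "create a client and wait for its done callback".
function connectClient(peerName) {
  return new Promise((resolve) => {
    // Simulate an async connection handshake completing.
    setImmediate(() => resolve(peerName));
  });
}

async function initServer(peerNames) {
  // Wait for all done callbacks before reporting the server as ready;
  // only now is it safe to start routing requests to peers.
  const connectedPeers = await Promise.all(peerNames.map(connectClient));
  return connectedPeers;
}
```

As noted above, this avoids sending requests before connections exist, but it does not by itself cover a peer crashing and restarting later.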
Anyway, many thanks for the patch, and for the library in general!
I am working on a node cluster application where the worker processes communicate with each other via multitransport-jsonrpc (over TCP). To conveniently sidestep timing issues during startup, I am relying on the built-in reconnection mechanism: each process starts up its own server, and starts up clients for all other processes. Most of the time this works, but occasionally I'm seeing `Error: Connection Unavailable` messages. When this happens, the logs during the connection setup phase contain something like the following from the multitransport-jsonrpc client:

After comparing this with the client TCP transport code, it looks to me like `stopBuffering` is scheduled (and called) right after the first reconnection attempt (unless I'm missing something, which is not unlikely). I'm not specifying the `stopBufferingAfter` option explicitly, so it should be 0 by default; the current behavior seems to be the opposite of what the docs for it suggest, though.
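To spell out the documented semantics being described here, this is a hedged sketch (not the library's source) of how scheduling `stopBuffering` ought to treat the default: a value of 0 should mean "buffer indefinitely", and any positive value should only start a countdown, rather than stopping buffering right after the first reconnection attempt:

```javascript
// Sketch of the intended semantics, as I read the docs (not library code):
// stopBufferingAfter === 0  -> never stop buffering
// stopBufferingAfter  >  0  -> stop buffering after that many ms
function scheduleStopBuffering(stopBufferingAfter, stopBuffering) {
  if (stopBufferingAfter === 0) {
    // Documented default: keep buffering indefinitely, never give up.
    return null;
  }
  // Otherwise, give up (and fail queued calls) after the window elapses.
  return setTimeout(stopBuffering, stopBufferingAfter);
}
```

The reported bug, then, would be that the timer path fires even in the 0 case.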