uber-archive / multitransport-jsonrpc

JSON-RPC Client (Node.js & Browser) and Server (Node.js) aim at "natural looking" server and client code.
116 stars 22 forks source link

Add TCP client reconnect limiting logic #42

Closed ascandella closed 10 years ago

ascandella commented 10 years ago

We ran into a bug in production today where haproxy had a port bound with no backends, so it was immediately returning a null response. This caused the client TcpTransport to get stuck in an infinite loop, since it resets its retry count back to 0 after establishing a successful connection.

Previously, if the connect was successfully opened, we considered that as having a valid backend. However, for reasons mentioned above, it can be useful to limit the number of times the transport will reconnect to a server in a given time period. This commit adds two new parameters to control this behavior.

cc @dfellis @squamos

dfellis commented 10 years ago

The code looks good, and it'll certainly work, but a heuristic approach may actually be better, here.

Why? Suppose you set the reconnect limit to 10, and then there are legitimate disconnects and reconnects for the server over the course of the next week. If it goes over 10 for the client, that code is now permanently disconnected from the server.

But the heuristic could be based on when a connection is considered "stable". Say a timeout of an hour resets the reconnect level back to zero. I'd make that a second property (that could also be Infinity by default) so we can make sure a long-running client process never gets into a situation where we have to restart it for no "real" reason.

ascandella commented 10 years ago

@dfellis I can implement a heuristic approach if you'd like. It's your project, afterall :)

What do you think are some reasonable defaults? 10 reconnects, reset after a minute? an hour?

dfellis commented 10 years ago

10 and an hour are good.

I wonder if it should just reset the reconnects to zero regardless if it got a stable connection or not? Then every hour it'd retry to connect to a down service if it's truly down for maintenance for a while, and would eventually get a reconnect without intervention.

We've normally seen direct TCP connections (without HAProxy) last for about a month or two of continuous use without failure, but I'd rather not have a monthly setInterval. ;)

Also, if someone does set either value to Infinity, you can special case that to omit starting a setInterval at all.

ascandella commented 10 years ago

@dfellis @squamos I modified this PR to add two new parameters (updated in PR description, commit message and docs).

Care to take another look?

dfellis commented 10 years ago

Looks great! Merging.