uber-archive / multitransport-jsonrpc

JSON-RPC Client (Node.js & Browser) and Server (Node.js) aim at "natural looking" server and client code.
116 stars 22 forks source link

Fix client crash if corrupt TCP packet makes it through #19

Closed dfellis closed 11 years ago

dfellis commented 11 years ago

If a corrupt TCP packet makes it through and causes a JSON.parse error, the code was supposed to emit a 'babel' event and reconnect, but a bug in the client data callback code caused it to crash, instead.

countrodrigo commented 11 years ago

Looks fine. Do you have any tests for handling corrupt data?

dfellis commented 11 years ago

Not exactly. We have a test for hard disconnects, but not for corrupt TCP packets that somehow get past the TCP checks since we'd have to somehow reliably generate a corrupt message that gets past the TCP checks.

If you've got an idea on that (or some sort of mock mechanism to simulate it) I'm all ears. :)

countrodrigo commented 11 years ago

So how are corrupt tcp packets making it past the tcp checks? I am assuming you added this extra hardening for some practical reason

dfellis commented 11 years ago

The simple answer is "law of large numbers" and reliability engineering fundamentals.

The TCP link is essentially a single point of failure, and failure in that link is essentially catastrophic in our simple protocol (bare pascal strings). We're relying on TCP's own error correction recovery mechanisms to improve the reliability of our system.

But TCP reliability is statistical in nature. I don't remember the actual odds off the top of my head, but packets can get corrupted and get past the TCP protocol. We didn't see it for the entire 2 months the TCP code was running in production on a few servers, but after a few days after we switched everything over to it, it happened. The sheer flood of packets being transmitted back and forth meant that eventually it would happen.

The code that was intended to catch this had a flaw in it and caused the server to crash and restart instead of just emitting a babel event and kill the connection and re-open it (our mechanism for improving the reliability of the communication connection by essentially narrowing down the "maintenance window" to effectively zero).

So... the answer to your question is that while packet corruption making it past the TCP protocol is uber-unlikely, we are uber-scale. :) While you can't really calculate an MTBF from just one data-point, it looks like its a once-a-month kind of thing, and will become more common as Uber grows and more of our stack uses multitransport-jsonrpc. :)

countrodrigo commented 11 years ago

I like the long answer, much more enlightening.

dfellis commented 11 years ago

10 characters of code supported by that explanatory text. :) Crazy, eh?