irungentoo / toxcore

The future of online communications.
https://tox.chat/
GNU General Public License v3.0
8.74k stars 1.26k forks source link

Use Protocol Buffers or some other executable form of protocol specification for all network communications #137

Closed jbangert closed 10 years ago

jbangert commented 11 years ago

Currently, Tox uses handwritten and ad-hoc serialization and deserialization code on the network. In addition to possibly leading to security vulnerabilities (hand-writing parsers in C is prone to security errors), this makes it hard to extend the protocol in the future or to write implementations in another language. With something like protocol buffers, we can write a protocol specification (in e.g. the protobuf interface descriptor language) and automatically create all of the network-parsing code from there (reducing the amount of critical hand-written code that could have security vulns). If the protocol extends in the future, protobuf guarantees some level of backwards compatibility (i.e. the new message would be discarded if it is 'optional'). Furthermore, if someone else wants to implement the ProjectTox core in another library or language, that implementation would not have to reimplement the network parser, guaranteeing that different implementations will be compatible).

For background on why hand written parsers are bad, see this awesome talk: http://prezi.com/rhlij_momvrx/langsec-2011-2016/

ghost commented 11 years ago

@jbangert Please take into account that the tox, nTox and toxic are written in C. PB is written in C++, which means that adding a lot of blob to the linked code. The second issue is the complexity of such a solution - it is good to keep the core as much as possible free from any unnecessary dependencies. Simplicity is the key to safety.

jbangert commented 11 years ago

Complexity is not measured in lines of code or binary size - complexity is how hard it is to understand the code and (with respect to security, which should be THE FIRST concern for something like tox) the security assumptions made therein. Checking if your protocol parser makes any unwarranted assumptions is a sufficient complexity and introduces a lot of new code, whereas we can be relatively certain that the Protocol Buffers code is pretty well vetted (and you don't need to understand its internals in order to use it, that's the beauty of abstraction).

A shorter version of the argument against hand-rolled parsing is here (believe me, noone has ever gotten a parser for a complicated protocol right. Especially not in C) : http://www.cs.dartmouth.edu/~sergey/langsec/occupy/

ghost commented 11 years ago

C vs C++'s blob: https://github.com/orangeduck/libCello/issues/23

You gave a very bad example with 'occupy'. Exchange protocols (ITCH, OUTCH etc.) are usually very simple. Nobody is going to risk using an external parser in such critical code.

Protocol Buffer was not originally designed for messaging but as a storage format, so rather closer to NoSQL. If you no longer want to use an external parser is much better for the tox will be blink or msgpack.

http://blinkprotocol.org/ http://blinkprotocol.org/s/tutorial.html

http://blog.blinkprotocol.org/2013/01/blink-compared-to-google-protocol.html

blink (which was designed for messaging) vs PB (which was designed for data storage):

GPB     16 secs 160 ns/msg 16 bytes/msg   ~6 million msgs/sec
Blink  2.4 secs  24 ns/msg 11 bytes/msg  ~40 million msgs/sec
ghost commented 11 years ago

If security, then netstring protocol may be a good solution for us (used e.g. in qmail).

http://cr.yp.to/proto/netstrings.txt https://github.com/PeterScott/netstring-c

jbangert commented 11 years ago

a) "Nobody is going to risk using an external parser in such critical code." - What do you mean? not invented here? Google's core engineers are better (and their code gets more review, attention, etc). than anything we can produce.

b) Protocol Buffer's strength is extensibility and stability. For now, blink is in beta4 and their specification is pretty complicated, whereas of all the alternatives proposed, Protobufs is the only one that has seen production use for a long time . The way to implement message multiplexing (i.e. passing different requests, e.g. DHT, video, chat, etc.) over one channel (i.e. one UDP socket) is to declare all the different message types as optional, which CapnProto doesn't do very well (meaning you have to hand-roll multiplexing). Blink solves this nicely, although slighly more complicated (with message groups).

c) Performance does not matter that much: Project Tox would be equally useful if it came in a 20mb binary, but if it has a security flaw, it is completely useless. Remember that this is a piece of software, probably running in a users desktop session that accepts connections from any node on the Internet.

d) Netstrings are a nice primitive for building packets, but they do not cover extensibility of the protocol (or any other representation for complicated data types) in any way.

ghost commented 11 years ago

@ jbangert: a) "Nobody is going to risk using an external parser in such critical code." - What do you mean? not invented here? Google's core engineers are better (and their code gets more review, attention, etc). than anything we can produce.

You have few exchange protocols: ITCH, OUTCH (NASDAQ), UTP MD, XDP (NYSE), PITCH (BATS). These protocols are in binary form and very easy to convert from/to C/C++ struct. If you produce critical software you want to have a code that you can be verified and tested. You can of course find external parsers for this, but all serious players do their own parsers. The only exception might be FPGAs implementation where whole is written in HDL (VHDL, verilog).

EDIT: FIX protocol is an exception, where it is more profitable to use an external product (free: QuickFIX, fix8). But if we are talking about the native exchange protocols, that is, as I wrote.

This topic, however, does not have much in common with the tox..

ghost commented 10 years ago

@sometwo It usually takes 7 months for issues to timeout so we will just have to wait a bit longer.

irungentoo commented 10 years ago

The Tox protocol is very easy to parse in C which means little chance of issues.