ssbc / ssb-db

A database of unforgeable append-only feeds, optimized for efficient replication in peer-to-peer protocols
https://scuttlebot.io/

switch to JSON encoding? #33

Closed dominictarr closed 9 years ago

dominictarr commented 10 years ago
<domanic> pfraze, I have been wondering whether we should simply make secure scuttlebutt use json?
<pfraze> domanic, you think?
<domanic> would make implementing much easier because you wouldn't have to implement a binary parser, and debug it
<domanic> (thinking of third party implementations here)
<domanic> binary is awkward though, we'd need to use base64 for hashes and signatures
<pfraze> yeah, that'd be the downside
<domanic> but... we could still save bandwidth by just compressing the streams
<domanic> that would probably bring it back down to a binary format
<domanic> question: will it be faster than binary?
<domanic> this is slightly faster: https://github.com/mafintosh/protocol-buffers
<pfraze> might be in js environments, yeah
<pfraze> interesting, how recent is that?
<domanic> only a few months old
<pfraze> hmm. Not an order of magnitude, but still faster
<domanic> yeah, < 10% faster
<domanic> but the C implementation will be faster
<mafintosh> pfraze, domanic: if you check out the source you'll see that readability is a trade-off for the performance we ended up getting out of the protocol-buffers module
<domanic> mafintosh, that is what I'm thinking - it's better to optimize for adoptability
<domanic> make it easier to implement a competing implementation - you need that for a p2p protocol to be truly decentralized
<mafintosh> domanic: protocol-buffers parsers are widely available though for almost every platform/language
<domanic> true, but for ssb we need an unstructured embedded format anyway
<domanic> like json - currently we are using msgpack
<domanic> ... but the js implementation for that is slower than JSON
<mafintosh> domanic: ah okay
<pfraze> yeah, pbuf requires a schema definition
<pfraze> json is wire readable
<domanic> on the other hand - there is a kinda macho thing here...
<domanic> binary protocol is more hard core
<mafintosh> domanic: i don't think you'll be able to get JSON-like perf out of a non-schema binary protocol anyways
<domanic> and if you are gonna mess with crypto then you better be able to handle binary
<domanic> mafintosh, yeah, no, especially not in pure js
<mafintosh> domanic: unrelated, we can probably speed up https://github.com/dominictarr/varstruct by orders of magnitude if we code generate the parsers etc
<domanic> mafintosh, for sure
<domanic> the problem with JSON though, when doing crypto/signatures is that JSON is unstable
<domanic> because of eg, whitespace. 
<nathan7> sorted JSON works
<mafintosh> domanic: also how unicode characters are encoded
<domanic> nathan7, but if you have to sort the json then you don't have the perf of the native JSON implementation
<domanic> this might be better: https://camlistore.googlesource.com/camlistore/+/master/doc/json-signing/json-signing.txt
<mafintosh> i ran into that problem yesterday while trying to generate shasums for docker images (their JSON encoder mapped unicode chars to '\u...' etc)
<domanic> actually I'm gonna test that
<jbenet> domanic: protobuf with optional fields
<domanic> jbenet, that works for the outer message, but the inner message has an arbitrary structure
<jbenet> Oh yeah, I just use an opaque 'bytes Data' field
<domanic> jbenet, that is what we started with too
<domanic> but we decided there was a lot to be gained from a consistent format
<domanic> in particular - we could index structures within messages
<domanic> so if apps create messages that securely refer to other messages, we can detect that without relying on the app (which might be untrustworthy)
<domanic> pfraze, okay so straightforward JSON.stringify is 4 times faster than sorting first
<pfraze> domanic, how does it compare to msgpack-js?
<domanic> pfraze, okay - sorted stringify is slower than msgpack + varstruct
<domanic> but json is still 2.5 times faster than varstruct/msgpack
<pfraze> will the character encoding and whitespace issues be significant?
<pfraze> if not, given the wire-readability, I think that might be the right call
<domanic> okay so there are a few other factors
<domanic> how fast to verify a signature?
<pfraze> there'd probably be a lot of base64/buffer conversions
<domanic> the only thing is that we'd either have to re-encode consistently, OR keep the encoded version around for every decoded message
<domanic> my hunch is the latter is gonna be cheaper
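
For reference, the "sorted json" nathan7 mentions is easy to sketch, and the sketch shows why it loses the native perf: all the key-walking happens in js. (A hypothetical helper, not the module ssb ended up using.)

```js
// minimal sketch of deterministic stringify via sorted keys;
// stableStringify is an illustrative name, not ssb's actual code
function stableStringify (val) {
  if (Array.isArray(val))
    return '[' + val.map(stableStringify).join(',') + ']'
  if (val && typeof val === 'object')
    return '{' + Object.keys(val).sort().map(function (k) {
      return JSON.stringify(k) + ':' + stableStringify(val[k])
    }).join(',') + '}'
  return JSON.stringify(val) // primitives fall through to the native encoder
}
```

Timing this against plain JSON.stringify on the same objects is all it takes to reproduce the roughly 4x gap reported above.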

issues to investigate

this is a nice method that works around the whitespace problem: https://camlistore.googlesource.com/camlistore/+/master/doc/json-signing/json-signing.txt

maybe we should do that, and keep the encoded form around so that "re-encoding" is just returning that string.
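
Roughly, as I read that doc (the field and algorithm names below are illustrative, and the crypto plumbing is node's built-in crypto rather than whatever we'd actually use): serialize once, sign the bytes up to the closing brace, splice the signature in as the last field, and keep the resulting string as both the wire encoding and the signed form.

```js
var crypto = require('crypto')

// sketch only: assumes a non-empty object and PEM-encoded RSA keys;
// "sig" is an illustrative field name
function signJSON (obj, privateKey) {
  var json = JSON.stringify(obj)
  var body = json.slice(0, json.lastIndexOf('}')) // drop the closing brace
  var sig = crypto.createSign('RSA-SHA256')
    .update(body).sign(privateKey, 'base64')
  return body + ',"sig":"' + sig + '"}' // still valid JSON
}

function verifyJSON (str, publicKey) {
  var i = str.lastIndexOf(',"sig":"')
  var sig = str.slice(i + 8, str.lastIndexOf('"'))
  return crypto.createVerify('RSA-SHA256')
    .update(str.slice(0, i)) // exactly the bytes that were signed
    .verify(publicKey, sig, 'base64')
}
```

The verifier checks exactly the bytes it received, so whitespace and unicode escaping never have to be re-canonicalized.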

Also, if every object can be encoded on a single line, then we can just use line-separated json, which is much cheaper than a streaming json parser, which would have to be implemented in js.
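
Parsing that framing is just a split on '\n' plus the native parser; a hand-rolled sketch:

```js
// assumes utf-8 string chunks, and that every object is
// stringified without embedded newlines
function ldjson (onObject) {
  var buf = ''
  return function write (chunk) {
    buf += chunk
    var lines = buf.split('\n')
    buf = lines.pop() // keep the trailing partial line for the next chunk
    lines.forEach(function (line) {
      if (line) onObject(JSON.parse(line))
    })
  }
}
```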

dominictarr commented 10 years ago

Okay, so the problem with JSON is that it sucks for binary. If we wanted to send a gzipped tarball or a png we'd have to base64 encode it, and that feels dumb... (although, meatspace works with base64 images and is cool)

I was thinking we'd have to use binary framing around a text encoding... but substack's readme shows a more brutally simple approach: if you use the hash of the object as the delimiter for that object, that is the one string that cannot appear inside the object, because the hash did not exist before the object was created. This means you could have a text protocol that supports binary without needing length delimiters, so plain-text objects stay readable.

Turns out this isn't actually how substack's module works, but this would work.
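
Sketched out (the function name is mine), the framing would look something like:

```js
var crypto = require('crypto')

// frame as <hash>\n<content>\n<hash>\n -- the hash string can't occur
// inside content, because content existed before the hash was computed;
// content is assumed to be a Buffer
function hashFrame (content) {
  var h = crypto.createHash('sha256').update(content).digest('base64')
  return Buffer.concat([
    Buffer.from(h + '\n'), content, Buffer.from('\n' + h + '\n')
  ])
}
```

A reader takes the first line as the delimiter and scans forward until that string reappears: no length prefix, and plain-text content stays readable on the wire.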

hmm. Are length delimiters even that bad? I guess we could also use the redis protocol, which supports binary content with length delimiters and is human readable if the objects are just text.
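
For comparison, the redis-style bulk frame is about as simple as framing gets (sketch):

```js
// '$' + byte length + CRLF + payload + CRLF, as in the redis protocol
function bulk (buf) {
  return Buffer.concat([
    Buffer.from('$' + buf.length + '\r\n'), buf, Buffer.from('\r\n')
  ])
}
// bulk(Buffer.from('hello')) encodes as: $5\r\nhello\r\n
```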

dominictarr commented 9 years ago

implemented.