lf-lang / lingua-franca

Intuitive concurrent programming in any language
https://www.lf-lang.org

Choosing a data serialization format (eg protocol buffers) #96

Open MattEWeber opened 4 years ago

MattEWeber commented 4 years ago

We've had some discussion recently on drawbacks of protocol buffers, so I thought it would be good to actually learn about and catalogue the differences between some of the popular alternatives. Here's what I found:

A larger but less detailed comparison is at https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats


Human Readable but Inefficient

----JSON----

pros:

  • Human readable
  • Very compatible with the TS target
  • Parsers and encoders are available for essentially every conceivable target
  • Many programmers are used to it
  • No setup required
  • Allows arbitrary nesting of arrays and key-value maps (i.e. objects)

cons:

  • Relatively slow to parse
  • Large message size
  • Code to validate messages has to be hand-written inside a reactor (see the sketch below) -- Note: JSON schemas do exist, but few people use them

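For concreteness, here is a minimal TypeScript sketch of what that hand-written validation might look like; the Position type and its field checks are made up for illustration:

```ts
// Hypothetical message type; in practice this would match whatever the reactors exchange.
interface Position {
  x: number;
  y: number;
}

function parsePosition(raw: string): Position {
  const obj = JSON.parse(raw); // throws SyntaxError on malformed input
  if (typeof obj?.x !== "number" || typeof obj?.y !== "number") {
    throw new Error("Invalid Position message: " + raw);
  }
  return obj as Position;
}

const wire = JSON.stringify({ x: 1.5, y: -2.0 }); // encode
const pos = parsePosition(wire);                  // decode + validate by hand
```
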
----XML----

pros:

  • Human readable
  • Parsers and encoders are available for essentially every conceivable target
  • Many programmers have used it before
  • Setup is optional. Schemas are necessary for validation but not required.
  • A binary encoding called Efficient XML Interchange exists, but with limited support

cons:

  • Very slow to parse
  • Huge message size
  • Schemas are strict and hard to maintain as code evolves over time


Fast Binary Encodings

----MessagePack---- "It's like JSON. but fast and small."

pros:

  • Supported for 101 languages!
  • An efficient binary serialization format
  • Nestable maps and arrays
  • No schema means it's very flexible and no compiler is needed (see the sketch below)

cons:

  • No schema means data isn't validated
  • No attached RPC mechanism like Protocol Buffers or Thrift

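As a rough sketch of the API, assuming the @msgpack/msgpack npm package (other implementations differ in naming but look similar):

```ts
import { encode, decode } from "@msgpack/msgpack";

// Nested maps and arrays round-trip with no schema and no generated code.
const msg = { sensor: "imu", samples: [[0.1, 0.2], [0.3, 0.4]], meta: { seq: 42 } };

const bytes: Uint8Array = encode(msg); // compact binary encoding
const back = decode(bytes);            // returns unknown; any validation is up to the caller
```
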
----BSON---- See: http://bsonspec.org/ A binary-encoded serialization of JSON-like documents. Basically, native language primitive types get encoded into a JSON-like structure (see the sketch below).

pros:

  • "more 'schema-less' than Protocol Buffers" means it's more flexible
  • Supported for 27 languages
  • Faster to decode than JSON

cons:

  • "more 'schema-less' than Protocol Buffers" also means it's validated less
  • Slightly less space efficient than Protocol Buffers and JSON

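A rough sketch, assuming the js-bson (bson) npm package:

```ts
import { serialize, deserialize } from "bson";

// BSON requires a document (i.e. an object) at the top level.
const doc = { name: "sensor-1", readings: [1.0, 2.5, 3.7], nested: { ok: true } };

const bytes = serialize(doc);    // binary encoding as a byte buffer
const back = deserialize(bytes); // plain JS object again, still unvalidated
```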

Binary Encoding + RPC

----Protocol Buffers----

pros:

  • Very small message sizes
  • Very quick to parse
  • Officially supported for 8 potential targets (including C and C++) -- Unofficially supported for 30 additional targets (including JS and TS)
  • Intended to be backward and forward compatible as message formats evolve
  • Supports importing other .proto message definitions
  • Well documented
  • Designed for compatibility with gRPC for remote procedure calls

cons:

  • Only usable with .proto definitions of message formats
  • Requires installation of a compiler for each language
  • Generated library files for parsing and encoding have to be managed and linked/imported into reactor code (see the sketch below)
  • Weird type system and rules for writing .proto files -- You have to assign unique numbers to fields -- Messages can be nested, but key-value maps can't -- No explicit lists, but fields containing primitives or other messages can be repeated. Maps can't be repeated
  • Via https://reasonablypolymorphic.com/blog/protos-are-wrong/ -- Missing fields can't be distinguished from fields assigned the default value. -- Message types have counterintuitive behavior with missing values: msg.foo = msg.foo isn't a no-op; it will silently change msg to have a zero-initialized copy of foo if it previously didn't have one --- One optional field exists in a message for each case of a 'oneof', so if the message has oneof 'foo' / 'bar', then msg.foo = msg.foo will overwrite whatever data was in bar when it zero-initializes foo. --- That said, I can't think of many reasons why a programmer would write msg.foo = msg.foo in the first place. -- Backward compatibility is at odds with effective validation. -- The Protocol Buffers type system "infects" the type system of any code that has to deal with it. -- Notably, the essay doesn't give any recommendations for better technologies

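For the JS/TS side, a rough sketch using the (unofficial) protobufjs npm package, which loads .proto files by reflection instead of requiring protoc-generated code; the person.proto file and Person message here are hypothetical:

```ts
import * as protobuf from "protobufjs";

// Hypothetical person.proto, authored separately:
//   syntax = "proto3";
//   message Person { string name = 1; repeated double scores = 2; }
async function roundTrip(): Promise<void> {
  const root = await protobuf.load("person.proto"); // reflection-based load of the schema
  const Person = root.lookupType("Person");

  const payload = { name: "Ada", scores: [1, 2, 3] };
  const problem = Person.verify(payload); // returns an error string, or null if valid
  if (problem) throw new Error(problem);

  const bytes = Person.encode(Person.create(payload)).finish(); // Uint8Array on the wire
  const back = Person.decode(bytes);                            // decoded Person message
  console.log(back.toJSON());
}
```
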
----Apache Thrift---- Via https://en.wikipedia.org/wiki/Apache_Thrift

  • "Thrift is written in C++, but can create code for a number of languages. To create a Thrift service, one has to write Thrift files that describe it, generate the code in the destination language, write some code to start the server, and call it from the client."

pros:

  • Supports 28 languages: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and others
  • Thrift is a full-blown gRPC alternative: it doesn't just serialize and deserialize data, it also implements an entire stack for transmitting it.
  • Thrift's type system is based on C++ and uses structs and unions for composition. -- Container types are generic maps, sets, and lists; however, these cannot be nested within each other

cons:

  • Like Protocol Buffers, Thrift uses a custom interface description language that has to be compiled for each target language
  • I don't know whether Thrift's binary serialization is separable from the rest of the RPC stack.
  • Documentation isn't as good as Protocol Buffers'

Potentially Interesting But Not Enough Language Support

----FlatBuffers----

pros:

  • Via https://google.github.io/flatbuffers/ -- What sets FlatBuffers apart is that it represents hierarchical data in a flat binary buffer in such a way that it can still be accessed directly without parsing/unpacking, while also still supporting data structure evolution (forwards/backwards compatibility)

cons:

  • Currently only supports C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust

----Avro---- Schemas are defined in JSON

cons:

  • Currently only supports C, C++, C#, Java, Python, and Ruby

----Microsoft Bond----

pros:

  • Supports a very rich type system including inheritance, type aliases, and generics

cons:

  • Only supports C++, C#, Java, and Python

edwardalee commented 4 years ago

This is really helpful. FlatBuffers and MessagePack look the most interesting to me.

Edward


Edward A. Lee EECS, UC Berkeley eal@eecs.berkeley.edu http://eecs.berkeley.edu/~eal


hokeun commented 2 years ago

reactor1 --- number ---> networkAction --- JSON.stringify() on send / JSON.parse() plus a type annotation for the value returned from parse() on receive ---> networkAction --- number ---> reactor2
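
A minimal TypeScript sketch of that flow; the helper names below are illustrative only, not the actual runtime API:

```ts
// Illustrative only: reactor1's output is stringified before it crosses the network
// action, and reactor2's side parses it and re-establishes the static type.
function serializeForNetwork(value: number): string {
  return JSON.stringify(value);
}

function deserializeFromNetwork(payload: string): number {
  const value = JSON.parse(payload) as number; // type annotation for the value returned from parse()
  if (typeof value !== "number") {
    throw new Error("Expected a number over the network action, got: " + payload);
  }
  return value;
}

const received = deserializeFromNetwork(serializeForNetwork(42)); // 42 arrives at reactor2
```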