lf-lang / lingua-franca

Intuitive concurrent programming in any language
https://www.lf-lang.org

Choosing a data serialization format (eg protocol buffers) #96

Open MattEWeber opened 4 years ago

MattEWeber commented 4 years ago

We've had some discussion recently on drawbacks of protocol buffers, so I thought it would be good to actually learn about and catalogue the differences between some of the popular alternatives. Here's what I found:

A larger but less detailed comparison is at https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats


Human Readable but Inefficient

----JSON----

pros:

  • Human readable
  • Very compatible with the TS target
  • Parsers and encoders are available for essentially every conceivable target
  • Many programmers are used to it
  • No setup required
  • Allows arbitrary nesting of arrays and key-value maps (i.e. objects)

cons:

  • Relatively slow to parse
  • Large message size
  • Code to validate messages has to be hand-written inside a reactor (see the sketch below) -- Note: JSON schemas do exist, but few people use them

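For concreteness, here is a minimal TypeScript sketch of what that hand-written validation might look like; the Position type and its field checks are made up for illustration:

```ts
// Hypothetical message type; in practice this would match whatever the reactors exchange.
interface Position {
  x: number;
  y: number;
}

function parsePosition(raw: string): Position {
  const obj = JSON.parse(raw); // throws SyntaxError on malformed input
  if (typeof obj?.x !== "number" || typeof obj?.y !== "number") {
    throw new Error("Invalid Position message: " + raw);
  }
  return obj as Position;
}

const wire = JSON.stringify({ x: 1.5, y: -2.0 }); // encode
const pos = parsePosition(wire);                  // decode + validate by hand
```
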
----XML----

pros:

  • Human readable
  • Parsers and encoders are available for essentially every conceivable target
  • Many programmers have used it before
  • Setup is optional. Schemas are necessary for validation but not required.
  • A binary encoding called Efficient XML Interchange exists, but with limited support

cons:

  • Very slow to parse
  • Huge message size
  • Schemas are strict and hard to maintain as code evolves over time


Fast Binary Encodings

----MessagePack---- "It's like JSON. but fast and small."

pros:

  • Supported for 101 languages!
  • An efficient binary serialization format
  • Nestable maps and arrays
  • No schema means it's very flexible and no compiler is needed (see the sketch below)

cons:

  • No schema means data isn't validated
  • No attached RPC mechanism like Protocol Buffers or Thrift

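As a rough sketch of the API, assuming the @msgpack/msgpack npm package (other implementations differ in naming but look similar):

```ts
import { encode, decode } from "@msgpack/msgpack";

// Nested maps and arrays round-trip with no schema and no generated code.
const msg = { sensor: "imu", samples: [[0.1, 0.2], [0.3, 0.4]], meta: { seq: 42 } };

const bytes: Uint8Array = encode(msg); // compact binary encoding
const back = decode(bytes);            // returns unknown; any validation is up to the caller
```
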
----BSON---- See: http://bsonspec.org/ A binary-encoded serialization of JSON-like documents. Basically, native language primitive types get encoded into a JSON-like structure (see the sketch below).

pros:

  • "more 'schema-less' than Protocol Buffers" means it's more flexible
  • Supported for 27 languages
  • Faster to decode than JSON

cons:

  • "more 'schema-less' than Protocol Buffers" also means it's validated less
  • Slightly less space efficient than Protocol Buffers and JSON

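A rough sketch, assuming the js-bson (bson) npm package:

```ts
import { serialize, deserialize } from "bson";

// BSON requires a document (i.e. an object) at the top level.
const doc = { name: "sensor-1", readings: [1.0, 2.5, 3.7], nested: { ok: true } };

const bytes = serialize(doc);    // binary encoding as a byte buffer
const back = deserialize(bytes); // plain JS object again, still unvalidated
```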

Binary Encoding + RPC

----Protocol Buffers----

pros:

  • Very small message sizes
  • Very quick to parse
  • Officially supported for 8 potential targets (including C and C++) -- Unofficially supported for 30 additional targets (including JS and TS)
  • Intended to be backward and forward compatible as message formats evolve
  • Supports importing other .proto message definitions
  • Well documented
  • Designed for compatibility with gRPC for remote procedure calls

cons:

  • Only usable with .proto definitions of message formats
  • Requires installation of a compiler for each language
  • Generated library files for parsing and encoding have to be managed and linked/imported into reactor code (see the sketch below)
  • Weird type system and rules for writing .proto files -- You have to assign unique numbers to fields -- Messages can be nested, but key-value maps can't -- No explicit lists, but fields containing primitives or other messages can be repeated. Maps can't be repeated
  • Via https://reasonablypolymorphic.com/blog/protos-are-wrong/ -- Missing fields can't be distinguished from fields assigned the default value. -- Message types have counterintuitive behavior with missing values: msg.foo = msg.foo isn't a no-op; it will silently change msg to have a zero-initialized copy of foo if it previously didn't have one --- One optional field exists in a message for each case of a 'oneof', so if the message has oneof 'foo' / 'bar', then msg.foo = msg.foo will overwrite whatever data was in bar when it zero-initializes foo. --- That said, I can't think of many reasons why a programmer would write msg.foo = msg.foo in the first place. -- Backward compatibility is at odds with effective validation. -- The Protocol Buffers type system "infects" the type system of any code that has to deal with it. -- Notably, the essay doesn't give any recommendations for better technologies

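For the JS/TS side, a rough sketch using the (unofficial) protobufjs npm package, which loads .proto files by reflection instead of requiring protoc-generated code; the person.proto file and Person message here are hypothetical:

```ts
import * as protobuf from "protobufjs";

// Hypothetical person.proto, authored separately:
//   syntax = "proto3";
//   message Person { string name = 1; repeated double scores = 2; }
async function roundTrip(): Promise<void> {
  const root = await protobuf.load("person.proto"); // reflection-based load of the schema
  const Person = root.lookupType("Person");

  const payload = { name: "Ada", scores: [1, 2, 3] };
  const problem = Person.verify(payload); // returns an error string, or null if valid
  if (problem) throw new Error(problem);

  const bytes = Person.encode(Person.create(payload)).finish(); // Uint8Array on the wire
  const back = Person.decode(bytes);                            // decoded Person message
  console.log(back.toJSON());
}
```
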
----Apache Thrift---- Via https://en.wikipedia.org/wiki/Apache_Thrift

  • "Thrift is written in C++, but can create code for a number of languages. To create a Thrift service, one has to write Thrift files that describe it, generate the code in the destination language, write some code to start the server, and call it from the client."

pros:

  • Supports 28 languages: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and others
  • Thrift is a full-blown gRPC alternative: it doesn't just serialize and deserialize data, it also implements an entire stack for transmitting it.
  • Thrift's type system is based on C++ and uses structs and unions for composition. -- Container types are generic maps, sets, and lists; however, these cannot be nested within each other

cons:

  • Like Protocol Buffers, Thrift uses a custom interface description language that has to be compiled for each target language
  • I don't know whether Thrift's binary serialization is separable from the rest of the RPC stack.
  • Documentation isn't as good as Protocol Buffers'

Potentially Interesting But Not Enough Language Support

----FlatBuffers----

pros:

  • Via https://google.github.io/flatbuffers/ -- What sets FlatBuffers apart is that it represents hierarchical data in a flat binary buffer in such a way that it can still be accessed directly without parsing/unpacking, while also still supporting data structure evolution (forwards/backwards compatibility)

cons:

  • Currently only supports C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust

----Avro---- Schemas are defined in JSON

cons:

  • Currently only supports C, C++, C#, Java, Python, and Ruby

----Microsoft Bond----

pros:

  • Supports a very rich type system including inheritance, type aliases, and generics

cons:

  • Only supports C++, C#, Java, and Python

edwardalee commented 4 years ago

This is really helpful. FlatBuffers and MessagePack look the most interesting to me.

Edward


Edward A. Lee EECS, UC Berkeley eal@eecs.berkeley.edu http://eecs.berkeley.edu/~eal


hokeun commented 2 years ago

reactor1 --- number ---> networkAction --- JSON.stringify() on send / JSON.parse() plus a type annotation for the value returned from parse() on receive ---> networkAction --- number ---> reactor2
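
A minimal TypeScript sketch of that flow; the helper names below are illustrative only, not the actual runtime API:

```ts
// Illustrative only: reactor1's output is stringified before it crosses the network
// action, and reactor2's side parses it and re-establishes the static type.
function serializeForNetwork(value: number): string {
  return JSON.stringify(value);
}

function deserializeFromNetwork(payload: string): number {
  const value = JSON.parse(payload) as number; // type annotation for the value returned from parse()
  if (typeof value !== "number") {
    throw new Error("Expected a number over the network action, got: " + payload);
  }
  return value;
}

const received = deserializeFromNetwork(serializeForNetwork(42)); // 42 arrives at reactor2
```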