only-cliches / NoProto

Flexible, Fast & Compact Serialization with RPC

NoProto: Flexible, Fast & Compact Serialization with RPC

Github | Crates.io | Documentation


Features

Lightweight

Stable

Easy

Fast

Powerful

Why ANOTHER Serialization Format?

  1. NoProto combines the performance of compiled formats with the flexibility of dynamic formats:

Compiled formats like Flatbuffers, Cap'n Proto and bincode have amazing performance and extremely compact buffers, but you MUST compile the data types into your application. This means that if the schema of the data changes, the application must be recompiled to accommodate the new schema.

Dynamic formats like JSON, MessagePack and BSON give you the flexibility to store any data with any schema at runtime, but the buffers are fat and performance is somewhere between horrible and hopefully acceptable.

NoProto takes the performance advantages of compiled formats and implements them in a flexible format.

  2. NoProto is a key-value database focused format:

Byte Wise Sorting: Ever try to store a signed integer as a sortable key in a database? NoProto can do that. Almost every data type is stored in the buffer as byte-wise sortable, meaning buffers can be compared at the byte level for sorting without deserializing.
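
To make the signed integer case concrete, here is a minimal sketch of one common way to get byte-wise sortable integers: flip the sign bit and store the result big endian. The function names are hypothetical and the README does not spell out NoProto's exact encoding, so treat this as an illustration of the general technique rather than the library's implementation.

```rust
/// Encode an i32 so that comparing the output bytes as unsigned values
/// gives the same order as comparing the original integers.
/// Flipping the sign bit maps i32::MIN..=i32::MAX onto 0..=u32::MAX in order,
/// and big endian puts the most significant byte first.
fn sortable_i32(value: i32) -> [u8; 4] {
    ((value as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Inverse of `sortable_i32`.
fn from_sortable_i32(bytes: [u8; 4]) -> i32 {
    (u32::from_be_bytes(bytes) ^ 0x8000_0000) as i32
}
```

With this encoding a key-value store that only knows how to compare raw bytes will still return negative numbers before positive ones.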

Primary Key Management: Compound sortable keys are extremely easy to generate, maintain and update with NoProto. You don't need a custom sort function in your key-value store, you just need this library.
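
As a rough sketch of what a compound sortable key looks like at the byte level, the example below concatenates fixed-width, big-endian fields so that plain byte comparison sorts by the first field, then the second, then the third. The field names (`tenant`, `timestamp`, `seq`) and the function are hypothetical, and this is hand-rolled code rather than NoProto's API.

```rust
/// Build a 14-byte key that sorts by (tenant, timestamp, seq)
/// under plain lexicographic byte comparison.
fn compound_key(tenant: u16, timestamp: i64, seq: u32) -> Vec<u8> {
    let mut key = Vec::with_capacity(14);
    // Unsigned fields only need big-endian byte order.
    key.extend_from_slice(&tenant.to_be_bytes());
    // Signed field: flip the sign bit so negative values sort first.
    key.extend_from_slice(&((timestamp as u64) ^ (1u64 << 63)).to_be_bytes());
    key.extend_from_slice(&seq.to_be_bytes());
    key
}
```

Because every field has a fixed width, no separator bytes are needed and a prefix of the key (for example just the tenant) can be used for range scans.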

UUID & ULID Support: NoProto is one of the few formats that comes with first-class support for these popular primary key data types. It can easily encode, decode and generate these data types.

Fastest Updates: NoProto is the only format that supports all mutations without deserializing. It can do the common database read -> update -> write operation between 50x and 300x faster than other dynamic formats. See Benchmarks.
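
The idea behind updating without deserializing can be sketched with a toy fixed layout: if the position of a value inside the serialized bytes is known, it can be overwritten directly instead of decoding the whole record, changing it, and encoding it again. The layout and function names below are assumptions made for illustration and are not NoProto's wire format or API.

```rust
/// Toy layout (illustration only): bytes 0..2 hold `age` as a
/// big-endian u16, the remaining bytes hold a UTF-8 name.
const AGE_OFFSET: usize = 0;

fn encode_user(age: u16, name: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(2 + name.len());
    buf.extend_from_slice(&age.to_be_bytes());
    buf.extend_from_slice(name.as_bytes());
    buf
}

/// Overwrite `age` directly in the serialized bytes.
/// Returns false and changes nothing if the buffer is too short.
fn set_age_in_place(buf: &mut [u8], age: u16) -> bool {
    match buf.get_mut(AGE_OFFSET..AGE_OFFSET + 2) {
        Some(slot) => {
            slot.copy_from_slice(&age.to_be_bytes());
            true
        }
        None => false,
    }
}

/// Read `age` without touching the name bytes.
fn read_age(buf: &[u8]) -> Option<u16> {
    let bytes = buf.get(AGE_OFFSET..AGE_OFFSET + 2)?;
    Some(u16::from_be_bytes([bytes[0], bytes[1]]))
}
```

The update touches two bytes no matter how large the rest of the record is, which is where the speedup over a decode, modify, encode cycle comes from.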

Comparison With Other Formats


Compared to Apache Avro
- Far more space efficient
- Significantly faster serialization & deserialization
- All values are optional (no void or null type)
- Supports more native types (like unsigned ints)
- Updates without deserializing/serializing
- Works with `no_std`.
- Safely handles untrusted data.


Compared to Protocol Buffers
- Comparable serialization & deserialization performance
- Updating buffers is an order of magnitude faster
- Schemas are dynamic at runtime, no compilation step
- Supports more types and better nested type support
- Byte-wise sorting is a first-class operation
- Updates without deserializing/serializing
- Safely handles untrusted data.
- All values are optional and can be inserted in any order.


Compared to JSON / BSON
- Far more space efficient
- Significantly faster serialization & deserialization
- Deserialization is zero copy
- Has schemas / type safe
- Supports byte-wise sorting
- Supports raw bytes & other native types
- Updates without deserializing/serializing
- Works with `no_std`.
- Safely handles untrusted data.


Compared to Flatbuffers / Bincode
- Data types can change or be created at runtime
- Updating buffers is an order of magnitude faster
- Supports byte-wise sorting
- Updates without deserializing/serializing
- Works with `no_std`.
- Safely handles untrusted data.
- All values are optional and can be inserted in any order.



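| Format | Zero-Copy | Size Limit | Mutable | Schemas | Byte-wise Sorting |
|---|---|---|---|---|---|
| Runtime Libs | | | | | |
| NoProto | | ~4GB | | | |
| Apache Avro | | 2^63 Bytes | | | |
| JSON | | Unlimited | | | |
| BSON | | ~16MB | | | |
| MessagePack | | Unlimited | | | |
| Compiled Libs | | | | | |
| FlatBuffers | | ~2GB | | | |
| Bincode | | ? | | | |
| Protocol Buffers | | ~2GB | | | |
| Cap'N Proto | | 2^64 Bytes | | | |
| Veriform | | ? | | | |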

Quick Example

```rust
use no_proto::error::NP_Error;
use no_proto::NP_Factory;

fn main() -> Result<(), NP_Error> {
    // An ES6-like IDL is used to describe the schema for the factory.
    // Each factory represents a single schema.
    // One factory can be used to serialize/deserialize any number of buffers.
    let user_factory = NP_Factory::new(r#"
        struct({ fields: {
            name: string(),
            age: u16({ default: 0 }),
            tags: list({ of: string() })
        }})
    "#)?;

    // create a new empty buffer
    let mut user_buffer = user_factory.new_buffer(None); // optional capacity

    // set the "name" field
    user_buffer.set(&["name"], "Billy Joel")?;

    // read the "name" field
    let name = user_buffer.get::<&str>(&["name"])?;
    assert_eq!(name, Some("Billy Joel"));

    // set a nested value, the first tag in the tag list
    user_buffer.set(&["tags", "0"], "first tag")?;

    // read the first tag from the tag list
    let tag = user_buffer.get::<&str>(&["tags", "0"])?;
    assert_eq!(tag, Some("first tag"));

    // close buffer and get internal bytes
    let user_bytes: Vec<u8> = user_buffer.finish().bytes();

    // open the buffer again
    let user_buffer = user_factory.open_buffer(user_bytes);

    // read the "name" field again
    let name = user_buffer.get::<&str>(&["name"])?;
    assert_eq!(name, Some("Billy Joel"));

    // get the age field
    let age = user_buffer.get::<u16>(&["age"])?;
    // returns default value from schema
    assert_eq!(age, Some(0u16));

    // close again
    let user_bytes: Vec<u8> = user_buffer.finish().bytes();

    // we can now save user_bytes to disk,
    // send it over the network, or whatever else is needed with the data
    let _ = user_bytes;

    Ok(())
}
```

Guided Learning / Next Steps:

  1. Schemas - Learn how to build & work with schemas.
  2. Factories - Parsing schemas into something you can work with.
  3. Buffers - How to create, update & compact buffers/data.
  4. RPC Framework - How to use the RPC Framework APIs.
  5. Data & Schema Format - Learn how data is saved into the buffer and schemas.

Benchmarks

While it's difficult to properly benchmark libraries like these in a fair way, I've made an attempt in the table below. These benchmarks are available in the bench folder and you can easily run them yourself with `cargo run --release`.

The format and data used in the benchmarks were taken from the flatbuffers benchmarks github repo. You should always benchmark/test your own use case for each library before making any choices on what to use.
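
If you want to measure your own use case in the same unit as the table below (operations per millisecond), a minimal timing harness using only the standard library could look like the following. The function names are hypothetical and this is not the code from the bench folder, just a sketch of the measurement itself.

```rust
use std::time::{Duration, Instant};

/// Convert an operation count and elapsed time into ops per millisecond.
fn ops_per_ms(ops: u64, elapsed: Duration) -> f64 {
    ops as f64 / (elapsed.as_secs_f64() * 1000.0)
}

/// Run `op` `iterations` times and report the observed ops per millisecond.
/// A very short run can measure zero elapsed time, so use enough iterations.
fn measure<F: FnMut()>(iterations: u64, mut op: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iterations {
        op();
    }
    ops_per_ms(iterations, start.elapsed())
}
```

Build with `--release` when timing, since debug builds can be many times slower and will distort comparisons between libraries.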

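Legend: Ops / Millisecond, higher is better

| Format / Lib | Encode | Decode All | Decode 1 | Update 1 | Size (bytes) | Size (Zlib) |
|---|---|---|---|---|---|---|
| Runtime Libs | | | | | | |
| NoProto | | | | | | |
| no_proto | 1393 | 1883 | 55556 | 9524 | 308 | 198 |
| Apache Avro | | | | | | |
| avro-rs | 156 | 57 | 56 | 40 | 702 | 337 |
| FlexBuffers | | | | | | |
| flexbuffers | 444 | 962 | 24390 | 294 | 490 | 309 |
| JSON | | | | | | |
| json | 609 | 481 | 607 | 439 | 439 | 184 |
| serde_json | 938 | 646 | 644 | 403 | 446 | 198 |
| BSON | | | | | | |
| bson | 129 | 116 | 123 | 90 | 414 | 216 |
| rawbson | 130 | 1117 | 17857 | 89 | 414 | 216 |
| MessagePack | | | | | | |
| rmp | 661 | 623 | 832 | 202 | 311 | 193 |
| messagepack-rs | 152 | 266 | 284 | 138 | 296 | 187 |
| Compiled Libs | | | | | | |
| Flatbuffers | | | | | | |
| flatbuffers | 3165 | 16393 | 250000 | 2532 | 264 | 181 |
| Bincode | | | | | | |
| bincode | 6757 | 9259 | 10000 | 4115 | 163 | 129 |
| Postcard | | | | | | |
| postcard | 3067 | 7519 | 7937 | 2469 | 128 | 119 |
| Protocol Buffers | | | | | | |
| protobuf | 953 | 1305 | 1312 | 529 | 154 | 141 |
| prost | 1464 | 2020 | 2232 | 1040 | 154 | 142 |
| Abomonation | | | | | | |
| abomonation | 2342 | 125000 | 500000 | 2183 | 261 | 160 |
| Rkyv | | | | | | |
| rkyv | 1605 | 37037 | 200000 | 1531 | 180 | 154 |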

Runtime VS Compiled Libs: Some formats require data types to be compiled into the application, which increases performance but means data types cannot change at runtime. If data types need to mutate during runtime or can't be known before the application is compiled (like with databases), you must use a format that doesn't compile data types into the application, like JSON or NoProto.
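
The distinction can be sketched in a few lines: a runtime schema is just data that the program parses and consults while it runs, so it can come from a file or a database row instead of being baked into a compiled struct. The `name:type` mini-syntax and all type names below are hypothetical and far simpler than NoProto's IDL; the sketch only shows the shape of the idea.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Str,
    U16,
}

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    U16(u16),
}

/// A schema that exists only as data, so it can be loaded or replaced
/// without recompiling the application.
struct RuntimeSchema {
    fields: HashMap<String, FieldType>,
}

impl RuntimeSchema {
    /// Parse a tiny "name:type, name:type" description (hypothetical syntax).
    fn parse(text: &str) -> Option<RuntimeSchema> {
        let mut fields = HashMap::new();
        for part in text.split(',') {
            let (name, ty) = part.trim().split_once(':')?;
            let ty = match ty.trim() {
                "string" => FieldType::Str,
                "u16" => FieldType::U16,
                _ => return None,
            };
            fields.insert(name.trim().to_string(), ty);
        }
        Some(RuntimeSchema { fields })
    }

    /// Check that `field` exists and that `value` has the declared type.
    fn accepts(&self, field: &str, value: &Value) -> bool {
        matches!(
            (self.fields.get(field), value),
            (Some(FieldType::Str), Value::Str(_)) | (Some(FieldType::U16), Value::U16(_))
        )
    }
}
```

A compiled format would instead express `name` and `age` as struct fields, which is faster to access but fixed once the binary is built.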

Complete benchmark source code is available here. Suggestions for improving the quality of these benchmarks are appreciated.

NoProto Strengths

If your use case fits any of the points below, NoProto might be a good choice for your application.

  1. Flexible At Runtime
    If you need to work with data types that will change or be created at runtime, you normally have to pick something like JSON, since highly optimized formats like Flatbuffers and Bincode depend on compiling the data types into your application (making everything fixed at runtime). When it comes to formats that can change/implement data types at runtime, NoProto is the fastest format we're aware of (if you know of one that might be faster, let us know!).

  2. Safely Accept Untrusted Data
    The worst case failure mode for NoProto buffers is junk data. While other formats can cause denial of service attacks or allow unsafe memory access, there is no such failure case with NoProto. There is no way to construct a NoProto buffer that would cause any detriment in performance to the host application or lead to unsafe memory access. Also, there is no panic-causing code in the library, meaning it will never crash your application.

  3. Extremely Fast Updates
    If you have a workflow in your application that is read -> modify -> write with buffers, NoProto will usually outperform every other format, including Bincode and Flatbuffers. This is because NoProto never actually deserializes; it doesn't need to. This includes complicated mutations like pushing a value onto a nested list or replacing entire structs.

  4. All Fields Optional, Insert/Update In Any Order
    Many formats require that all values be present to close the buffer. Further, they may require data to be inserted in a specific order to accommodate the encoding/decoding scheme. With NoProto, all fields are optional and any update/insert can happen in any order.

  5. Incremental Deserializing
    You only pay for the fields you read, no more. There is no deserializing step in NoProto; opening a buffer performs no operations. Once you start asking for fields, the library will navigate the buffer using the format rules to get just what you asked for and nothing else. If you have a workflow in your application where you read a buffer and only grab a few fields inside it, NoProto will outperform most other libraries.

  6. Bytewise Sorting
    Almost all of NoProto's data types are designed to serialize into bytewise sortable values, including signed integers. When used with Tuples, making database keys with compound sorting is extremely easy. When you combine that with first-class support for UUIDs and ULIDs, NoProto makes an excellent tool for parsing and creating primary keys for databases like RocksDB, LevelDB and TiKV.

  7. no_std Support
    If you need a serialization format with low memory usage that works in no_std environments, NoProto is one of the few good choices.

  8. Stable
    NoProto will never cause a panic in your application. It has zero panics or unwraps, meaning there is no code path that could lead to a panic. Fallback behavior is to provide a sane default path or bubble an error up to the caller.

  9. CPU Independent
    All numbers and pointers in NoProto buffers are always stored in big endian, so you can safely create buffers on any CPU architecture and know that they will work with any other CPU architecture.
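
Several of the points above (safe handling of untrusted bytes, incremental reads, and big-endian pointers) can be illustrated together with a toy buffer. In this sketch the buffer starts with one big-endian u32 pointer per field, a pointer of 0 means the field is unset, and each value is a u16 length followed by its bytes. This layout and every function name are assumptions for illustration only; NoProto's real format is described in its Data & Schema Format documentation.

```rust
/// Read a big-endian u32 at `pos`, or None if the buffer is too short.
fn read_u32_be(buf: &[u8], pos: usize) -> Option<u32> {
    let bytes = buf.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Fetch one field by following its pointer. Other fields are never touched,
/// and any out-of-range pointer or length yields None instead of a panic.
fn read_field(buf: &[u8], index: usize) -> Option<&[u8]> {
    let ptr = read_u32_be(buf, index.checked_mul(4)?)? as usize;
    if ptr == 0 {
        return None; // unset optional field
    }
    let len_bytes = buf.get(ptr..ptr.checked_add(2)?)?;
    let len = u16::from_be_bytes([len_bytes[0], len_bytes[1]]) as usize;
    let start = ptr + 2; // cannot overflow: checked_add(2) succeeded above
    buf.get(start..start.checked_add(len)?)
}

/// Build a toy buffer: a pointer table followed by length-prefixed values.
/// Assumes each value is at most u16::MAX bytes long.
fn build(fields: &[Option<&[u8]>]) -> Vec<u8> {
    let mut buf = vec![0u8; fields.len() * 4];
    for (i, field) in fields.iter().enumerate() {
        if let Some(value) = field {
            let ptr = buf.len() as u32;
            buf[i * 4..i * 4 + 4].copy_from_slice(&ptr.to_be_bytes());
            buf.extend_from_slice(&(value.len() as u16).to_be_bytes());
            buf.extend_from_slice(value);
        }
    }
    buf
}
```

Because every multi-byte number goes through `to_be_bytes` and `from_be_bytes`, the same bytes decode identically on little-endian and big-endian hosts, and a corrupted pointer produces a missing value rather than an out-of-bounds read.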

When to use Flatbuffers / Bincode / Cap'n Proto

If you can safely compile all your data types into your application, all the buffers/data are trusted, and you don't intend to mutate buffers after they're created, Bincode/Flatbuffers/Cap'n Proto is a better choice for you.

When to use JSON / BSON / MessagePack

If your data changes so often that schemas don't really make sense, or the format you use must be self-describing, JSON/BSON/MessagePack is a better choice. Although I'd argue that if you can make schemas work, you should. Once you can use a format with schemas, you save a ton of space in the resulting buffers and performance is far better.

Limitations

Unsafe

This library makes use of unsafe to get better performance. Generally speaking, it's not possible to have a high performance serialization library without unsafe. It is only used where the performance improvements are significant, and additional checks are performed so that the worst case for any unsafe block is junk data in a buffer.
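
The general pattern described here, a cheap explicit check followed by an unchecked access, can be sketched as below. This is an illustration of the pattern using a hypothetical helper, not code taken from the library, and the README does not show which operations NoProto actually performs in its unsafe blocks.

```rust
/// Read a big-endian u16 at `pos`, skipping the per-index bounds checks
/// after a single explicit range check.
fn read_u16_be(buf: &[u8], pos: usize) -> Option<u16> {
    let end = pos.checked_add(2)?;
    if end > buf.len() {
        return None; // out of range: report "no value" instead of reading
    }
    // SAFETY: `pos` and `pos + 1` are both less than `end`, and
    // `end <= buf.len()` was verified above, so both indices are in bounds.
    let (hi, lo) = unsafe { (*buf.get_unchecked(pos), *buf.get_unchecked(pos + 1)) };
    Some(u16::from_be_bytes([hi, lo]))
}
```

If the bytes at that position are not what the caller expected, the result is a wrong number, never an out-of-bounds read.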


MIT License

Copyright (c) 2021 Scott Lott

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.