only-cliches / NoProto

Flexible, Fast & Compact Serialization with RPC
MIT License

Inaccurate claims in readme about competing formats #15

Open kentonv opened 3 years ago

kentonv commented 3 years ago

In the readme, in comparisons against each of Avro, Protobuf, JSON, and FlatBuffers, each comparison lists "Safely handle untrusted data" as a benefit of NoProto. I can't speak for Avro or FlatBuffers, but Protobuf is regularly used to handle untrusted data, and JSON is obviously used by basically everyone to handle untrusted data. So, these bullet points seem incorrect.

Later on, the readme says this:

When to use Flatbuffers / Bincode / CapN Proto

If you can safely compile all your data types into your application, all the buffers/data is trusted, and you don't intend to mutate buffers after they're created, Bincode/Flatbuffers/CapNProto is a better choice for you.

I can't speak for FlatBuffers or Bincode, but Cap'n Proto is explicitly designed for and heavily used in sandboxing scenarios involving untrusted data. In particular, Cap'n Proto was originally built to serve as the protocol that Sandstorm.io uses to communicate with untrusted apps running in secure sandboxes. It is believed to be secure for such use cases.

The statement also seems to imply that Cap'n Proto can only operate on schemas known at compile time. This is incorrect: Cap'n Proto features a mechanism for loading schemas dynamically and using them to operate on types with a reflection-like API. Incidentally, Protobuf features a very similar API. Practical experience has shown that these features are almost never used: people almost always know their schemas at compile time, and the benefits of compile-time type checking and optimizations are large. But dynamic schema loading can be useful in some niche scenarios.

Finally, regarding "mutation": Cap'n Proto actually supports mutation, with caveats. Fixed-width fields can be mutated in-place, but variable-width data will require new allocations which are always added to the end of the message. Removing an object will leave a zero'd "hole" in the data, which can only be reclaimed by making a full-tree copy. In principle, a Cap'n Proto implementation could use a complex memory allocation algorithm that is able to reuse holes in-place for later allocations of same or smaller size, but at present no implementation bothers to attempt this. This problem is generally fundamental to zero-copy formats, due to the need to allocate contiguous memory for each message. If NoProto has solved it somehow, it would be great to see some explanation of how; I couldn't immediately find it in the docs.
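The tradeoff described above can be sketched with a toy arena. This is an illustrative sketch of the general zero-copy problem, not Cap'n Proto's (or NoProto's) actual wire layout: growing a variable-width value appends at the end of the buffer and leaves a zeroed hole behind.

```rust
// Toy flat-buffer arena illustrating the zero-copy mutation tradeoff:
// a value that fits is overwritten in place, but growing a
// variable-width value appends at the end and leaves a zeroed hole.
struct Arena {
    buf: Vec<u8>,
}

impl Arena {
    fn new() -> Self {
        Arena { buf: Vec::new() }
    }

    /// Allocate at the end of the buffer; returns the value's offset.
    fn alloc(&mut self, data: &[u8]) -> usize {
        let off = self.buf.len();
        self.buf.extend_from_slice(data);
        off
    }

    /// Replace a value: overwrite in place if the new value fits,
    /// otherwise zero the old bytes (the "hole") and append at the end.
    /// The hole can only be reclaimed by a full copy of the message.
    fn replace(&mut self, off: usize, old_len: usize, data: &[u8]) -> usize {
        if data.len() <= old_len {
            self.buf[off..off + data.len()].copy_from_slice(data);
            off
        } else {
            for b in &mut self.buf[off..off + old_len] {
                *b = 0; // zeroed hole left behind
            }
            self.alloc(data)
        }
    }
}

fn main() {
    let mut a = Arena::new();
    let off = a.alloc(b"hello");
    let new_off = a.replace(off, 5, b"hello, world");
    assert_eq!(new_off, 5); // new value appended after the old one
    assert_eq!(&a.buf[0..5], &[0u8; 5][..]); // old bytes are now a hole
    assert_eq!(a.buf.len(), 5 + 12); // hole + new value
}
```

A "compaction" pass is then just a deep copy into a fresh arena, which is why reclaiming holes requires a full-tree copy.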

only-cliches commented 3 years ago

Thank you so much for reaching out and clarifying all of these points. I will absolutely update the readme in the near future to reflect the correct information.

Regarding safety: I wasn't able to find much documentation on the other formats that went into detail about untrusted inputs, and I looked. You seem to have more experience in this area than I do, so I'll drop the "untrusted inputs" points from those comparisons.

Regarding runtime schemas: can you share the Cap'n Proto & Protocol Buffers docs on those? I wasn't able to find any Rust libraries that support runtime schemas for any of the formats I listed as compile-time only. Maybe I should clarify in the readme that no Rust libraries support dynamic schemas for these formats, even though the formats themselves support it and it could theoretically be done.

Regarding mutation: similar to the schema situation, I wasn't able to find any documentation on Protocol Buffers or Cap'n Proto about mutating data in place. The few places I did find said it was poorly supported and best avoided; the general advice is to make a full-tree copy and apply the mutation during the copy. I'd love to see the docs from Cap'n Proto and Protocol Buffers that describe how to perform mutations on existing buffers. This may be another situation where it can technically be done, but no Rust libraries will do it.

I'm curious how Cap'n Proto handles mutation when varints are used as addresses. This was one of the big tradeoffs I made with this library: there are no varints, just u32s for addresses. In Cap'n Proto, if someone performs a mutation that allocates at an address larger than an existing varint address in the buffer can represent, how is that handled? You'd need to shift everything after the updated address by one or more bytes, or add another level of indirection.

I'm sure you won't be surprised by how NoProto handles fixed-width object updates. As for variable-width values: if the previous value was larger, the existing space is reused. If not, a new allocation is made at the end of the buffer and a "hole" appears in the data. But since all addresses are u32 and the list/map types are built as linked lists inside the buffer, many mutations can be performed without creating a hole at all.
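As a rough sketch of the linked-list idea: appending a node only writes new bytes at the end of the buffer and patches one fixed-width u32 pointer in place, so no hole is created. The node layout here ([next: u32][len: u32][bytes]) is hypothetical, purely for illustration, not NoProto's real encoding.

```rust
// Sketch of a list stored as a linked list inside a flat byte buffer
// addressed by u32 offsets. Hypothetical node layout:
// [next: u32][len: u32][bytes].
const NIL: u32 = u32::MAX;

struct ListBuf {
    buf: Vec<u8>,
    head: u32,
}

impl ListBuf {
    fn new() -> Self {
        ListBuf { buf: Vec::new(), head: NIL }
    }

    fn read_u32(&self, off: usize) -> u32 {
        u32::from_le_bytes([
            self.buf[off], self.buf[off + 1],
            self.buf[off + 2], self.buf[off + 3],
        ])
    }

    fn write_u32(&mut self, off: usize, v: u32) {
        self.buf[off..off + 4].copy_from_slice(&v.to_le_bytes());
    }

    /// Append: the new node goes at the end of the buffer, and splicing
    /// it in rewrites a single fixed-width u32 pointer in place.
    fn push(&mut self, value: &[u8]) {
        let node = self.buf.len() as u32;
        self.buf.extend_from_slice(&NIL.to_le_bytes()); // next pointer
        self.buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(value);
        if self.head == NIL {
            self.head = node;
        } else {
            // walk to the tail, then patch its next pointer in place
            let mut cur = self.head;
            loop {
                let next = self.read_u32(cur as usize);
                if next == NIL {
                    break;
                }
                cur = next;
            }
            self.write_u32(cur as usize, node);
        }
    }

    /// Walk the list from the head, following next pointers.
    fn values(&self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while cur != NIL {
            let off = cur as usize;
            let len = self.read_u32(off + 4) as usize;
            out.push(self.buf[off + 8..off + 8 + len].to_vec());
            cur = self.read_u32(off);
        }
        out
    }
}

fn main() {
    let mut l = ListBuf::new();
    l.push(b"a");
    l.push(b"bb");
    l.push(b"ccc");
    assert_eq!(
        l.values(),
        vec![b"a".to_vec(), b"bb".to_vec(), b"ccc".to_vec()]
    );
}
```

Deleting a node in this scheme would still orphan its bytes, but the splice itself again only touches one fixed-width pointer.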

You spoke about needing to perform a full-tree copy to reclaim space. Since that is a natural byproduct of mutation, I found it really frustrating that none of the other formats have first-class support to automate the process. Writing a "deep copy" method for every buffer type would be burdensome to the developer using the library, especially since that code would need to be updated with every schema change. It's one of the big things I felt was missing from other formats when I created NoProto: there is a compaction function that performs a full-tree copy automatically for any buffer: https://docs.rs/no_proto/0.9.60/no_proto/buffer/struct.NP_Buffer.html#method.compact

Again thank you for creating this issue and informing me about the mistakes in the ReadMe.

kentonv commented 3 years ago

I wasn't able to find much documentation on the other formats that went into detail about untrusted inputs, and I looked.

I think for many formats this is just sort of assumed -- it's more common to document when it isn't the case. That said, I agree this should be made clearer in all docs. I've seen some cases where implementations were insecure and failed to say so; they usually get a lot of heat for that.

Regarding Runtime Schemas: can you share the CapNProto & Protocol Buffer docs with me on those?

I am not sure about the Rust implementations specifically. I'm more familiar with C++.

Cap'n Proto has Dynamic Reflection and SchemaLoader which loads schemas in schema.capnp format (which can be output by the capnp compiler).

Protobuf has, analogously, DynamicMessage, DescriptorPool and descriptor.proto.

Since, as I mentioned, these features aren't very commonly used, it's typical for some language implementations to omit them. It's possible the Rust implementations haven't implemented these.

Regarding Mutation: Similar to the schema situation, I wasn't able to find any documentation on Protocol Buffers or CapN Proto regarding mutating data in place. The few places I did find it said it was poorly supported and to avoid it.

It's true, Cap'n Proto does recommend against it, for exactly the reasons I mentioned. The caveats make it very quirky.

Protobuf does not support direct mutation of serialized data, but it's not a zero-copy format so this is unsurprising. To mutate a Protobuf you have to parse the whole message and modify the in-memory data structures (which are very much designed for it).

I'm curious how CapN Proto handles mutation with varints being used as addresses.

Cap'n Proto does not use varints; you must be thinking of a different format. Pointers are strictly 64-bit. Roughly speaking, half of the pointer is limited type information (specifying the "shape" of the destination object, just enough detail to perform a deep copy), and the other half is the relative address (an offset from the pointer itself). There is also a notion of "far pointers" which are used to point into different segments. In this case, the initial 64-bit pointer actually points to a small landing-pad structure of one or two additional words.
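For reference, the 64-bit struct pointer described above can be unpacked as follows, per the published Cap'n Proto encoding spec. This is a standalone sketch that doesn't use any Cap'n Proto library:

```rust
// Decoding a Cap'n Proto struct pointer, per the published encoding spec:
// bits 0-1   = kind tag (0 for struct pointers),
// bits 2-31  = signed offset, in 8-byte words, from the end of the
//              pointer word to the start of the struct's data section,
// bits 32-47 = data section size in words,
// bits 48-63 = pointer section size in words.
fn decode_struct_pointer(word: u64) -> Option<(i32, u16, u16)> {
    if word & 0b11 != 0 {
        return None; // kind tag != 0: not a struct pointer
    }
    let offset = (word as u32 as i32) >> 2; // arithmetic shift keeps the sign
    let data_words = (word >> 32) as u16;
    let ptr_words = (word >> 48) as u16;
    Some((offset, data_words, ptr_words))
}

fn main() {
    // offset of 2 words, 1 data word, 1 pointer word
    let word = (2u64 << 2) | (1u64 << 32) | (1u64 << 48);
    assert_eq!(decode_struct_pointer(word), Some((2, 1, 1)));
    // offsets are signed: a pointer can point backwards in the buffer
    assert_eq!(decode_struct_pointer(0xFFFF_FFFCu64), Some((-1, 0, 0)));
}
```

Because the offset is a fixed-width 30-bit field rather than a varint, updating a pointer never changes the pointer's own size, which sidesteps the shifting problem raised in the question above.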

I'm sure you won't be surprised by how NoProto handles fixed width object updates. As for variable width, if the previous value was larger the existing space is used. If not, a new memory allocation is created on the end and the "hole" in the data happens.

This sounds exactly the same as Cap'n Proto, then.

lists/map type are created as a linked list in the buffer, many mutations can be performed without creating a hole in the data.

True, using a linked list would mean you can append to the list without creating holes. Presumably deleting from the list still creates holes though? In Cap'n Proto, you can of course manually create a linked list but it's tedious for the app.

Writing a "deep copy" method for every buffer type would be burdensome to the developer using the libraries, especially since that code would need to be updated with schema changes.

Both Cap'n Proto and Protobuf support deep-copying.

For Protobuf, of course, this operates on the in-memory objects, not directly on the encoded messages. But you definitely don't need to write your own code to get/set each field; the compiler generates one for you. In C++, the method is called CopyFrom().

In Cap'n Proto, the wire format is designed to provide just enough type information so that a deep copy can be performed without knowing the schema. For pointer types, the compiler generates setter methods that can accept as input an instance of that type from a different message buffer; it then deep-copies it.

(Again, I'm only familiar with the C++ implementations, but in this case I would be very surprised if the Rust implementations didn't also support this. It's a pretty important feature.)

tilpner commented 3 years ago

The 16MB size limit claim for BSON sounds odd. Although I haven't used it, the spec describes a straightforward tagged document format, with nothing in it that would explain such a limit. Perhaps the limit was taken from the MongoDB documentation, and is particular to that implementation?