p2panda / website

Official p2panda website
https://p2panda.org
Creative Commons Attribution Share Alike 4.0 International

Valuable Value (VV) as operation encoding format instead of CBOR #201

Closed adzialocha closed 9 months ago

adzialocha commented 2 years ago

Use Valuable Value as a new encoding format for p2panda operations instead of CBOR.

CBOR

Con

CBOR is a great encoding format but it comes with a couple of issues which are a problem for p2p protocols where deterministic hashing is a thing:

  1. https://github.com/p2panda/p2panda/issues/399 - CBOR has a "canonic format" which encodes and decodes CBOR maps only with ordered keys, plus other rules (float representation, smallest integer representation, ..), but it seems like no one has implemented it, or only parts of it
  2. https://github.com/p2panda/p2panda/issues/395 - CBOR has a "strict format" which forbids duplicate keys in maps, but it seems only minicbor takes care of that

The first point is not breaking anything for us - but it's a bummer. Also, it means that if we start having something like test vectors (as we already do with send-to-node), they might not work across different p2panda implementations, as the hashes will come out differently.
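To make the key-ordering problem concrete, here is a minimal sketch of a hand-rolled CBOR encoder for tiny maps. It is not taken from ciborium or any other crate; the function names are made up for illustration, and it only covers short text keys and integers below 24. It shows that the same logical map yields different bytes (and therefore different hashes) depending on key order, unless the encoder sorts the keys first.

```rust
/// Encode a text string shorter than 24 bytes (CBOR major type 3).
fn encode_text(s: &str, out: &mut Vec<u8>) {
    assert!(s.len() < 24, "sketch only handles short strings");
    out.push(0x60 | s.len() as u8);
    out.extend_from_slice(s.as_bytes());
}

/// Encode an unsigned integer below 24 (CBOR major type 0).
fn encode_uint(n: u8, out: &mut Vec<u8>) {
    assert!(n < 24, "sketch only handles small integers");
    out.push(n);
}

/// Encode the entries in exactly the order given (CBOR major type 5).
fn encode_map(entries: &[(&str, u8)]) -> Vec<u8> {
    assert!(entries.len() < 24, "sketch only handles small maps");
    let mut out = vec![0xa0 | entries.len() as u8];
    for (key, value) in entries {
        encode_text(key, &mut out);
        encode_uint(*value, &mut out);
    }
    out
}

/// Helper: the encoded bytes of a key, used as the sort key.
fn encoded_key(key: &str) -> Vec<u8> {
    let mut out = Vec::new();
    encode_text(key, &mut out);
    out
}

/// Sort entries by their encoded key bytes before encoding,
/// so every writer produces the same bytes for the same map.
fn encode_map_sorted(entries: &[(&str, u8)]) -> Vec<u8> {
    let mut sorted = entries.to_vec();
    sorted.sort_by_key(|(key, _)| encoded_key(key));
    encode_map(&sorted)
}
```

Calling `encode_map` with `[("name", 1), ("id", 2)]` and with `[("id", 2), ("name", 1)]` gives two different byte strings, while `encode_map_sorted` gives the same one for both. A decoder that wants deterministic hashes would additionally have to reject input whose keys are not in that order.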

The second point is horrible and needs to be fixed if we want to stay with CBOR because it will otherwise lead to different operation values being considered during materialization if an operation contains duplicate map keys.
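A small sketch of why duplicate keys are dangerous, and what a post-decoding check could look like. It assumes the decoder hands us the raw key-value pairs in wire order; the function names are hypothetical and not part of any CBOR crate.

```rust
use std::collections::BTreeMap;

/// A lenient decoder might keep the first occurrence of a key ...
fn first_wins(pairs: Vec<(String, i64)>) -> BTreeMap<String, i64> {
    let mut map = BTreeMap::new();
    for (key, value) in pairs {
        map.entry(key).or_insert(value);
    }
    map
}

/// ... while another one keeps the last occurrence.
fn last_wins(pairs: Vec<(String, i64)>) -> BTreeMap<String, i64> {
    let mut map = BTreeMap::new();
    for (key, value) in pairs {
        map.insert(key, value);
    }
    map
}

/// The strict variant rejects the input instead of silently picking one value.
fn strict_map(pairs: Vec<(String, i64)>) -> Result<BTreeMap<String, i64>, String> {
    let mut map = BTreeMap::new();
    for (key, value) in pairs {
        if map.contains_key(&key) {
            return Err(format!("duplicate map key: {}", key));
        }
        map.insert(key, value);
    }
    Ok(map)
}
```

For the pairs `("name", 1), ("name", 2)` the two lenient variants materialize different values from the same bytes, which is exactly the divergence described above; `strict_map` returns an error instead.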

In any case, we will not be able to fulfill the canonic format of CBOR, as it would be too much work to implement a decoder ourselves. This basically means: We cannot make sure that encoded p2panda operations will result in the same hash across implementations with different frameworks and programming languages, and that might be true for the future as well ..

So, the general situation is: We have to work around the lack of a good CBOR implementation by implementing some checks after decoding and basically forget about deterministic hashing across implementations when sticking with CBOR.

Pro

It's a very widely adopted format with implementations across different platforms. Developers will have an easier time pulling in a CBOR package to start implementing p2panda in their languages.

We get CDDL!

We already have it ..

VV

Con

VV is obviously not widely adopted, so if it's not available in a certain programming language people will need to implement it themselves first .. we can at least make sure that we provide the Rust implementation (there is one already) and TypeScript exports inside of p2panda-js for it.

The Rust implementation still requires nightly and doesn't support the canonic format yet, both can be implemented and changed by @AljoschaMeyer though.

With VV we also lose support for CDDL (schema validation for CBOR), which means implementing schema checks ourselves. That might be an opportunity though, as a) we don't like the CDDL error responses anyhow and could use this as a chance to get really nice ones, and b) it's not too complicated, as we have very simple types so far which are very permissive. The downside is still that, again, developers would need to implement the schema validation themselves ..

Pro

There are almost no options for a deterministic encoding format for p2p protocols .. there is bencode and https://github.com/diem/bcs but both do not support floats. Maybe some "pioneer work" is required? :-D

With VV we definitely have a future-proof path towards a canonic format which is as efficient as CBOR but much simpler and stricter. Through this we will be able to get deterministic hashes with almost no additions for p2panda (except for checking the ordering of arrays, but this is required in any case as it is part of the p2panda specification). We might not get it instantly (it still needs implementation) but there is a path (starting with the compact format first and then upgrading to canonic later).

The specification is easy to read and much simpler than CBOR as it doesn't try to be the format for everything. There is already an implementation in Rust: https://docs.rs/valuable_value/latest/valuable_value/.

VV comes in a human-readable (.vv) and machine-parseable "compact" format (.cvv), that could be interesting for storing operations in files. Of course we can already do this with serde and some intermediary structs in send-to-node but having the option for a native decoder for the human readable version is nice (you throw in a .vv file and it natively directly encodes to the compact format).

We might lose CDDL but there is a future with VDL: https://github.com/AljoschaMeyer/vdl

Speed speed speed! It's probably faster to use VV + our own regular checks + slim schema checks than CDDL + CBOR + workaround checks due to the limits of CBOR + our own regular checks.

The used bytes are similar or even the same in VV in comparison to CBOR, so not more data, but also no slimmer operation sizes.

Implementation

Throwing VV into the mix will be easy as it is just swapping out the ciborium bits with the valuable-value crate. The harder part is replacing the cbor! helper macros we use here and there with something else, but even that's doable / a straightforward refactoring.

The largest amount of work will be writing the schema validation logic which doesn't need to be so big though: 1. We check if the fields exist 2. Compare the VV type with the claimed one from the schema 3. Do a regex when it is a relation.
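The three steps could look roughly like this. Everything here is a sketch under assumptions: `Value` is a hypothetical stand-in for a decoded VV value (not the type from the valuable_value crate), the set of field types is illustrative, and the relation check assumes a relation is a lowercase hex string of 68 characters, done by hand instead of with a regex so the example has no dependencies.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a decoded value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
}

/// Field types a schema can claim (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Bool,
    Int,
    Float,
    Str,
    Relation,
}

/// Assumption: a relation is a lowercase hex string of 68 characters.
fn is_relation(s: &str) -> bool {
    s.len() == 68 && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Validate the dynamic "fields" part of an operation against its schema.
/// Fields not mentioned in the schema are ignored in this sketch.
fn validate(
    schema: &BTreeMap<String, FieldType>,
    fields: &BTreeMap<String, Value>,
) -> Result<(), String> {
    for (name, expected) in schema {
        // 1. Check that the field exists.
        let value = fields
            .get(name)
            .ok_or_else(|| format!("missing field: {}", name))?;
        // 2. Compare the value type with the one claimed by the schema.
        // 3. For relations, additionally check the string pattern.
        let ok = match (expected, value) {
            (FieldType::Bool, Value::Bool(_)) => true,
            (FieldType::Int, Value::Int(_)) => true,
            (FieldType::Float, Value::Float(_)) => true,
            (FieldType::Str, Value::Str(_)) => true,
            (FieldType::Relation, Value::Str(s)) => is_relation(s),
            _ => false,
        };
        if !ok {
            return Err(format!("field {} does not match type {:?}", name, expected));
        }
    }
    Ok(())
}
```

Returning a plain `String` error keeps the sketch short; a real implementation would probably want a proper error type, which is also where the nicer error responses mentioned above could come from.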

Since we haven't integrated schema validation yet, this might be good timing to avoid larger refactorings.

AljoschaMeyer commented 2 years ago

[Pro CBOR]: It's a very widely adopted format with implementations across different platforms. Developers will have an easier time pulling in a CBOR package to start implementing p2panda in their languages.

If they can actually simply pull something in. SSB is in a similar situation: it's "just JSON", yet most re-implementation attempts fizzle out when people realize that they have to handwrite their own parser regardless.

there is bencode and https://github.com/diem/bcs but both do not support floats.

FYI: bcs is not schemaless, so you couldn't use it for user data.

The specification is easy to read

<3

having the option for a native decoder for the human readable version is nice (you throw in a .vv file and it natively directly encodes to the compact format)

That's definitely a CLI utility I'd like to exist anyways. I've put this off so far because I wanted to wait until I've defined VVM - valuable values with a module system. The basic job of a VVM implementation is to read a VVM file, which allows assigning names to VVs and defining them in terms of each other, and output a "resolved" VV (in the encoding of the user's choice). Converting between different encodings would be a special case of that. We should definitely chat about my module system plans on Friday =)

Since we haven't integrated schema validation yet

I'm happy to be part of VDL implementation efforts. Unlike VV, it might be important to wait for VDLM (VDL + module system, names, recursion, generics, all the good stuff) for this. Then again, using VDL first and upgrading later might be possible. VDLM is needed for specifying the types of documents that refer to other documents with (possibly mutually) recursive types however.

The Rust implementation still requires nightly and doesn't support the canonic format yet,

We should definitely chat about the possibility of having a simple and an efficient canonic encoding, and about which one p2panda would prefer to use.

adzialocha commented 2 years ago

I'm happy to be part of VDL implementation efforts. Unlike VV, it might be important to wait for VDLM (VDL + module system, names, recursion, generics, all the good stuff) for this. Then again, using VDL first and upgrading later might be possible. VDLM is needed for specifying the types of documents that refer to other documents with (possibly mutually) recursive types however.

Super, that's really nice! For now I can imagine we will be very fine with implementing the schema checks ourselves as it is not complex. In this sense it's not a priority. But with VDL there could be a nice future path to a p2panda specification where we supply the VDL files to the community so they can pull these schema files into their code bases.

The Rust implementation still requires nightly and doesn't support the canonic format yet,

We should definitely chat about the possibility of having a simple and an efficient canonic encoding, and about which one p2panda would prefer to use.

Cool, I have some thoughts, will write them in the relevant issue :+1:

AljoschaMeyer commented 2 years ago

Super, that's really nice! For now I can imagine we will be very fine with implementing the schema checks ourselves as it is not complex. In this sense it's not a priority. But with VDL there could be a nice future path to a p2panda specification where we supply the VDL files to the community so they can pull these schema files into their code bases.

I think I am misunderstanding something: I assumed that users would provide CDDL/VDM definitions for their custom document types. For that it would be necessary to dynamically handle those definitions. What you just wrote looks like you are using CDDL/VDM for validating types that are already statically known to you, i.e., those for which I am trying to talk you out of using any schemaless encoding in the first place. Which of these is the actual case (or both)? Could you clarify that for me please?

adzialocha commented 2 years ago

Super, that's really nice! For now I can imagine we will be very fine with implementing the schema checks ourselves as it is not complex. In this sense it's not a priority. But with VDL there could be a nice future path to a p2panda specification where we supply the VDL files to the community so they can pull these schema files into their code bases.

I think I am misunderstanding something: I assumed that users would provide CDDL/VDM definitions for their custom document types. For that it would be necessary to dynamically handle those definitions. What you just wrote looks like you are using CDDL/VDM for validating types that are already statically known to you, i.e., those for which I am trying to talk you out of using any schemaless encoding in the first place. Which of these is the actual case (or both)? Could you clarify that for me please?

Yeah, that part is not easy to understand as there is also quite a bit of overlap in terminology. First of all, there are two chapters in our (WIP) handbook which might help:

  1. Operations https://p2panda.org/handbook/docs/writing-data/operations
  2. Schemas https://p2panda.org/handbook/docs/writing-data/schemas

The general thing is: Operations are Bamboo entry payloads which describe a CRDT data type. Multiple operations can form a (multi-writer) graph and together can be resolved into documents.

Operations have two areas: The "header" format and the "fields" format. The first is the "meta data" we need to have a CRDT .. it defines the version, the hashes of the previous_operations forming the graph and the schema .. which is the document view id of the corresponding schema document (see below).

Developers can publish operations using a schema of a special "schema definition" kind, which describe "Schema documents". Nodes will materialize the schema documents automatically. These documents can now be used to validate any incoming operations which claim that schema. So we have two validation steps: 1. Check that the general operation header format is correct 2. Check that the claimed schema is fulfilled.

Here is a walk-through:

1. Create a "schema_field_definition_v1" operation defining a field named "name". This will result in a schema field definition document.
2. Create a "schema_definition_v1" operation defining a schema named "venues" which relates to that just created schema field definition document. This will result in a schema definition document.
3. The node will merge these two documents into the "venues" schema with a field named "name".
4. Everyone else can now create an "application" operation using the "venues" schema, where they write data à la:

{
  action: "create",
  version: 1,
  schema: "venues",
  fields: {
    name: "REDACTED" // edit by Aljoscha: careful, names have power
  }
}

adzialocha commented 2 years ago

The "header" format is the statically known part of the operation, the "fields" format is the dynamic one :+1:
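As a rough illustration of that split, an operation could be modelled like this. The struct and enum names are hypothetical, the `Update` and `Delete` actions and the rule that only a create operation has no previous_operations are assumptions for the sketch, and the handbook chapters linked above remain the reference.

```rust
use std::collections::BTreeMap;

/// Dynamic field values (illustrative subset).
#[derive(Debug, Clone, PartialEq)]
enum FieldValue {
    Str(String),
    Int(i64),
    Bool(bool),
}

/// Assumed set of actions; only "create" appears in the example above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Create,
    Update,
    Delete,
}

#[derive(Debug, Clone, PartialEq)]
struct Operation {
    // "Header": statically known, the same shape for every operation.
    action: Action,
    version: u64,
    schema: String,
    previous_operations: Vec<String>,
    // "Fields": dynamic, the shape is defined by the claimed schema.
    fields: BTreeMap<String, FieldValue>,
}

/// Validation step 1: check the statically known header.
/// Step 2 (checking `fields` against the claimed schema) needs the
/// materialized schema document and is not part of this sketch.
fn check_header(op: &Operation) -> Result<(), String> {
    if op.version != 1 {
        return Err(format!("unsupported version: {}", op.version));
    }
    if op.schema.is_empty() {
        return Err("schema must not be empty".to_string());
    }
    match (op.action, op.previous_operations.is_empty()) {
        (Action::Create, false) => Err("create must not have previous_operations".to_string()),
        (Action::Update, true) | (Action::Delete, true) => {
            Err("update and delete need previous_operations".to_string())
        }
        _ => Ok(()),
    }
}
```

Because the header has a fixed shape, it can be checked with ordinary typed code like this, while only the `fields` map needs the schema-driven checks.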

cafca commented 2 years ago

The reason why bencode doesn't support floats is actually this drive-by comment on a mailing-list? :o

https://lists.ibiblio.org/pipermail/bittorrent/2004-August/001040.html

Because I didn't find any other discussion of the issue. Astonishing.

cafca commented 2 years ago

What about Cap'n Proto? I don't understand whether it's possible to use it for encoding varying schemas as we do.

AljoschaMeyer commented 2 years ago

The reason why bencode doesn't support floats is actually this drive-by comment on a mailing-list? :o

Yeah, it took me a long time to settle on restricting VV to a single NaN value so that it makes sense for equality to be an equivalence relation. There's just no great solution, at some point you have to settle on how to deal with the fact that the IEEE 754 authors didn't care about algebraic laws. I completely respect not having floats to avoid that problem; I strongly considered going that way myself.

Rust has the PartialEq trait only because of float weirdness I imagine, I haven't seen the concept of equivalence relations without reflexivity anywhere else.
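The float problem in a few lines of Rust, using only the standard library. The function names are made up for this sketch and it is not how the valuable_value crate implements it; it only illustrates the idea of collapsing all NaNs into one value so that equality becomes reflexive again.

```rust
/// Collapse every NaN bit pattern into the one canonical NaN.
fn canonical_bits(x: f64) -> u64 {
    if x.is_nan() {
        f64::NAN.to_bits()
    } else {
        x.to_bits()
    }
}

/// Equality on the canonical bit pattern. Unlike `==` on f64 this is
/// reflexive, so it is a proper equivalence relation.
/// Note: comparing bits treats 0.0 and -0.0 as different values,
/// which is a choice of this sketch, not a statement about VV.
fn canonical_eq(a: f64, b: f64) -> bool {
    canonical_bits(a) == canonical_bits(b)
}
```

With IEEE 754 semantics a NaN is never `==` to itself, which is why `f64` only implements `PartialEq` and not `Eq`. `canonical_eq` returns true for any two NaNs, including ones with different payload bits, so hashing `canonical_bits` gives the same result for every NaN.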

cafca commented 2 years ago

What about Cap'n Proto? I don't understand whether it's possible to use it for encoding varying schemas as we do.

So I looked into it and it is not. But I am very excited about this other thing I found, no_proto, which is a zero-copy deserialization format and it even has wasm support! And f64! It also has support for schemas, even dynamic schemas.

adzialocha commented 2 years ago

What about Cap'n Proto? I don't understand whether it's possible to use it for encoding varying schemas as we do.

So I looked into it and it is not. But I am very excited about this other thing I found, no_proto, which is a zero-copy deserialization format and it even has wasm support! And f64! It also has support for schemas, even dynamic schemas.

Performance-wise this sounds cool. Is there something about canonic encoding (that's the starting point of this issue) and some sort of specification others can use to implement this encoding?

Ah, this looks like it, but not really official: https://docs.rs/no_proto/latest/no_proto/format/index.html

cafca commented 2 years ago

I just had this other wild thought in the shower: What if we make the schema id part of the bamboo entry (or write it to the first bytes of the operation) and then let schemas define an encoding for their operations?

adzialocha commented 2 years ago

I just had this other wild thought in the shower: What if we make the schema id part of the bamboo entry (or write it to the first bytes of the operation) and then let schemas define an encoding for their operations?

lol, that's slightly off-topic 🤣