ipld / specs

Content-addressed, authenticated, immutable data structures

CBOR: Data model serialization in Rust #323

Open prataprc opened 3 years ago

prataprc commented 3 years ago
    .-----------------.    .------------.    .--------.
    | Serialized data |<-->| Data-model |<-->| Schema |
    ˙-----------------˙    ˙------------˙    ˙--------˙

This writeup focuses on the data-model definition in the Rust language and its (de)serialization in CBOR format. A draft implementation can be found here.

Data model

enum Kind {
    Null,
    Bool(bool),
    Integer(i128),
    Float(f64),
    Text(String),
    Bytes(Vec<u8>),
    Link(Cid),
    List(Vec<Kind>),
    Dict(BTreeMap<String, Kind>),
}

Additionally,

CBOR serialization

This section essentially carves out the subset of the CBOR specification required to achieve as much completeness and as much fittedness as possible, to say it in IPLD parlance.

A note on recursion: recursive types naturally fit a recursive encoding and decoding implementation. But this can lead to stack-overflow issues if data-model values are recursively composed of a large number of list and dict values. A similar problem exists when deserializing incoming wire-data. To avoid this, we either have to convert the recursive encoding and decoding implementation into a loop, or add a cap on the recursion depth.
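As a purely illustrative sketch of the depth-cap option (the constant, the error type, and the call pattern are assumptions, not the draft's actual code):

const MAX_DEPTH: u32 = 256;

#[derive(Debug)]
pub struct DepthExceeded;

/// Guard called at the top of every recursive decode step.
pub fn check_depth(depth: u32) -> Result<(), DepthExceeded> {
    if depth >= MAX_DEPTH {
        Err(DepthExceeded)
    } else {
        Ok(())
    }
}

// Each recursive call in the decoder would then begin with
// `check_depth(depth)?` and pass `depth + 1` when descending into a
// List or Dict value.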

Open points:

rvagg commented 3 years ago

F32, deserialization is supported, internally converted to double-precision.

Does it make sense to have a separate f32 and f64 for float-kind?

You could consider dropping everything but double precision, even on decode. Our ideal is to have just one way, although our current decoders are quite sloppy in what they expect. The JavaScript encoder currently does a smallest-possible float encode, while Go always encodes 64-bit. We're moving to the latter (as per the recent revision to the DAG-CBOR spec) and we don't anticipate there's a whole lot of DAG-CBOR in the wild with floats anyway, so hopefully this isn't a pain for anyone.

My advice on building a new DAG-CBOR implementation is to make it minimal and strict and see how far it gets you before it breaks on real-world data. I would hope that you wouldn't find any breakage and as we move our other encoders to a stricter form the odds will just improve over time.

The i128 might be overkill too? You only get 64-bits in CBOR anyway so maybe it's best to have the API make that clear?

vmx commented 3 years ago

Your Data Model aligns with the one outlined here https://github.com/ipfs-rust/rust-ipld/blob/a2b6f30631bfd45a0ffdc78a9aa7e12c2b809f1e/core/src/ipld.rs#L8. The only difference is naming the one thing String or Text. Though I kind of like the name Text, as it aligns well with how I see that type.

prataprc commented 3 years ago

https://github.com/ipfs-rust/rust-ipld/blob/a2b6f30631bfd45a0ffdc78a9aa7e12c2b809f1e/core/src/ipld.rs#L8.

@volker thanks for this reference. Actually thanks for the many references ;)

I am kind of making this quick draft implementation to understand the specs in finer detail. Other than that, nothing much.

prataprc commented 3 years ago

You could consider dropping everything but double precision, even on decode.

In situations similar to IoT, where small embedded devices are expected to communicate using DAG-CBOR, do we assume double-precision float support on those devices? It might be okay to expect low-end devices to fall short on double-precision support, but for an adequately powered computing node, ignoring single-precision might be a bad excuse. Other than this, I prefer the float-64-only design.

before it breaks on real-world data.

Do we have real-world data, say from golang, that we can work with?

The i128 might be overkill too?

Yes, I definitely feel it is overkill. Actually, thanks for bringing it up. I found this in the CBOR spec:

Major types 0 and 1 are designed in such a way that they can be encoded in C from a signed integer without actually doing an if-then-else for positive/negative (Figure 2). This uses the fact that (-1-n), the transformation for major type 1, is the same as ~n (bitwise complement) in C unsigned arithmetic; ~n can then be expressed as (-1)^n for the negative case, while 0^n leaves n unchanged for non-negative. The sign of a number can be converted to -1 for negative and 0 for non-negative (0 or positive) by arithmetic-shifting the number by one bit less than the bit length of the number (for example, by 63 for 64-bit numbers).

From the above description we understand that, if both Major-type-0 and Major-type-1 are to be used, we are encouraged to treat them as signed-64. And maybe, if only Major-type-0 is to be used, it is left to the application's choice.

Major type 1: a negative integer. The encoding follows the rules for unsigned integers (major type 0), except that the value is then -1 minus the encoded unsigned integer. For example, the integer -500 would be 0b001_11001 (major type 1, additional information 25) followed by the two bytes 0x01f3, which is 499 in decimal.

But the above description, which comes in the main part of the spec, is not clear about this. Maybe it is the bane of designing a serialization spec: we have to keep it a little open-ended.
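To make the trick in the first quote concrete, here is a hedged sketch (not from the draft) of splitting an i64 into CBOR major type and payload without branching on the sign:

// Branch-free split of a signed integer into CBOR major type and
// payload, following the spec text quoted above; illustrative only.
fn cbor_int_parts(n: i64) -> (u8, u64) {
    // Arithmetic shift by 63 gives -1 (all ones) for negative n, else 0.
    let sign = n >> 63;
    // For negative n, payload = -1 - n, which equals !n (bitwise
    // complement); for non-negative n, `n ^ 0` leaves n unchanged.
    let payload = (n ^ sign) as u64;
    // Major type 1 for negative, major type 0 for non-negative.
    let major = (sign & 1) as u8;
    (major, payload)
}

fn main() {
    // The spec's example: -500 encodes as major type 1 with value 499.
    assert_eq!(cbor_int_parts(-500), (1, 499));
    assert_eq!(cbor_int_parts(42), (0, 42));
}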

I will book this as a TODO for now.

rvagg commented 3 years ago

Yeah, some of these details get a bit too in-the-weeds for the data model and we have to be careful not to drive the data model according to one codec's quirks (or one language's quirks!). It's a juggling act of compromise everywhere. If you read https://github.com/ipld/specs/blob/master/block-layer/codecs/codecs-and-completeness.md carefully you'll see that basically nothing we have now is up to standard, and in some places it's not even clear what the standard is (let's not even get into data model map "ordering" questions ...).

warpfork commented 3 years ago

As a tree of bullet points, forgive me if it's terse:

prataprc commented 3 years ago

making a Node "interface" that is the container for actual values.

A trait in Rust is a type-system-only concept. But there is something called a trait-object that "kind of matches" interface{}. Trait-objects have some limitations, though.
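As a toy illustration (not from the draft) of one such limitation, object-safety: a trait with a generic method cannot be used as a trait-object, while an object-safe trait can be used behind dyn much like a Go interface{}:

// A trait with a generic method is not object-safe: the compiler
// rejects `dyn WithGeneric`.
trait WithGeneric {
    fn visit<T>(&self, value: T);
}

// An object-safe trait can be used behind `dyn`.
trait ObjectSafe {
    fn kind_name(&self) -> &'static str;
}

fn describe(node: &dyn ObjectSafe) -> &'static str {
    node.kind_name() // dynamic dispatch through a vtable
}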

you might want to have more than one memory layout that implements the Node methods.

I am yet to study the higher-level abstractions in IPLD. But I think I get it.

Or maybe the macro generates the whole type specification.

Sounds like a good idea.

AsString "unboxing" method that returns a (utf-8 strictly) String (but also has to return an error, instead, potentially, if the string contained other non-utf-8 bytes). It can simultaneously have an AsFFIString

+1

previously we had a lot of error returns on our node unboxing methods. We're going to move to panics for most of them, though:

Rust error handling is cooler than in most other languages. With some glue logic we can chain errors across components, and the only noise we might see in our functional logic is the ? operator. Having said this, Rust's stdlib follows the advice of "use panic wherever it makes sense". Though I had the habit of using Result and err_at!() all the time, I removed them later for better-looking API/code. The code change can be safely done.
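A toy illustration of that glue logic (the error type and names are made up for the example):

use std::fmt;

#[derive(Debug)]
struct CodecError(String);

impl fmt::Display for CodecError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "codec error: {}", self.0)
    }
}

// The From impl is the glue: it lets `?` convert a component's error
// into ours automatically.
impl From<std::num::ParseIntError> for CodecError {
    fn from(e: std::num::ParseIntError) -> Self {
        CodecError(e.to_string())
    }
}

fn parse_len(s: &str) -> Result<usize, CodecError> {
    // The happy path stays free of error-handling noise.
    let n: usize = s.trim().parse()?;
    Ok(n)
}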

definitely do not use unicode collation.

+1


Overall I agree with the idea of keeping Kind as an abstract type. Now the implementation has two choices:

a. use type-parameters and define traits that match the operational behaviour of Kind.

enum Kind<V, M>
where 
    V: Node,
    M: Node,
{
    Null,
    Bool(bool),
    Integer(i128),
    Float(f64),
    Text(String),
    Bytes(Vec<u8>),
    Link(Cid),
    List(V),
    Dict(M),
}

b. use trait-objects.

Leaning towards (a) now. Also keeping Text as String; the String's underlying value can be either checked-utf8 or unchecked binary. An API can be added on top of Kind for utf8 checks and other type-conversions.

I think I might have to re-visit all this after studying more on IPLD. Thanks for the comments.

prataprc commented 3 years ago

My bad, I didn't realize that I would end up with an HKT issue with (a), that is, parameterizing Kind<V, M> over V and M. Will have to try (b). But that might mean we have to forego type-parameters on methods in Node.

prataprc commented 3 years ago

After looking at the Node interface{}, I am able to understand why Kind was suggested to be a simple enumeration and not a sum-type.

/// Kind of data in data-model.
pub enum Kind {
    Null,
    Bool,
    Integer,
    Float,
    Text,
    Bytes,
    Link,
    List,
    Map,
}

IPLD basic data-model can be implemented as,

/// Basic defines IPLD data-model.
pub enum Basic {
    Null,
    Bool(bool),
    Integer(i128), // TODO: i128 might be an overkill, 8 more bytes than 64-bit!!
    Float(f64),
    Text(String),
    Bytes(Vec<u8>),
    Link(Cid),
    List(Box<dyn Node + 'static>),
    Map(Box<dyn Node + 'static>),
}

which is now truly a recursive-type, where sub-trees are implemented as Node.

The list/map lookup type (key) is implemented as a sum-type; this way we can keep it simple and at the same time leave scope for future additions.

/// A subset of Basic, that can be used to index into recursive type, like
/// list and map. Can be seen as the path-segment.
#[derive(Clone)]
pub enum Key {
    Null,
    Bool(bool),
    Offset(usize),
    Text(String),
    Bytes(Vec<u8>),
}

The Key type implements the Eq and Ord traits to address IPLD's operational behaviour.
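A minimal way to get that behaviour, assuming variant-order-then-value is the intended ordering, is to derive it:

// Derived Ord compares the variant discriminant first (in declaration
// order), then the contained value; this assumes that ordering is what
// the operational behaviour calls for.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Key {
    Null,
    Bool(bool),
    Offset(usize),
    Text(String),
    Bytes(Vec<u8>),
}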

Nevertheless, the Cbor type still holds on to Vec<Cbor> for lists and BTreeMap<String, Cbor> for maps. Hope that is okay.

Finally, the Node interface{} is defined in Rust as:

/// Everything is a Node, almost.
pub trait Node {
    /// return the kind.
    fn to_kind(&self) -> Kind;

    /// if kind is recursive type, key lookup.
    fn get(&self, key: &Key) -> Result<&dyn Node>;

    /// iterate over values.
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = &dyn Node> + 'a>;

    /// iterate over (key, value) entry, in case of list key is index
    /// offset of value within the list.
    fn iter_entries<'a>(&'a self) -> Box<dyn Iterator<Item = (Key, &dyn Node)> + 'a>;

    /// if kind is container type, return the length.
    fn len(&self) -> Option<usize>;

    /// check whether the underlying kind is Null.
    fn is_null(&self) -> bool;

    /// unbox as bool, if kind is Bool.
    fn to_bool(&self) -> Option<bool>;

    /// unbox as integer, if kind is Integer.
    fn to_int(&self) -> Option<i128>;

    /// unbox as float, if kind is Float.
    fn to_float(&self) -> Option<f64>;

    /// unbox as utf8-validated string, if kind is Text.
    fn as_string(&self) -> Option<Result<&str>>;

    /// unbox as string without utf8 validation, if kind is Text.
    fn as_ffi_string(&self) -> Option<&str>;

    /// unbox as raw bytes, if kind is Bytes.
    fn as_bytes(&self) -> Option<&[u8]>;

    /// unbox as Cid, if kind is Link.
    fn as_link(&self) -> Option<&Cid>;
}
warpfork commented 3 years ago

I'm probably going to have to look at that repeatedly, but at first glance, I think that looks like it'll fly!

warpfork commented 3 years ago

The Key type idea is really interesting. I'm not sure how it'll go. It's definitely not the same as what we've done in golang, but it might work well.

Key probably shouldn't have Null in the sum... I think we've declared those outlawed in the data model. (Because it would be "mixed kind keys" to have that -- e.g. string-or-null -- and we didn't want to deal with it.) (Possible that the data model spec needs clarification on this, though.)

One more thought I hadn't mentioned earlier in this issue, but realized during other discussions today might be interesting/important... in IPLD Schemas, there is the ability to specify maps with typed keys and typed values. So, in golang, this ended up with the MapIterator types having a return signature of (key Node, value Node, maybeIOerror Error) -- the interesting bit being that keys are actually Node, cropping up yet again. Having Node here lets us keep hold of any schema type information when we pass that value around!

The features IPLD Schemas introduced for typed keys go even further: schemas can have maps with complex keys, as long as they fit certain rules. Namely: you can use even structs and unions as map keys on the condition that they have representation strategies which can reduce them to strings (because that means we can still serialize them, when we get that data all the way to the bottom of the stack). So again, having map iterators use Node for keys turns out pretty consequential for this -- it lets us keep even those complex keys intact and typed as we pass them around. (One could, I think, maybe also make this work by just to-string'ing complex keys, and telling the user to "deal with it" if they wanted to use that information later again in a typed context as their complex form (the struct or union or whatnot). But it probably wouldn't be pretty and it probably wouldn't be fast.)

prataprc commented 3 years ago

Having Node here lets us keep a hold of any schema type information when we pass that value around!

Any pointers, say in golang, on where to find the schema type information? Maybe with some real examples I can understand this better.

prataprc commented 3 years ago

We can add the following methods to the Node interface{}.

trait Node {
    fn as_key<'a>(&'a self) -> Option<Box<dyn Keyable + 'a>>;

    fn as_schema<'a>(&'a self) -> Option<Box<dyn Schemad + 'a>>;

    ...
}

Note that we have created two traits (interfaces), Keyable and Schemad. Subsequently, change the get() and iter_entries() APIs as:

trait Node {
    fn get(&self, node: &dyn Node) -> Result<&dyn Node>;

    fn iter_entries<'a>(&'a self) -> Box<dyn Iterator<Item = (&dyn Node, &dyn Node)> + 'a>;
}
prataprc commented 3 years ago

Keeping the Kind and Basic types as they are from the earlier comment, the Key type and Node trait are updated as follows.

pub enum Key {
    Bool(bool),
    Offset(usize),
    Text(String),
    Bytes(Vec<u8>),
    Keyable(Box<dyn ToString>),
}

Note that we are making ToString a hard constraint for Keyable types. Concrete types (struct, union) can implement the to_string() API and can choose to memoize it for repeated calls (a sketch follows at the end of this comment).

pub trait Node {
    fn as_key(&self) -> Option<Key>;
    ...
}
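As an illustration of the memoization point above, a hypothetical complex key could render its string form once and cache it (the type, fields, and layout here are made up):

use std::cell::OnceCell;
use std::fmt;

// Hypothetical struct key that reduces to a string; the rendered form
// is memoized so repeated to_string() calls do not re-format.
pub struct StructKey {
    a: u64,
    b: String,
    rendered: OnceCell<String>,
}

impl fmt::Display for StructKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = self.rendered.get_or_init(|| format!("{}:{}", self.a, self.b));
        f.write_str(s)
    }
}

// Implementing Display gives ToString for free via the blanket impl,
// so a StructKey can be boxed into Key::Keyable(Box<dyn ToString>).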
vmx commented 3 years ago

Please note that the "strings" @warpfork mentions can contain arbitrary bytes. So in Rust it's a Vec<u8> and not a String. Also the current Data Model spec allows for arbitrary bytes in what you call Text.

prataprc commented 3 years ago

@vmx Thanks for bringing it up.

If Text(String) needs to be Text(Vec<u8>), can we use Bytes(Vec<u8>)?

The way I understood:

vmx commented 3 years ago

If Text(String) needs to be Text(Vec<u8>), can we use Bytes(Vec<u8>)?

I don't understand that question.

  • Provide two APIs: one to check and return the previously-assumed-string as a utf8-validated string; the second to return the ffi-string as it is and let the application (higher layers) validate it as per need.

Yes. I think this is how it would need to be done. I had hoped the Data Model could just define Strings being valid Unicode.

  • While reading the Map keys, read them as utf8-validated strings if the codec calls them strings. In the future we can add support for other kinds as well.

No, those always need to be bytes. There is currently an ongoing discussion on how to exactly specify it in a clean way, but those should really be bytes.

alexjg commented 3 years ago

@warpfork Correct me if I'm wrong here, my understanding is that the primary reason for the Node interface in Go is to allow minimal allocations whilst still allowing deserialization to many implementing types (the 'different memory layouts' you mention)?

I'm not certain that this is necessary in Rust; we could take some inspiration from serde's zero-copy deserialization, which achieves the same goal (I think) via lifetimes on the Deserialize trait. I personally feel it would be worth investigating, as I think the serde-style "implement Deserialize<MyObject>" idiom would be much more familiar to Rust programmers and more amenable to automation via custom derive macros.
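For reference, the serde zero-copy idiom looks roughly like this (the type and field names are invented; assumes the serde crate with the derive feature):

use serde::Deserialize;

// `text` borrows from the input buffer instead of allocating a String;
// the borrow is tracked by the lifetime on the Deserialize impl.
#[derive(Deserialize)]
struct Record<'a> {
    #[serde(borrow)]
    text: &'a str,
    offset: u64,
}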

vmx commented 3 years ago

I'm not certain that this is necessary in Rust, we could take some inspiration from serde's zero copy deserialization, which achieves the same goal (I think) via lifetimes on the Deserialize trait

I think it's similar to what rust-ipld is doing here with the Encode and Decode traits: https://github.com/ipfs-rust/rust-ipld/blob/a2b6f30631bfd45a0ffdc78a9aa7e12c2b809f1e/core/src/codec.rs

The DAG-CBOR implementation using those traits is here: https://github.com/ipfs-rust/rust-ipld/tree/a2b6f30631bfd45a0ffdc78a9aa7e12c2b809f1e/dag-cbor/src

You go straight from codec to native types and back.

prataprc commented 3 years ago

Yes. I think this is how it would need to be done. I had hoped the Data Model could just define Strings being valid Unicode.

I am using std::str::from_utf8_unchecked(), so I guess it is similar to Text(Vec<u8>). On second thought, Text(Vec<u8>) feels better, because Basic is a pub enum and users might directly match on Text(String), which is bad.

If Text(String) needs to be Text(Vec<u8>), can we use Bytes(Vec<u8>) ?

Right now as_string() and as_ffi_string() do a positive match for the Text variant. Should we also check for the Bytes variant?

match self {
    Basic::Text(val) => Some(err_at!(FailConvert, from_utf8(val.as_bytes()))),
    Basic::Bytes(val) => Some(err_at!(FailConvert, from_utf8(val.as_slice()))),
    _ => None,
}

While reading the Map keys, read them as utf8-validated strings if the codec calls them strings. In the future we can add support for other kinds as well

No, those always need to be bytes. There is currently an ongoing discussion on how to exactly specify it in a clean way, but those should really be bytes.

I am guessing the key is going to come from upper layers, say via a path-segment. I might visualize that as,

| path-segment-key | map-key |
| ---------------- | ---------- |
| Arbitrary | Arbitrary |
| utf8-bytes | Arbitrary |
| Arbitrary | utf8-bytes |
| utf8-bytes | utf8-bytes |

Note that in the above table we can replace utf8-bytes with 64-bit-uint-big-endian-bytes, but the situation is the same, which is: can we guarantee that a key consistently indexes into the same map-value for all 4 combinations?

We need a precise definition for the key so that the incoming key and the map-keys fall within the same value-domain, aka type, aka kind, like utf8-bytes or u64-big-endian-bytes. And when we are dealing with mixed value-domains, we may have to define a sort order across value-domains (if ordering/iteration is to be implemented on the map).

Now, it is also possible to have something called collation-encoding, which is a form of encoding multiple types into a seamless collection of bytes. Please note that I am not assuming the map is ordered here, just that there is a collation algorithm across multiple kinds (especially those defined by the data-model) that not only gives point-lookup into the map, but can also maintain the map as an ordered set if that is required; and it is byte-represented, memcmp-compatible and isomorphic.
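A hedged sketch of such a collation encoding over the Key enum from the previous comment (the tag values and layout are invented; nothing here is in any spec):

// Illustrative collation encoding: a one-byte kind tag followed by a
// memcmp-comparable payload. Plain byte comparison on the output then
// orders keys first by kind, and by value within a kind.
pub fn collate(key: &Key) -> Vec<u8> {
    let mut buf = Vec::new();
    match key {
        Key::Bool(v) => {
            buf.push(0x01);
            buf.push(*v as u8);
        }
        Key::Offset(v) => {
            buf.push(0x02);
            // Big-endian keeps numeric order and byte order in agreement.
            buf.extend_from_slice(&(*v as u64).to_be_bytes());
        }
        Key::Text(v) => {
            buf.push(0x03);
            buf.extend_from_slice(v.as_bytes());
        }
        Key::Bytes(v) => {
            buf.push(0x04);
            buf.extend_from_slice(v);
        }
        Key::Keyable(v) => {
            buf.push(0x05);
            buf.extend_from_slice(v.to_string().as_bytes());
        }
    }
    buf
}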

vmx commented 3 years ago

BTW: I just re-read the CBOR spec about major types. The CBOR type for strings (major type 3, text strings) must be encoded as UTF-8. So even if the IPLD Data Model currently supports arbitrary bytes, CBOR does not. So in case you only care about CBOR encoding it might change things.

warpfork commented 3 years ago

@warpfork Correct me if I'm wrong here, my understanding is that the primary reason for the Node interface in Go is to allow minimal allocations whilst still allowing deserialization to many implementing types (the 'different memory layouts' you mention)?

I'm not certain that this is necessary in Rust; we could take some inspiration from serde's zero-copy deserialization, which achieves the same goal (I think) via lifetimes on the Deserialize trait. I personally feel it would be worth investigating, as I think the serde-style "implement Deserialize<MyObject>" idiom would be much more familiar to Rust programmers and more amenable to automation via custom derive macros.

So, part of the reason for having a Node interface is for those various "go fast" specializations to be possible, yes, but it's not the only one. We also use that interface for generalized traversals (and thus for higher level features like pathing, and more powerfully, Selectors), and we also expect it to be implemented by ADLs (thus getting pathing to work transparently over them "for free", etc), and even the Schema stuff implements it twice -- once for the type-level semantics and separately for the representation-level view of the data.
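In Rust terms, such a generalized traversal written only against the Node trait sketched earlier in this thread might look like this (illustrative only):

// Depth-first walk over any Node implementation; because it only uses
// the trait, it works across memory layouts, ADLs, etc.
fn walk(node: &dyn Node, visit: &mut dyn FnMut(&dyn Node)) {
    visit(node);
    match node.to_kind() {
        Kind::List | Kind::Map => {
            for child in node.iter() {
                walk(child, visit);
            }
        }
        _ => (),
    }
}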

I don't know exactly how some of this will translate to Rust, because I hear that traits are not quite the same as interfaces in other languages. But it's hard for me to imagine getting good foundations to build towards these higher level features without having some kind of a unifying interface or contract like this, in some way or another.