dag-pb: more accurate schema and up-to-date impl description

rvagg commented 4 years ago

Round 1,490 of DAG-PB specification cleanup..

I'm working on getting this just right in https://github.com/ipld/js-dag-pb/ and am taking the opportunity to be very strict about Data Model representation. We have the challenge of wanting to ensure that we have previously accepted some flexibility in the way the components of a DAG-PB were constructed (friendly API) but now we want to be much more strict - you must give us an object of exactly the right shape as defined by the Schema such that if we round-trip it it'll come out the same way (while also offering utilities on the side to regain some of the friendly API functionality).

Thanks to Hannah's efforts in https://github.com/ipld/go-ipld-prime-proto we have a good chance at full hamonisation of the two implementations! go-ipld-prime-proto has a much more strict interpretation of the Data Model form than the previous Go implementation (which offers the shortcut named pathing form). So I want to try and get that impl and the new JS impl aligned and the Schema here reflecting it correctly. I don't think we quite got it right the last time around, which was hampered by having to conform to the varying legacy implementations and hand-waving about Data Model form. I think we can be much clearer now that we have a clearer concept of the Data Model in the new implementations.

I'm looking mainly at https://github.com/ipld/go-ipld-prime-proto/blob/master/coding.go and its use of https://github.com/ipfs/go-merkledag/tree/master/pb to pull in properties to inform some of the Schema changes, I'd appreciate double-checking on that please. I plan to conform JS to this.

Some additional notes inline.

rvagg commented 4 years ago

Also pushed in an update that adds the sorting order to Links which we've neglected to document before. I don't know if go-ipld-prime-proto does this when encoding, maybe it defers to go-merkledag for it but if it does there might be a non-roundtrippyness problem there where you construct a DAG-PB object in memory, traverse through it assuming it's "correct" but it comes out in a different form when it does a round-trip (same problem in general with strictness in object shapes I'm trying to correct for in JS).

rvagg commented 4 years ago

OK, did some investigation with Go to see what it's doing at the lowest level and that resulted in https://github.com/ipfs/go-merkledag/pull/58 and some winding back of most of the proposals in here. The only real one that changes the schema now is:

Hash optional Link

Plus I've added some notes below with further constraints that can't be expressed with the schema. Particularly that Links: [] is invalid and an empty Links array must be null. There's no difference at the byte level. But it's possible to have zero-length byte arrays in both Data and Name, as well as null for both. I've added additional constraints around nulls and zero-length byte arrays and the validity of the Hash -> CID conversion.

I was going to use go-ipld-prime-proto as a reference but noticed this in the README:

It does not include robust facilities to support actual creation of DAG protobuf nodes

It really is minimal and defers mostly to go-merkledag and has some very rough rules about what it's doing with the data. So what we define here should be the rules used when go-ipld-prime-proto is properly fleshed out so it's worth getting this right.

I've also added a section about a zero-length block being valid. Both current Go and JS implementations will decode this into { Data: null, Links: null }, so it's important for implementers to be aware this is a case they may need to test for.

rvagg commented 4 years ago

ok! I've had my head in this all week and I'm much more comfortable with where this is at now. I've reversed a bit from the previous comment, the schema uses optional for these fields and not a kinded union that includes Null so { Data: null, Links: null } is not valid according to the schema unless we go with a kinded union. In Go there's nil for these fields, but go-ipld-prime has Null as a separate thing, and IsAbsent() can tell you whether it's there or not, it just doesn't have to be part of a graph. In JavaScript we just don't define it (or undefined). So:

The only major change to the schema as making the Hash optional. I don't see a way around this, an empty PBLink is possible in the binary format, who knows how many of those are in the wild. { Links: [{}] } is valid.
I've added a bunch of constraints clarifying things:
- one notable one is that the Hash bytes must be a valid CID. I think that's reasonable, I believe all implementations are treating them as such so even if there are blocks in the wild with bad Hash values, they should be rejected already.
- empty Links array is indistinguishable from an omitted one in the binary format, that's an important clarification (and this is different to an empty Data byte array which is distinguishable from an omitted one).

Implementation stuff:

Go "forms" tests is here: https://github.com/ipfs/go-merkledag/pull/58 - it allows a few things that we should disallow in go-ipld-prime-proto - the zero-length Links case (indistinguishable from no Links) and the non-CID Hash.
New JS implementation is now fully compatible with this spec, it's strict on the way in, and mostly strict on the way out (won't check sorting of Links array is the only bit of non-strictness). It also has a prepare() method that lets you re-shape a sloppy input object into a precise Data Model form, and encode() is super-strict so you won't end up having an in-memory representation that's different from its round-trip version. I also have a mirror of the "forms" tests to validate these edge cases and their binary representations. https://github.com/ipld/js-dag-pb

rvagg commented 4 years ago

OK, after sleeping on it and then talking it through this morning in the meeting and with Mikeal I've decided to even wind back the optional for Hash. So the only real change to the schema text is that it must be a Link, that is, it must be interpreted as a CID and anything else is invalid. I've added this to my notes, that a block should be rejected. Now a [{}] is not a valid Links array.

I've also added that a block with "unrecognized" data should also be rejected. This is a should because most PB decoders will silently accept extraneous data. In go-merkledag there's a XXX_unrecognized property on both nodes to capture it, but any data in there should signal a bad block.

I'm about to go enforce this on the new JS DAG-PB library. Maybe we'll find some source of data out there that has these bad blocks, but until we see them in the wild we can afford to be strict in our newer implementations.

rvagg commented 4 years ago

OK, I'm going with @warpfork's argument about Links cardinality. An omitted one in the binary form will be interpreted as an empty array. In the Data Model the array must always be present.

I've done another pass of the doc and cleaned up the clarifications to the schema. I've also added notes in the Protobuf schema about our recommendations for strict decoding.

https://github.com/ipld/js-dag-pb now matches this. I'm going to work on a Go form too I think now that I have a good test suite but it'll just be a simple replacement to go-merkledag/pb, I don't think I can manage the ipld-prime piece of it and the question of whether to integrate it in ipld-prime-proto is still up for discussion. So I'll just go with the low-level pieces for now.

ipld / specs

dag-pb: more accurate schema and up-to-date impl description #297