Closed rvagg closed 4 years ago
Also pushed in an update that adds the sorting order to Links
which we've neglected to document before. I don't know if go-ipld-prime-proto does this when encoding, maybe it defers to go-merkledag for it but if it does there might be a non-roundtrippyness problem there where you construct a DAG-PB object in memory, traverse through it assuming it's "correct" but it comes out in a different form when it does a round-trip (same problem in general with strictness in object shapes I'm trying to correct for in JS).
OK, did some investigation with Go to see what it's doing at the lowest level and that resulted in https://github.com/ipfs/go-merkledag/pull/58 and some winding back of most of the proposals in here. The only real one that changes the schema now is:
Hash optional Link
Plus I've added some notes below with further constraints that can't be expressed with the schema. Particularly that Links: []
is invalid and an empty Links
array must be null
. There's no difference at the byte level. But it's possible to have zero-length byte arrays in both Data
and Name
, as well as null
for both. I've added additional constraints around nulls and zero-length byte arrays and the validity of the Hash
-> CID conversion.
I was going to use go-ipld-prime-proto as a reference but noticed this in the README:
It does not include robust facilities to support actual creation of DAG protobuf nodes
It really is minimal and defers mostly to go-merkledag and has some very rough rules about what it's doing with the data. So what we define here should be the rules used when go-ipld-prime-proto is properly fleshed out so it's worth getting this right.
I've also added a section about a zero-length block being valid. Both current Go and JS implementations will decode this into { Data: null, Links: null }
, so it's important for implementers to be aware this is a case they may need to test for.
ok! I've had my head in this all week and I'm much more comfortable with where this is at now. I've reversed a bit from the previous comment, the schema uses optional
for these fields and not a kinded union that includes Null
so { Data: null, Links: null }
is not valid according to the schema unless we go with a kinded union. In Go there's nil
for these fields, but go-ipld-prime has Null
as a separate thing, and IsAbsent()
can tell you whether it's there or not, it just doesn't have to be part of a graph. In JavaScript we just don't define it (or undefined
). So:
Hash
optional. I don't see a way around this, an empty PBLink
is possible in the binary format, who knows how many of those are in the wild. { Links: [{}] }
is valid.Hash
bytes must be a valid CID. I think that's reasonable, I believe all implementations are treating them as such so even if there are blocks in the wild with bad Hash
values, they should be rejected already.Links
array is indistinguishable from an omitted one in the binary format, that's an important clarification (and this is different to an empty Data
byte array which is distinguishable from an omitted one).Implementation stuff:
prepare()
method that lets you re-shape a sloppy input object into a precise Data Model form, and encode()
is super-strict so you won't end up having an in-memory representation that's different from its round-trip version. I also have a mirror of the "forms" tests to validate these edge cases and their binary representations. https://github.com/ipld/js-dag-pbOK, after sleeping on it and then talking it through this morning in the meeting and with Mikeal I've decided to even wind back the optional
for Hash
. So the only real change to the schema text is that it must be a Link
, that is, it must be interpreted as a CID
and anything else is invalid. I've added this to my notes, that a block should be rejected. Now a [{}]
is not a valid Links
array.
I've also added that a block with "unrecognized" data should also be rejected. This is a should because most PB decoders will silently accept extraneous data. In go-merkledag there's a XXX_unrecognized
property on both nodes to capture it, but any data in there should signal a bad block.
I'm about to go enforce this on the new JS DAG-PB library. Maybe we'll find some source of data out there that has these bad blocks, but until we see them in the wild we can afford to be strict in our newer implementations.
OK, I'm going with @warpfork's argument about Links
cardinality. An omitted one in the binary form will be interpreted as an empty array. In the Data Model the array must always be present.
I've done another pass of the doc and cleaned up the clarifications to the schema. I've also added notes in the Protobuf schema about our recommendations for strict decoding.
https://github.com/ipld/js-dag-pb now matches this. I'm going to work on a Go form too I think now that I have a good test suite but it'll just be a simple replacement to go-merkledag/pb, I don't think I can manage the ipld-prime piece of it and the question of whether to integrate it in ipld-prime-proto is still up for discussion. So I'll just go with the low-level pieces for now.
Round 1,490 of DAG-PB specification cleanup..
I'm working on getting this just right in https://github.com/ipld/js-dag-pb/ and am taking the opportunity to be very strict about Data Model representation. We have the challenge of wanting to ensure that we have previously accepted some flexibility in the way the components of a DAG-PB were constructed (friendly API) but now we want to be much more strict - you must give us an object of exactly the right shape as defined by the Schema such that if we round-trip it it'll come out the same way (while also offering utilities on the side to regain some of the friendly API functionality).
Thanks to Hannah's efforts in https://github.com/ipld/go-ipld-prime-proto we have a good chance at full hamonisation of the two implementations! go-ipld-prime-proto has a much more strict interpretation of the Data Model form than the previous Go implementation (which offers the shortcut named pathing form). So I want to try and get that impl and the new JS impl aligned and the Schema here reflecting it correctly. I don't think we quite got it right the last time around, which was hampered by having to conform to the varying legacy implementations and hand-waving about Data Model form. I think we can be much clearer now that we have a clearer concept of the Data Model in the new implementations.
I'm looking mainly at https://github.com/ipld/go-ipld-prime-proto/blob/master/coding.go and its use of https://github.com/ipfs/go-merkledag/tree/master/pb to pull in properties to inform some of the Schema changes, I'd appreciate double-checking on that please. I plan to conform JS to this.
Some additional notes inline.