car: CARv2 spec - Githubissues

rvagg commented 4 years ago

WIP for discussion.

Highlights:

The v1 "header" block becomes a "version" block in v2, a CBOR encoded {version:2}—we can exact match the first 11 bytes to determine if it's a v2 (0x0aa16776657273696f6e02—essentially our CARv2 file magic number), otherwise fallback to decoding as per v1. CARv1 code should read this block and bork on version:2.
New "header" block is the first block after the version block, properly CID-prefixed, contains:
- a deterministic flag
- an array of CID + selector pairs

Deterministic CARs have to exhaustively describe the contained DAGs via the roots + selectors (the selectors can describe boundaries of partial DAGs but they must do so exhaustively). Therefore the CID of the header block for a deterministic CAR should uniquely identify that graphs it contains.

Non-deterministic CARs are lax, anywhere from the deterministic exhaustiveness to just-a-bundle-of-blocks with no roots. Where roots are present, selectors are optional.

A CARv1 would be read as a non-deterministic CARv2 with absent selectors.

vmx commented 4 years ago

I had a quick read and not really something to add as I'm not that deep into CAR files. Though it's sad to see that we can't do a [ varint | CID | dag-cbor block ] * that also includes the header. due to not being able to distinguish Version 1.

mikeal commented 4 years ago

This is a good way to approach the initial discussion, but I think that the next version of the spec should probably be either a separate file or a separate sub-section so that we can preserve the original spec. (I read the diff wrong, reading the rendered new file is much clearer)

mikeal commented 4 years ago

Do we want to try adding the index as a trailer?

rvagg commented 4 years ago

@ribasushi

At that stage isn't it simpler to just have proper CBOR:
type DataFrame struct {
  Cid bytes
  Data bytes
}

This is an interesting proposal. There's two paths here -- cbor or dag-cbor. In both cases you you'd end up replacing varints with CBOR's ints describing lengths, prefixed with type data. If we used plain cbor we'd need to specify a smallest-representation rule (as per dag-cbor https://github.com/ipld/specs/pull/236).

For plain cbor you'd only add an extra byte for an average CID (more for long hashes) over using a varint. For dag-cbor you'd get an additional 3 bytes to properly represent it (tag, plus length, plus weird 0x0 lead byte). The data block would depend on how much data there is but I'd expect that you'd only be adding 1 or possibly 2 bytes for average block sizes.

A benefit is that you simplify the decode for new languages. They just need a cbor parser. Another benefit is that we can lean on our existing materials (like schemas) making this more part of the IPLD family in a doc & spec sense rather than such a special snowflake.

The complication for efficient implementations is that you'd want to do this streaming and partial so we'd end up rewriting the basics of major type 2 cbor parsing, which, tbh, isn't too much beyond what we do now, but at least we have easy varint libraries to lean on.

Edit: will also require a type lead-in, list or map, list being the most efficient, so representation tuple but that's an extra byte at the front of each "frame".

warpfork commented 3 years ago

Relatedly: there's now also work on a new format called "dar", which I think had a lot of the same concerns as discussed here: https://github.com/anorth/go-dar

mikeal commented 3 years ago

I’m still considering all the details, but my initial reaction to DAR is that it’s a better approach. In trying to iterate on the existing CAR format we don’t have a lot of opportunities for efficiency gains.

DAR seems to do a nice job of introducing constraints we don’t have in CAR that then provide space savings and assurances while also adding in a few useful features.

Keeping in mind that we already have the existing CAR format, so use cases that cannot live within the constraints of DAR will still have a format they can rely, I think we may want to put off iterating on CAR in favor of DAR or something similar.

mikeal commented 3 years ago

I’ll also add that, given how much we’ve already adopted the existing CAR format w/ recommended constraints (single root only) the utility of an iteration that ads the complexity of detecting and potentially supporting two versions of the CAR format may be worse than just adding an entirely new format. It don’t think we gain anything by having two version of CAR instead of two separate but complimentary formats.

warpfork commented 3 years ago

What can we do with this PR?

My concerns and goals today would be:

get the open PR count in this repo down
document whatever we learned -- by whatever means necessary
bonus/stretch goal: slowly anneal towards understanding what's up in the discussion of {car} vs {carv2} vs {carbs} vs {dar}.
- (figuring out how to land this PR should probably not get hung up on this though; bigger topic / multiple topics.)

I'm not actually sure how to wrangle this one, at a glance. First of all -- are we still actively working on it? (It's not that stale; I'm just on a close-a-thon today.)

Here are some options for moving forward I can think of (but maybe there are also other better ideas? I'm receptive):

Could we extract this content into an exploration report, and merge it thusly?
- (always an options, but might not be fun in this case: there's a good amount of content here, and it's intermixed with other existing content, so this suggests a lot of rewriting.)
Should we merge it almost entirely as-is, but maybe add some disclaimers about how widely implemented the v2 is as of today?
- (I'm not actually sure how widely implemented the v2 is! So someone else's input here will be critical :))
Is this a candidate for the "merge-ignore" change management strategy -- where we "merge" the commits (so they're referencable in git history) but don't actually apply the diffs to the content as it appears in master?
- (this is a very non-committal approach. Doesn't mean we can't resume the work later; if we do, we just cherry-pick the content and keep going. It just results in fewer open branches in the meanwhile.)
... or is this actually still actively in flight, and I should hush myself on pushing for merge?

rvagg commented 3 years ago

Not being actively worked on, should probably not be merged, this really is more like an exploration report than anything else. Focus has moved on to ways to augment CARv1 (CARBS, etc.) and entirely different approaches to a CARv2 (i.e. DAR). Nobody is actively exploring improving the existing format in-place and I don't see work on the horizon. Although this PR and discussion does contain some potentially useful tidbits for any future work to learn from.

rvagg commented 3 years ago

I just wrote this elsewhere and wanted to record it somewhere more permanent, it clarifies the versioning statements in OP:

The problem we have with iterating on versions and doing any kind of successful differentiation from CARv1 is the instability of the leading bytes. The header in CARv1 is: [varint][DAG-CBOR({roots:[CID,...],version:1})]. And that initial varint is not reliable in any way except that it's a varint. The DAG-CBOR bit following it is a bit more reliable because it starts with something like: MAP(2){STRING(5,"roots"),ARRAY(... and it breaks down from there. So that gives you some form of magic bytes that you'll most likely see, 0xa265726f6f7473 but that's after the varint. The risk that this presents is that if you want to have clear differentiation then your leading magic bytes need to be clearly distinguishable from anything that suggests CARv1.

In this PR I suggested just making the magic bytes 0x0aa16776657273696f6e02 because it's [varint(10)][DAG-CBOR({version:2})] and the existing CAR readers can read that and bork at "oh, this is version 2, I don't know how to deal with this (it also has missing roots, but we could add in an empty roots array to be safer for decoders that assume it's there). Then version detection is much more straightforward, you could read the first 11 bytes and very safely determine whether you have a v2, then fall back to v1 parsing. What's more, there's a straightforward path to v3, it's just 0x0aa16776657273696f6e03, so just that 11th byte is different, nice and stable and CARv1 decoders will error meaningfully ("Don't know how to decode version 3").

0x0aa16776657273696f6e02 is first 0x0a which is just uint(10) because varints this small are just ints. Then followed by 10 bytes of CBOR:

a1                                                # map(1)
  67                                              #   string(7)
    76657273696f6e                                #     "version"
  02                                              #   uint(2)

Of course we could just abandon version sniffing from leading-bytes, but that's not very multiformats is it!

/cc @raulk

raulk commented 3 years ago

@rvagg Makes sense to me. Question:

Is it possible for CARv2 and subsequent versions to extend that initial header structure for additional purposes, as long as they preserve the version field. Or are we constraining this initial header and convicting it to ever only being used as a version discriminator, leaving it up to the version itself to specify the header?
- e.g. CARv2 introduces the concept of "characteristics" that express traits about the CAR and the DAG, e.g. whether an index follows, whether the DAG is order depth-first, whether it's deduplicated, etc.

willscott commented 3 years ago

My understanding is that the initial magic bytes is a signal of 'this is not a carv1' (and is in fact carv2) - and indicates to expect a carv2 metadata header with the characteristics to follow. If we end up wanting features that need to change that second header in a non-compatible way, we would consider changing the magic bytes at the beginning to indicate we're now on carv3

raulk commented 3 years ago

Understood. Comparing the 0x00 approach with the magic number, what the latter buys is a nicely readable error message at the expense of 10 extra bytes, right?

willscott commented 3 years ago

Understood. Comparing the 0x00 approach with the magic number, what the latter buys is a nicely readable error message at the expense of 10 extra bytes, right?

and a static 8 byte magic header for mime type detection.

rvagg commented 3 years ago

It might be best if we just view the CARv1 header as a minor mistake in that it didn't lead with a version (like we do with everything else multiformats style) but baked it into the block itself, which leaves little wriggle room for changing it around. Sorting order in DAG-CBOR gives us version coming last in most cases unless we want to have longer property names. We could break that rule for this to make version sniffing easier, but it all gets very messy very quickly. So if we just treat <varint length>dag-cbor({version:X}) as our multiformats-style leading magic bytes then we get to do whatever we want after that and can break from version to version (e.g. version 3 could be 0x0aa16776657273696f6e03 etc.). It's an unfortunately long series of magic bytes but we won't be alone and maybe we can be OK with that and write it off as an accident of history?

mikeal commented 3 years ago

We’ve considered the header a mistake for some time.

If we’re going to break backwards compat on the header then we could just drop the header and make the format begin with a registered multicodec identifier followed by the cid + block data for the “new header” format. That would be “better” in every respect other than backwards compatibility.

willscott commented 3 years ago

We’ve considered the header a mistake for some time.

If we’re going to break backwards compat on the header then we could just drop the header and make the format begin with a registered multicodec identifier followed by the cid + block data for the “new header” format. That would be “better” in every respect other than backwards compatibility.

I think we end up with much of the same complexity that we've spent the last month hashing out here. I'm hesitant to go back to redesign now since that was considered as an option previously, but rejected when we decided to do the least work we could to embed the carv1 as-is with an attached index.

We've gotten to a fairly straight forward fixed-length header prepended to a carv1. We get versioning. we can signal that an index is appended. that's what we need now. The goal was not to envision the format we actually want, which we've explicitly pushed back to carv3.

ipld / specs

car: CARv2 spec #248