CAR format spec doc - Githubissues

rvagg commented 4 years ago

Builds on @mishmosh's work in #229 for a complete spec.

Best reviewed as rendered MD: https://github.com/ipld/specs/blob/rvagg/car/block-layer/content-addressable-archives.md

I'm confident in the technical details, but when reviewing I'd like you to consider the following items that I've codified:

The name, as per discussion in #229, what should it be? I've gone with "Content Addressable aRchive" but am happy to be overridden if there is existing lore about the name I'm not aware of.
The roots don't need to have a strict relationship with the blocks other than them existing somewhere in the CAR. They may describe complete or partial DAGs comprising all blocks in a CAR or they may describe DAGs formed by some of the blocks in a CAR with the rest being disconnected, or there may be no roots at all (as per https://github.com/ipfs/go-car/issues/19). I've done this in my JavaScript CAR implementation but go-car doesn't and perhaps it breaks some original intentions of the format?
An empty CAR is not ruled out, i.e. no roots, no blocks, just a header block: {version:1,roots:[]}. We could rule it out.
Duplicate blocks are not ruled out: "The possibility of duplicate blocks in a single CAR (such as for padding) is currently not specified." They may be useful for easy padding (also discussion of padding included in doc), we've discussed the possibility of padding for lining up blocks to Filecoin window boundaries but haven't gone far down that rabbit hole. Of course padding can also be done without duplicate blocks too, I just don't see a reason to codify strict no-duplicates.

/R @mishmosh @whyrusleeping @dignifiedquire @Kubuxu @Stebalien

rvagg commented 4 years ago

Responding to most of @Stebalien's inline comments here because they revolve around this key point:

CARs need to be, at a minimum, deterministic

That would explain "certified".

In https://github.com/ipfs/go-car/issues/19 we originally discussed not having any roots and allowing it to be a store of arbitrary blocks. The primary motivation was that we want to produce CARs that have a max size containing data for very large data structures (e.g. if we store very large DAGs in Filecoin we're limited by sector size). So, can we have a parent CAR that contains one or more roots and its DAG spans outside of the CAR to blocks contained "elsewhere" (the connection piece of this puzzle is yet to be determined).

If deterministic is a strict requirement for CARs then I'll change a bunch of this spec to make that the case:

mandatory roots
blocks must be part of a DAG described by the roots a. can there be unreferenced parts of a DAG? i.e. can the nodes have links to content outside of this DAG? This would rule out a lot of IPLD data structures.
no modification because there's only one-true-dag for roots
need to describe DAG walking algorithm/order to make sure the blocks go in the same order regardless of codecs and implementation

But that doesn't help with the size problem for very large DAGs.

Or do we make multiple CAR modes? Where we could introduce a "certified" mode that has a fully self-contained DAG starting at its roots, and a "lax" mode that is just a format for storing arbitrary blocks.

mikeal commented 4 years ago

I don’t think it’s all that practical to ensure the format is deterministic without making huge sacrifices in other areas.

For instance, how does the format ensure the blocks are in a particular order? We could add such a requirement, but then anyone who writes to the format won’t be able to stream to it, they’ll need to prepare all the data some other way and then write blocks in an order that ensured to be compliant with the spec.

We need to reconcile the use cases we’re building this for before we make statements like “must” about features we don’t even have yet. There’s a lot of tradeoffs to be made here and the use case that has pushed us to even write up the spec and build another implementation isn’t visible to anyone but those of us currently working on it.

Stebalien commented 4 years ago

Read:

2.a. can there be unreferenced parts of a DAG? i.e. can the nodes have links to content outside of this DAG? This would rule out a lot of IPLD data structures.

That's an interesting question. One solution would be to use selectors. Looks like there was an attempt: https://github.com/ipfs/go-car/pull/18.

But that doesn't help with the size problem for very large DAGs.

Being able to CAR a selector should help.

I don’t think it’s all that practical to ensure the format is deterministic without making huge sacrifices in other areas.

Being able to deterministically create a CAR from a CID (or a selector?) is a hard requirement for systems like Filecoin. Allowing non-canonical/non-deterministic CARs is probably fine.

For instance, how does the format ensure the blocks are in a particular order? We could add such a requirement, but then anyone who writes to the format won’t be able to stream to it, they’ll need to prepare all the data some other way and then write blocks in an order that ensured to be compliant with the spec.

Add blocks to the dag in-order, starting at the root, sorted in link order. As far as I know, we already have this feature.

Stebalien commented 4 years ago

Note: if you need a "bag of blocks" format, it might make sense to create two formats:

A bag of blocks format.
A CAR format, built on top of the bag of blocks format, that specifies a specific, deterministic layout.

hannahhoward commented 4 years ago

Hey IPLDers — we had been assuming in Filecoin land that the CAR would tend toward “simple bag of blocks” and writing the CAR from a root + selector in a deterministic way would be an out of band concern for Filecoin to handle. But that assumes Filecoin does the CARification and the entry point for storing something is a cid + selector & blockstore to read from (which is where we’re headed)

hannahhoward commented 4 years ago

Part of the reason we were headed this way is selector traversal is itself not deterministic and we didn’t want to impose that requirement on selectors in general even if we were going provide a special case when we write car files. I guess though if none of this is in the CAR spec it better be in the Filecoin spec and currently it’s not :)

Stebalien commented 4 years ago

How are we transferring the data? I assumed we'd be shipping the dag as a bunch of blocks (using, e.g., bitswap), then constructing the CAR on the remote (miner) side. If so, we need to deterministically construct it.

hannahhoward commented 4 years ago

@stebalien your understanding of the transfer mechanics is correct and you are right that CARifying absolutely has to be deterministic for filecoin’s case. Our assumption was though because Filecoin would be doing the CARification itself that deterministic construction would be a Filecoin special case. For reference while https://github.com/ipfs/go-car/issues/18 exists my current assumption is we will move this code into Filecoin itself because we may want our writing process to store even more additional information (outside the file itself) like direct byte offsets to a given cid in the car file.

mikeal commented 4 years ago

Add blocks to the dag in-order, starting at the root, sorted in link order. As far as I know, we already have this feature.

Sure, but that’s not enforced by the format, which is what we’re talking about here. Yes, of course different applications will need to build these in a deterministic way, but it’s a stretch to say that it’s the role of the format to enforce that constraint, which it sounds like you aren’t actually recommending based on “Allowing non-canonical/non-deterministic CARs is probably fine.”

The reason this is an important distinction for us is that we are currently writing tools to prepare data to be stored in Filecoin in .car files and there isn’t really a canonical representation of the data (local to the .car file) because the data is so large that a single file is sharded over multiple .car files, so each .car file doesn’t even have a local DAG to determine “ordering.” In order to understand the ordering you need data that is outside that .car file, which means we wouldn’t be able to build something that could validate any ordering at the format layer.

Stebalien commented 4 years ago

@hannahhoward the spec needs to specify how we go from a CID to a CAR, as far as I know. Otherwise, we're going to have compatibility issues.

Stebalien commented 4 years ago

Sure, but that’s not enforced by the format, which is what we’re talking about here. Yes, of course different applications will need to build these in a deterministic way, but it’s a stretch to say that it’s the role of the format to enforce that constraint, which it sounds like you aren’t actually recommending based on “Allowing non-canonical/non-deterministic CARs is probably fine.”

I've just been saying that there must exist a canonical CAR format. That's the only must.

Additionally, it would be really nice if CARs were completely self-certifying (as anything else is not a Certified ARchive by definition).

hannahhoward commented 4 years ago

@mikeal did you see my comment above describing the entry point for the filecoin system?

Filecoin does the CARification and the entry point for storing something is a cid + selector & blockstore to read from

Are you expecting to provide car files directly? The current design of the code assumes data is not already put in CAR and filecoin has control over how that happens.

mikeal commented 4 years ago

@hannahhoward I think we might be talking about different car files being created for different purposes.

We’re preparing large amounts of data to be stored in Filecoin. It’s not really practical for Filecoin to create those .car files, we have to build a lot of custom code in order to take a few PB of data and create the graphs and prepare the .car files, and we have to build all of this to run concurrently because even with 10K running in parallel it takes weeks to prepare the data.

Stebalien commented 4 years ago

@hannahhoward

Off topic: Doesn't graphsync rely on deterministic selector evaluation order for efficient validation?

On topic: How we go from a CID+Selector to a CAR is a critical part of the Filecoin spec. If a client and storage miner don't do the exact same thing, they won't be able to form a deal. If we just say "do whatever the code does" now, we'll have to turn that code into a spec as-is when we finish the spec. That's likely going to lead to a convoluted spec with a lot of edge cases. We can hand-wave network protocols and fix them later but hand-waving piece formation is going to be a problem later.

mikeal commented 4 years ago

Apologies for the large comment but this is potentially controversial and I want to make sure I capture all of the details.

I’d like to unwind/challenge the requirements being surfaced regarding determinism. To be clear, I am not questioning the need for determinism in many places where .car files are used, the need for determinism in these use cases is quite clear. What I don’t think is clear, and highly problematic, is trying to achieve and potentially ensure determinism at the .car file rather than by other means. In fact, I would argue that we already are achieving determinism by other means and that determinism is simply carrying over into the .car files we’re creating.

Currently, the .car file library in Go ensures determinism at creation time by performing a deterministic read operation that writes the .car file in a predictable and consistent manner. It can only ensure determinism by performing a deterministic read-on-write and has no way to do this as a write-only operation. So, just to be clear, we do not currently have or rely upon a guarantee than any .car file we receive is deterministic.

Instead, we rely on deterministic creation of .car files by different nodes. If a node were to generate a .car file in an indeterministic manner the operation it was trying to accomplish with other nodes would simply fail.

Another way to describe this behavior that I think could bring us a lot more flexibility and better guarantees is that we have deterministic graph readers. Or, even better, what we have are deterministic selectors.

We only have deterministic .car files to the extent that we have .car files created from deterministic selectors.

Even outside of .car file usage, selectors already must be deterministic in order to work in a trust-less environment because the client receiving the results of a selector must be able to validate the data it is receiving as a stream without excessive buffering, which requires that the ordering be consistent between different implementations. I recall this being part of the selector/graphsync specification process. We’re about to do some more selector spec work in IPLD and we could easily expand and solidify this.

Some problems with trying to place a deterministic requirement on .car files are:

Lack of a streaming interface. When we encode data to be added to a .car file we must first encode the entire graph and write it into a blockstore which we then read and perform another write operation on in order to output a deterministic .car file. This is a full write and full read of the entire graph we may otherwise be able to avoid.
No way to validate a given .car file is deterministic without a full read.
Assuming we go with the recommendations here and ensure that “a .car file can be deterministically re-constructed given the root node” we have an additional problem.
- We cannot ensure determinism of partial graphs. We can construct selectors for partial graphs but since the requirement is for nothing but a root node to be used in re-construction 🤷‍♂️ Unless we’re willing to write additional data into the .car file about the selector that created a partial graph we won’t ever have a way to validate a .car file without out-of-band information. This is further evidence that we can’t really have deterministic .car files apart from selectors.

By comparison, leaning on deterministic selectors has the following benefits:

We can validate determinism block-by-block throughout any read operation.
We can re-construct the result of any selector deterministically given the selector.
We can ensure these guarantees whenever a read operation is performed.
Given a .car file and a selector we can validate the .car file is a deterministic result of the selector.
We can create better streaming interfaces for .car file creation.

Finally, it’s worth noting that this would require zero code changes. Like I said, this is already what we’re doing, this is really just about how we talk about determinism and where we try to expand and improve these guarantees in the future.

In my view, it’s actually somewhat dangerous to say we have guarantees that we aren’t validating and could quite easily not be true. In the places I’m aware of in Filecoin where determinism is required for .car files it’s only required for specific operations that could be accomplished by other means and deals could easily be created for indeterministic .car files. This makes me worry about assumptions being made about determinism up the stack that aren’t necessarily true. But everywhere that we use a selector, we can ensure determinism, and if we shift our focus to “determinism via selectors” we won’t find later that we assumed there was a guarantee we don’t actually have.

Tying this all back together, my ideal representation of a “deterministic .car file” is a (selector + car file) in some format. The requirements for determinism should migrate to selectors and we should ensure that current and future selector specifications leave no room for ambiguity on issues like ordering.

If Filecoin, or anyone else, wants a deterministic guarantee they should specify both the selector and the .car file (or some other unique identifier for the .car file like CommP). Which, again, is what we’re already doing in practice.

whyrusleeping commented 4 years ago

@mikeal that makes a large amount of sense to me

Stebalien commented 4 years ago

I agree. The only requirement we have is that there exists a well-defined, deterministic way to go from a selector to a CAR. The CAR format itself doesn't have to require or guarantee any particular order of blocks.

hannahhoward commented 4 years ago

+1 to what @mikeal is saying.

In terms of future work:

Having ways to make selector traversal deterministic is important but probably not a universal requirement for selector traversal for all cases. It seems to me describing some options for deterministic traversal in the selector spec would be helpful, and then we could reference what is used in places where determinism is critical.

For example, I believe the current determinism to be:

depth first
explore lists in ascending index order
explore maps in order of the ordinal value of the fields in the map (tracked internally)

While the above is currently the only traversal type used anywhere, it would be good to have it documented in the selector spec, and referenced in the two places it's vital:

Graphsync (should be referenced in graphsync spec probably)
Writing CARs in the context of Filecoin (should be referenced in FC spec)

rvagg commented 4 years ago

I don't want to derail this conversation but I think it's worth noting so we don't leave this snippet as a hand-wave:

explore maps in order of the ordinal value of the fields in the map (tracked internally)

We dove into this discussion today in our meeting, https://github.com/ipld/team-mgmt/blob/8d612e3ce1222de9dd531ca14bdeeb662bd2a3a1/meeting-notes/2020/2020-01-20--ipld-sync.md#notes - map ordering will likely be defined in our strict DAG-* codecs (dag-cbor, dag-json) but will have to be exposed through the encoding in a way that selectors can make use of, which will be interesting in some languages where you can't control natural map ordering (Go for example, but in Go if you want ordering you need a side-channel anyway so this is all being baked into go-ipld-prime anyway).

The best place for further discussion on this, for now, is probably in https://github.com/ipld/specs/issues/227 where I've noted it (it's collecting some CBOR-specific loose vs strict thoughts).

rvagg commented 4 years ago

I've pushed an update that I hope puts this in a place where it can land, see last commit or read rendered @ https://github.com/ipld/specs/blob/rvagg/car/block-layer/content-addressable-archives.md

Summary:

Noted the previously used "Certified ARchive" name, but kept "Content Addressable aRchive" since the move away from strict determinism makes "certified" harder to justify (I think?).
Added a section on Determinism which suggests how it might be achieved but basically says it's beyond the scope of this spec
Added a collection of "Unresolved Items" at the bottom:
- Zero roots - the spec says "zero or more" roots in the header but then also says that it's not strictly resolved that zero is OK
- Zero blocks - likewise the spec says "zero or more blocks", but that's also not strictly resolved
- CID version - there's discussion above about limiting the CIDs in this format (not inside blocks) to CIDv1, that's unresolved but a possible future part of this spec
- Root CID block existence - it's not yet specified whether encoders or decoders should validate that a CID in the roots array actually shows up in the archive, this is a nontrivial burden that maybe we're not prepared to dictate
- Duplicate blocks - like root CID existence, it's not specified whether encoders or decoders should validate the existence of duplicates

If you feel strongly that one of those unresolved items should be resolved then please speak up, otherwise let's treat this as needing final reviews.

vmx commented 4 years ago

CID version - there's discussion above about limiting the CIDs in this format (not inside blocks) to CIDv1, that's unresolved but a possible future part of this spec

I feel strongly about it as I think restricting it to CIDv1 would make things easier and the spec more precise. I also haven't really seen a reason for not restricting it to CIDv1. Though if that again leads to discussions, it could be added in a subsequent PR.

rvagg commented 4 years ago

re CIDs I'm mostly concerned about it being a one-way operation. If you have a datastore with blocks in it and some are CIDv0, the round-trip through a CAR will mean they all get converted to CIDv1. If they are used as string keys then anything trying to reference a block will have to know that it has to up-convert to CIDv1 to fetch it. Is that going to work? tbh I'm not completely in sync with how this will work in practice and if it'll matter; maybe this up-conversion on look-up is being done actively now anyway?

mikeal commented 4 years ago

If you have a datastore with blocks in it and some are CIDv0, the round-trip through a CAR will mean they all get converted to CIDv1. If they are used as string keys then anything trying to reference a block will have to know that it has to up-convert to CIDv1 to fetch it. Is that going to work? tbh I'm not completely in sync with how this will work in practice and if it'll matter; maybe this up-conversion on look-up is being done actively now anyway?

The actual block data won’t be modified so this should be fine. Keep in mind, this is essentially a decision the CAR format is making for internal representation and implementations can handle translation of older (or perhaps some day newer) CID versions.

Requiring CIDv1 adds some extra work to CAR implementors but once that work is done there’s no externally visible transformation because:

CAR file put() operations will transparently convert v0 to v1.
CAR file get() operations will also transparently convert v0 to v1.

This means that when you read block data that references v0 links you can just pass them right off to the CAR file and it’ll pull them out properly. References inside the block to v0 are fine and will remain unchanged. Any external string references to v0 are also fine because they will be converted on get().

The only caveat is that if you parse the entire CAR file’s CID’s and do your own comparison you’ll need to do the conversion yourself as well.

rvagg commented 4 years ago

Right, my point was that I think existing datastores retain CID version during storage, there's no translation right now is there? For instance, if someone goes to ipfs.io/ipfs/Qm.... then under the hood is it doing a CIDv1 translation of that to look in datastore caches already (same on the client end when doing a DHT query)? If not, then it would seem like the CID bytes are being treated strictly as a key, rather than the content address that it is, so are we ready to be enforcing conversion in all of these other code edges? I'm fine with changing it if others feel strongly enough about it, I just want to check that we're not putting the cart before the horse in doing so.

ribasushi commented 4 years ago

if someone goes to ipfs.io/ipfs/Qm.... then under the hood is it doing a CIDv1 translation of that to look in datastore caches already

I believe implementing such dual-lookup this is the actual plan, but can not find the issue I've seen that...

vmx commented 4 years ago

Right, my point was that I think existing datastores retain CID version during storage, there's no translation right now is there?

There are plans that the datastores use only the Multihash and not the full CID: https://github.com/ipfs/js-ipfs/issues/2415.

mikeal commented 4 years ago

It’s my understanding that part of the transition existing data stores are making from v0 to v1 CID’s is moving the storage layer to use the multihash.

We should consider doing the same in a future version of CAR (for block identifiers, not for root node identifiers) but that is probably too disruptive at the moment.

mikeal commented 4 years ago

I had a sync up with @Stebalien and @jbenet last Friday and came to a few conclusions.

Let’s back out everything that is incompatible with what has already been serialized as CAR and ship it. We need a spec for the CAR files in the wild we might have to deal with so let’s get it written down.
Let’s start on a minor improvement to the format with a major version bump that includes:
- v1 CID internals
- an improved/extended header format that will allow us to include a CBOR selector alongside each root CID. when used, this would establish determinism for the CAR file that can be validated on read.

rvagg commented 4 years ago

I've done some minor cleanups in the latest commit. I've also changed the roots array from being, in essence, "zero or more, but we haven't strictly resolved that" to being "one or more, but we haven't strictly resolved that". And while I was at it, I did the same for the number of blocks, "one or more blocks" must be present.

We're leaning a lot on an "unresolved issues" section here, so we'll want to focus on those things for an iteration of this spec. (@ribasushi also raised the possibility of identity CIDs with zero-length blocks, but I'll leave that out of the for now for further consideration).

PLEASE REVIEW all, this is intended to reflect current state, a future PR can evolve to a version 2.

mikeal commented 4 years ago

Sorry for the last minute addition.

We should note that not all implementations properly support multiple roots (filecoin) and that it is highly recommended you only use a single root in this version of the spec.

rvagg commented 4 years ago

@mikeal: see c901366, I've turned the "zero roots" section into a brief rundown of implementations and usage and a recommendation to stick to a single root.

mikeal commented 4 years ago

looks great, let’s get it merged.

vmx commented 4 years ago

I've no idea why this PR doesn't show any merge. But in case someone is wondering, this got merged with commit https://github.com/ipld/specs/commit/71ef2bd7dd721d70eced971a22c26fa19460340d.

ipld / specs

CAR format spec doc #230