multiformat code for CARs

mikeal commented 3 years ago

In nft/web3.storage we accept a lot of CAR files, and I’m starting to worry that we aren’t doing enough to prepare for future version upgrades of the CAR format.

Since we routinely produce multiple CAR files for partial DAGs we end up with a lot of CAR files with the same CID. As a result, we’ve taken to writing the multihash of the CAR file into the database where we track each upload along with the root CID.

It would be much nicer if we could replace the CAR multihash with a CID that included the CAR version (we’d still store the root CID separately). I can see in the multiformats table that we have entries for some of the extensions we’ve done within the CAR format but not for each complete version of the CAR format. Any objections to adding them?

Often we assume that, if it’s a CID, there’s a reasonable max size for the payload. This would obviously violate that assumption, but the codec is a usable signal to adjust behavior and people do need to hash CAR files so this seems like something we should do.

willscott commented 3 years ago

Seems like a reasonable format / type of CID. agree we should have carv1 and carv2 codecs for when we talk about the multihash of a car file.

aschmahmann commented 3 years ago

I don't understand the value add here. How is this different from someone asking for a code for MP4, AVI, jpeg, BMP, zip, ... file formats?

There are magic bytes at the beginning of the CAR file that tell you what it is (as with various other formats). If you have a database only storing the hashes you can just augment with a signal, either way when the data is fetched the user can know what it is.
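To make the "magic bytes" point concrete, here is a minimal sketch of sniffing the CAR version from the first bytes of a file. It assumes the fixed 11-byte CARv2 pragma and a DAG-CBOR CARv1 header that carries `"version": 1`; the function names are hypothetical, and the CARv1 check is a byte-pattern heuristic rather than a real CBOR parse.

```python
from typing import Optional, Tuple

# Fixed 11-byte pragma that opens a CARv2 file: a varint length (0x0a)
# followed by the CBOR encoding of {"version": 2}.
CARV2_PRAGMA = bytes.fromhex("0aa16776657273696f6e02")


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]  # raises IndexError if the buffer is too short
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def sniff_car_version(head: bytes) -> Optional[int]:
    """Guess the CAR version from leading bytes (heuristic, not a full parse)."""
    if head.startswith(CARV2_PRAGMA):
        return 2
    try:
        header_len, pos = read_varint(head)
    except IndexError:
        return None
    header = head[pos:pos + header_len]
    # Assumption: a CARv1 header is a small CBOR map containing "version": 1.
    # 0x67 = text string of length 7, 0x01 = unsigned integer 1.
    if header[:1] in (b"\xa1", b"\xa2") and b"\x67version\x01" in header:
        return 1
    return None
```

A database that stores only hashes would still need a separate column for this signal, which is the trade-off being discussed here.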

Generally, this proposal sounds like #4 unless I'm missing something.

rvagg commented 3 years ago

^ Agreed, this is basically the same as the mime types proposals - which we've been mostly favourable toward adding, we just haven't actually pulled the trigger on it (https://github.com/multiformats/multicodec/pull/159 is close).

For all of these general file types, we can consider them to be "codecs", in the same way that a filename extension tells you what's in the file and how to open (decode) it. It's a bit uncomfortable if we try and conceive of that strictly within the bounds of what we think of as "IPFS" (where large files as UnixFS would get chunked and the root of a chunked DAG is not going to be the same multihash as the whole thing, nor would it be able to use this multicodec code). But the table is meant to be useful well beyond those bounds, and we're even pushing the definition of CIDs beyond comfortable boundaries (like with the special Filecoin CIDs, or even identity CIDs). I'm personally comfortable with the notion that just because you have a valid CID doesn't mean you can do anything useful with it inside an IPFS, or even arbitrary IPLD, context. It's just a Content IDentifier that's useful for something within your specific system, separate from the actual contents (which I could inspect for magic bytes if I had it in this, and many other cases).

Specifically on CAR mime types:

Maybe we need to push that forward as well as mime types in the multicodec table?

mikeal commented 2 years ago

There seems to be an assumption that these will correspond to IPLD codecs, and they most likely won’t ;)

Multiformats is bigger than just IPLD codecs. We have a thing, that we hash, and we’d like to reference that thing we’re hashing with an identifier. It’s a pretty obvious case for a multiformat :)

rvagg commented 2 years ago

an assumption that these will correspond to IPLD codecs

This is not necessarily a bad assumption when you're using it for a "codec code" in a CID; the primary purpose of that field is to look up the IPLD codec. You're just suggesting abusing that field for other purposes, which is where the disagreement hinges I think - do we care about how users abuse these things for their own purposes or do we try and retain some amount of purity? :shrug: I'm pretty relaxed about it, but I'm also fine with making our own tooling hostile to abuses of the standards like this if you show up with your own funky use and expect everything to work.

aschmahmann commented 2 years ago

There seems to be an assumption that these will correspond to IPLD codecs, and they most likely won’t ;)

How did you get here? I was pretty directly comparing these to MIME types and that we should treat them similarly.

It would be much nicer if we could replace the CAR multihash with a CID

Why do you want to use a CID here instead of <code><multihash>? What do you get out of it being a CID?

Multiformats is bigger than just IPLD codecs

IIUC the main point of putting all the codes in a single table instead of having a multiaddr, multihash, namespace, ipld-codec, mimetype, ... set of tables was so that figuring out what a code meant became easier since there are fewer collisions that can only be resolved through context. So when adding a new code it seems fair game to ask "what is it, what's it for and what category does it belong to". So how does this fit in? It seems an awful lot like a mimetype or just a random file format to me and should get handled similarly.

Another note on having purposes for codes: Perhaps I'm an outlier here, but even the dual purposing of 0x0200 as the IPLD JSON codec and the serialization type seems a bit sketchy to me, although it's not too bad because IPLD codecs can be considered a type of serialization format it still requires more context to disambiguate (e.g. am I expecting an IPLD codec or an arbitrary serialization type in this field).

vmx commented 2 years ago

Why do you want to use a CID here instead of <code><multihash>? What do you get out of it being a CID?

I find this a very good point. For me one of the central points of a CID is that the multicodec code before the multihash is an IPLD Codec (I know that the spec isn't clear about that, I hope it will be some day). If it is not an IPLD Codec, it's not a CID anymore. Though of course you can have a thing that could be parsed like a CID. But instead you could just use <code><multihash>, which would even be shorter.

mikeal commented 2 years ago

Why do you want to use a CID here instead of <code><multihash>? What do you get out of it being a CID?

You’re right. I’m sold.

rvagg commented 2 years ago

FYI https://github.com/multiformats/multicodec/pull/258 adds a car entry

Gozala commented 2 years ago

I have updated the title of this thread, hope @mikeal does not mind that. And I suggest we continue discussing several threads from https://github.com/multiformats/multicodec/pull/258 in here.

Gozala commented 2 years ago

Quoting @lidel here from the PR

To clarify why I asked, the use case I have in mind is a convention where raw and car codecs are used on an HTTP Gateway as a way of requesting a single Block or a CAR with blocks for a DAG.

HTTP GET /ipfs/{cid-with-raw-codec} returning a raw Block

HTTP GET /ipfs/{cid-with-car-codec} returning a CAR with the entire DAG behind a CID
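A rough sketch of how a client could derive both request paths from one root-block multihash under this convention. It assumes CIDv1 binary layout (version, codec, multihash as varints), base32 lowercase multibase with the `b` prefix, and the `raw` (0x55) and `car` (0x0202) codes; `cid_v1_string` and `gateway_paths` are hypothetical helper names, not part of any gateway API.

```python
import base64
import hashlib

RAW = 0x55    # multicodec code for "raw"
CAR = 0x0202  # multicodec code for "car"


def encode_varint(n: int) -> bytes:
    """Encode an unsigned integer as an LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def cid_v1_string(codec: int, multihash: bytes) -> str:
    """Build a base32 (multibase 'b') CIDv1 string from a codec and a multihash."""
    cid = encode_varint(1) + encode_varint(codec) + multihash
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")


def gateway_paths(root_multihash: bytes) -> dict:
    """Same multihash, two codecs: one path per response type."""
    return {
        "block": "/ipfs/" + cid_v1_string(RAW, root_multihash),
        "dag": "/ipfs/" + cid_v1_string(CAR, root_multihash),
    }


# Example: sha2-256 multihash (code 0x12, length 0x20) of some root block bytes.
root_multihash = b"\x12\x20" + hashlib.sha256(b"root block bytes").digest()
paths = gateway_paths(root_multihash)
```

Note that in this reading the multihash identifies the root block, not the bytes of the CAR file, which is exactly the ambiguity raised in this comment.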

In this convention the multihash in a CID represents the root block of a DAG, and if you plan to use car with a multihash that has different meaning, we should agree on that now.

And my response here

I love the idea of making the gateway capable of exporting DAGs, but I am concerned about overloading the CID codec here because:

CAR may not cover whole DAG (it may contain only subset of nodes)

It may contain nodes from multiple unrelated DAGs.

Same DAG can be represented by different CARs.

More broadly I think it is a mistake to think of CAR as DAG serialization format. Thinking of it as block set serialization seems a lot more accurate to me.
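The "block set" framing can be illustrated with a toy example: the same two blocks written in a different order produce different bytes, and therefore a different file hash. This is only a simplified section framing (length varint, CID, data) under assumed names, not a complete CAR encoder; it omits the header entirely.

```python
import hashlib


def encode_varint(n: int) -> bytes:
    """Encode an unsigned integer as an LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def raw_cid(data: bytes) -> bytes:
    """Binary CIDv1 with the raw codec (0x55) and a sha2-256 multihash."""
    return b"\x01\x55\x12\x20" + hashlib.sha256(data).digest()


def section(data: bytes) -> bytes:
    """One CAR-style section: varint(len(cid + data)) followed by cid + data."""
    body = raw_cid(data) + data
    return encode_varint(len(body)) + body


def block_stream(blocks: list) -> bytes:
    """Concatenate sections in the given order (header omitted in this toy)."""
    return b"".join(section(b) for b in blocks)


order_a = [b"one", b"two"]
order_b = list(reversed(order_a))

stream_a = block_stream(order_a)
stream_b = block_stream(order_b)

digest_a = hashlib.sha256(stream_a).hexdigest()
digest_b = hashlib.sha256(stream_b).hexdigest()
```

Both streams carry the identical set of blocks, yet `digest_a` and `digest_b` differ, so a hash of the file identifies one particular serialization rather than the DAG.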

In regards to how we want to use it.

We want to generate a CAR multihash by hashing the bytes of the file (e.g. with sha256 and tagging accordingly) and then tag that multihash with the CAR code. If we tag it with a CID version we'd get a CID in a more traditional sense, but again I'm not prepared to have a debate on whether we should identify large things (greater than block size limit) with CIDs or not.
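The tagging described here can be sketched as follows. This assumes sha2-256 (multihash code 0x12), the `car` code 0x0202, and varint-prefixed concatenation; the helper names are made up for illustration.

```python
import hashlib

SHA2_256 = 0x12  # multihash code for sha2-256
CAR = 0x0202     # multicodec code for "car"


def encode_varint(n: int) -> bytes:
    """Encode an unsigned integer as an LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def sha256_multihash(data: bytes) -> bytes:
    """<hash-code><digest-length><digest> for sha2-256."""
    digest = hashlib.sha256(data).digest()
    return encode_varint(SHA2_256) + encode_varint(len(digest)) + digest


def tagged_multihash(code: int, multihash: bytes) -> bytes:
    """<code><multihash>: the shorter, non-CID form discussed in this thread."""
    return encode_varint(code) + multihash


def cid_v1(code: int, multihash: bytes) -> bytes:
    """<version><code><multihash>: the same bytes with a CID version in front."""
    return encode_varint(1) + tagged_multihash(code, multihash)


car_bytes = b"stand-in for the bytes of a CAR file"
mh = sha256_multihash(car_bytes)
tagged = tagged_multihash(CAR, mh)
cid = cid_v1(CAR, mh)
```

With these assumptions `tagged` is 36 bytes and `cid` is 37: the only difference between the two forms is the leading version byte.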

Gozala commented 2 years ago

@lidel @rvagg just to clarify, our current intention is to do what @aschmahmann suggested and what @mikeal agreed

Why do you want to use a CID here instead of <code><multihash>? What do you get out of it being a CID?

You’re right. I’m sold.

That said, I personally think there is a value in turning them into CIDs and packaging CARs as IPLD codecs. That way they can have 1st class representation in IPLD. However that would mean we can have CIDs for things that are greater in size than current block size limit. Maybe that is ok, because you'd know it from the CID.

I'd love to hear your opinions in that regard though.

rvagg commented 2 years ago

My guess is that objections to CID come either from disagreements that this is an "IPLD format", or that this is a CID for a thing larger than a happy libp2p block size (although I've only seen this one from you @gozala, so maybe it's not a live objection?).

I'm on the side of agreeing that CAR can fit the definition of an "IPLD format"—partly because this isn't a definition that we've ratified, but mainly because it's a binary encode format that maps nicely to the data model and also has links. It's even more of an IPLD format than JSON or CBOR, which we've tagged "serialization" in the table, because of the links thing. It's a restricted format, like dag-pb, in that it has a fixed schema, but it yields a map ({version:1,roots:[...]}) and an array of CID:bytes pairs. You can further pass those bytes through another IPLD decoder to yield additional data model forms, so it's similar to the way that UnixFS sits over dag-pb, where UnixFS is a format that we could additionally define as an IPLD format.
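One possible shape for that view of a CAR, written as plain data classes. This is a sketch of a single mapping among many (no such mapping is specified in this thread); the class names, `cid_codec`, and `decode_blocks` are all hypothetical, and the decoders are supplied by the caller.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class CarHeader:
    version: int
    roots: List[bytes]  # binary CIDs


@dataclass
class CarModel:
    header: CarHeader
    blocks: List[Tuple[bytes, bytes]]  # (binary CID, block bytes) pairs


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def cid_codec(cid: bytes) -> int:
    """Extract the codec code from a binary CIDv1."""
    version, pos = read_varint(cid)
    if version != 1:
        raise ValueError("only binary CIDv1 is handled in this sketch")
    codec, _ = read_varint(cid, pos)
    return codec


def decode_blocks(car: CarModel, decoders: Dict[int, Callable[[bytes], Any]]) -> List[Any]:
    """Pass each block's bytes through the decoder registered for its CID codec."""
    return [decoders[cid_codec(cid)](data) for cid, data in car.blocks]


# Usage with placeholder multihashes (all-zero digests, for illustration only).
placeholder_mh = b"\x12\x20" + bytes(32)
json_cid = b"\x01\x80\x04" + placeholder_mh  # codec 0x0200 (json)
raw_cid = b"\x01\x55" + placeholder_mh       # codec 0x55 (raw)

car = CarModel(
    header=CarHeader(version=1, roots=[json_cid]),
    blocks=[(json_cid, b'{"a": 1}'), (raw_cid, b"bytes")],
)
decoded = decode_blocks(car, {0x0200: json.loads, 0x55: lambda b: b})
```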

But anyway, that's all just to say that I'd not have a problem with CIDs being used to define CARs. We have the CID spec so we can append codec codes to multihashes, so why not use it? A CID is just a versioned <code><multihash>.

There may be a valid concern of people treating these things in the wrong way over libp2p. But what are they going to do? Request them from the DHT and not get an answer? Even if you made one small enough, go-ipfs and js-ipfs won't know what to do with this codec code anyway (although it's interesting to consider the possibility of adding support at that layer ...).

aschmahmann commented 2 years ago

I don't have a particular objection to this being an IPLD codec since basically anything can be an IPLD codec if it can turn bytes into the IPLD Data model (ideally in a way more sophisticated than just an array of bytes 😄).

A couple questions I have here, which have been raised about json and cbor as well, but are more obvious here are:

1) In order to do this there should be a spec for turning the bytes of a CAR file into an IPLD data model representation. I guess we could leave this unspec'd/TODO but that feels like an oversight.
   a) Note: there are an infinite number of ways that one could do this mapping, although maybe only a few sane ones, the idea here is to just pick one
   b) Is it weird to have a single code for all CAR formats, given that it already has multiple versions and may gain more over time? It seems strange that some software could think it supports 0x0202 but in fact does not because a new CAR variant has been released
2) My understanding is that the point of putting all the codes into a single table instead of having lots of smaller tables was so that it was more obvious what any particular format was, since it'd save some bytes if we had separate tables for multihash, IPLD codecs, MIME types, namespaces, ... Given this should the same code identifier be reused to indicate both a generic serialization form and a specific rendering of the format into IPLD? No strong feelings on my part, but want to make sure we acknowledge what we're doing here.

These questions, and the stated purpose of the request for a code, made me think this looks basically like a MIME type request and should've gotten treated as such. If folks want to use IPLD codecs for CAR files though then no objections from me as long as we have some answers to the above.

Side comment about block size limits and IPLD - IMO they have nothing to do with libp2p and they are instead related to the feasibility of incrementally verifiable downloads of large blocks of data. I've collected some thoughts on block limits here, which I'd love feedback on.

Gozala commented 2 years ago

In order to do this there should be a spec for turning the bytes of a CAR file into an IPLD data model representation. I guess we could leave this unspec'd/TODO but that feels like an oversight. a) Note: there are an infinite number of ways that one could do this mapping, although maybe only a few sane ones, the idea here is to just pick one

That is one of the reasons why I felt classifying it as "serialization" made more sense right now. I suspect we'll get a better idea of what we'd like the IPLD data model to look like sometime in the future. I thought that once we have a spec, that is when it would make the most sense to propose changing the classification to "IPLD Codec".

I also suspect we may find that, instead of a general CAR codec, we define a codec for a more constrained CAR variant, in which case we may be more interested in defining an IPLD codec for that as opposed to a CAR codec.

b) Is it weird to have a single code for all CAR formats, given that it already has multiple versions and may gain more over time? It seems strange that some software could think it supports `0x0202` but in fact does not because a new CAR variant has been released

My argument had been that we could have a version-agnostic code along with versioned ones. That way software can choose to support arbitrary CARs and decode with the appropriate version based on the CAR header, or can choose to support a specific version.

So far our use cases mostly had been around "block sets" where CAR version seemed to not matter as long as we can ingest the blocks. That is also why I'm biased towards version agnostic code. If I'm missing something crucial please call out.

My understanding is that the point of putting all the codes into a single table instead of having lots of smaller tables was so that it was more obvious what any particular format was, since it'd save some bytes if we had separate tables for multihash, IPLD codecs, MIME types, namespaces, ... Given this should the same code identifier be reused to indicate both a generic serialization form and a specific rendering of the format into IPLD? No strong feelings on my part, but want to make sure we acknowledge what we're doing here.

This seems ok to me. I think it's reasonable to have an overlap in some cases and not in others. I'm not sure about the CAR-specific case, but if we end up with a canonical IPLD model for CAR I think it would be reasonable. I think it is also reasonable to have alternative IPLD models for CAR with different codes.

These questions, and the stated purpose of the request for a code, made me think this looks basically like a MIME type request and should've gotten treated as such. If folks want to use IPLD codecs for CAR files though then no objections from me as long as we have some answers to the above.

I hope my answers above provided some clarity on this. More concretely, I think we want to evaluate CAR as an IPLD codec, but do not want to claim an IPLD Codec code until we have more confidence, a spec, and a prototype of what that might look like.

It seemed reasonable to recognize CAR as a serialization format and evaluate other ideas from there. If you really feel like it should be designated as "ipld" as opposed to "serialization", I'm happy to send another PR to change that.

lidel commented 1 year ago

I see we have multiple codes now(?)

car (0x0202)
car-index-sorted (0x0400)
car-multihash-index-sorted (0x0401)

Can this be closed?

rvagg commented 1 year ago

yeah, should have been closed with https://github.com/multiformats/multicodec/pull/258

multiformats / multicodec

multiformat code for CARs #239