multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License

feat: add code for car serialization format #258

Closed Gozala closed 2 years ago

Gozala commented 2 years ago

Add a code for CARs so that in .storage services we could tag multihashes with

rvagg commented 2 years ago

oh, and do we need a v1 and v2 here? we can differentiate once we get the bytes, but do we need to know up front where multicodecs get used?

Gozala commented 2 years ago

oh, and do we need a v1 and v2 here? we can differentiate once we get the bytes, but do we need to know up front where multicodecs get used?

For our use cases that seems irrelevant, as long as we can identify the version from the bytes.

I suggest we go with a generic car code, and if we find that capturing the version is important we can add version-specific entries as well.
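A minimal sketch of identifying the version from the bytes, assuming the fixed 11-byte pragma that the CARv2 specification places at the start of a file. The function name `sniff_car_version` is hypothetical, and falling back to version 1 is a simplification: a real reader would decode the CARv1 header and check its version field.

```python
# Fixed 11-byte pragma that opens a CARv2 file (assumption taken from the CARv2 spec):
# varint length 0x0a, then the CBOR map {"version": 2}.
CARV2_PRAGMA = bytes.fromhex("0aa16776657273696f6e02")


def sniff_car_version(data: bytes) -> int:
    """Return 2 if the bytes start with the CARv2 pragma, otherwise assume 1."""
    if data[: len(CARV2_PRAGMA)] == CARV2_PRAGMA:
        return 2
    # Simplification: anything else is treated as CARv1 without decoding its header.
    return 1
```

With a check like this, a single generic code is enough to say "these bytes are a CAR", and the version is recovered after the fact.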

vmx commented 2 years ago

If this code is intended to be used in a CID as multicodec-content-type (this is what the spec currently calls it), then it should be ipld and not serialization. I think there is agreement that only IPLD formats should be there and we should update the CID spec to make that clear.

Gozala commented 2 years ago

If this code is intended to be used in a CID as multicodec-content-type (this is what the spec currently calls it), then it should be ipld and not serialization. I think there is agreement that only IPLD formats should be there and we should update the CID spec to make that clear.

What's the point of that table column if the only value allowed is “ipld”?

I suggest we start with “serialization” because that is the fact today. If we end up turning it into a codec and using it in CIDs, we can update that column to reflect that fact.

vmx commented 2 years ago

What's the point of that table column if the only value allowed is “ipld”?

The Multicodec Table is a table that is not related to CIDs. It's just a list of things that map to certain numbers. The column is there to make sense of what such a number is used for, e.g. for a Multihash, or for IPLD Codecs that can then be used in CIDs.
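A rough sketch of that idea, modelling a few table rows as name, tag, and code. The `Entry` type and the `codes_with_tag` helper are hypothetical, and the value `0x0202` for `car` is an assumption, since this thread does not state the number being registered.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    tag: str   # what the number is used for
    code: int


# A small excerpt in the spirit of the multicodec table.
TABLE = [
    Entry("sha2-256", "multihash", 0x12),
    Entry("raw", "ipld", 0x55),
    Entry("dag-cbor", "ipld", 0x71),
    Entry("car", "serialization", 0x0202),  # assumed value, not stated in this thread
]


def codes_with_tag(tag: str) -> dict[str, int]:
    """Return the name-to-code mapping for every entry carrying the given tag."""
    return {entry.name: entry.code for entry in TABLE if entry.tag == tag}
```

Under this reading, a CID implementation would only accept codes from `codes_with_tag("ipld")`, while the table itself stays free to list other kinds of entries.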

Gozala commented 2 years ago

The Multicodec Table is a table that is not related to CIDs.

I have misunderstood what you were referring to with “there” in your previous comment.

Does my suggestion of starting with “serialization” to reflect the fact today, and updating it as necessary in the future, make sense?

Gozala commented 2 years ago

Can I go ahead and merge this? Or do we still have some disagreements to resolve?

rvagg commented 2 years ago

Re serialization and ipld:

If we end up turning it into a codec and using it in CIDs, we can update that column to reflect that fact.

I'm OK with this as a position if it's not going to be used for CIDs (a good way to think about this column might be something like: "does the decoder yield IPLD links?", and a CAR decoder does in fact yield links). But this raises the question of what this is being used for if not CIDs? Continuing from #239, I think most of us are assuming that's what this would be for. But apparently not?

So back to the original ask:

Add a code for CARs so that in .storage services we could identify them by multihash

How does this help you identify by multihash? Presumably you're going to hash the bytes and the digest from that gives you the multihash. What do you need the additional identifier for if not to make CIDs?

This is not a blocker btw; I think this can be merged, but the nuances might dictate needing to change that type column. I'm currently imagining this being a little like the CAR index format codes, 0x0400 and 0x0401, which are just unique identifiers for a single thing among a group of related things. I'm assuming that .storage services have a need for uniquely identifying a CAR as a thing among a group of related things, but I'm not sure what that would be, if not the same use-case as CIDs.

lidel commented 2 years ago

To clarify why I asked, the use case I have in mind is a convention where the raw and car codecs are used on the HTTP Gateway as a way of requesting a single Block or a CAR with blocks for a DAG.

In this convention the multihash in a CID represents the root block of a DAG, and if you plan to use car with a multihash that has a different meaning, we should agree on that now.

Gozala commented 2 years ago

How is that multihash generated?

I messed up when I said "we could identify them by multihash", because as you've all pointed out it's not really a multihash and I'm not sure we have a term for it. We want to generate a multihash for a CAR and tag it with this code.

It is true that it sounds like a CID; maybe it should be a CID. Yet I really want to avoid the debate of whether it is a good idea to identify things larger than the libp2p block size limit with CIDs. There are tradeoffs there and I'm not sure we're prepared to evaluate them yet.

I do think however that we can all agree on the fact that CAR is an established serialization format which can have its own code.

I think we'll be in a better position to debate whether CAR as an IPLD codec is a good idea after we've had a chance to evaluate that in our work. And only once we're convinced that it's the right choice can we discuss the tradeoffs and update the table field if we choose to.

Gozala commented 2 years ago

In this convention the multihash in a CID represents the root block of a DAG, and if you plan to use car with a multihash that has a different meaning, we should agree on that now.

I love the idea of making the gateway capable of exporting DAGs, but I am concerned about overloading the CID codec here because:

  1. A CAR may not cover the whole DAG (it may contain only a subset of nodes).
  2. It may contain nodes from multiple unrelated DAGs.
  3. The same DAG can be represented by different CARs.

More broadly, I think it is a mistake to think of CAR as a DAG serialization format. Thinking of it as a block set serialization seems a lot more accurate to me.

In regard to how we want to use it:
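A small illustration of the third point, assuming a toy framing rather than the real CAR encoding: the hypothetical `toy_archive` helper just length-prefixes each block. The same set of blocks written in a different order produces different bytes, and therefore a different digest.

```python
import hashlib


def toy_archive(blocks: list[bytes]) -> bytes:
    """Hypothetical stand-in for CAR framing: one length byte, then the block."""
    return b"".join(bytes([len(block)]) + block for block in blocks)


blocks = [b"block-a", b"block-b", b"block-c"]

# Two archives holding the same blocks in a different order.
archive_one = toy_archive(blocks)
archive_two = toy_archive(list(reversed(blocks)))

digest_one = hashlib.sha256(archive_one).hexdigest()
digest_two = hashlib.sha256(archive_two).hexdigest()

# The block sets are equal even though the archive bytes (and digests) differ.
same_block_set = sorted(blocks) == sorted(reversed(blocks))
```

So a hash over the archive bytes identifies one particular serialization of a block set, not the DAG those blocks may belong to.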

We want to generate a CAR multihash by hashing the bytes of the file (e.g. with sha256, tagged accordingly) and then tag that multihash with the CAR code. If we also tag it with a CID version we'd get a CID in the more traditional sense, but again I'm not prepared to have a debate on whether we should identify large things (greater than the block size limit) with CIDs or not.
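A minimal sketch of those steps, assuming sha2-256 (multihash code `0x12`) and an assumed `car` code of `0x0202`, which this thread does not state. The helper names are hypothetical, and the output is raw bytes with no multibase encoding.

```python
import hashlib

SHA2_256 = 0x12  # multihash code for sha2-256
CAR = 0x0202     # assumed multicodec code for car (not stated in this thread)
CID_V1 = 0x01    # CID version 1


def varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def car_multihash(car_bytes: bytes) -> bytes:
    """Hash the whole file and wrap the digest as <hash code><length><digest>."""
    digest = hashlib.sha256(car_bytes).digest()
    return varint(SHA2_256) + varint(len(digest)) + digest


def tag_with_car(multihash: bytes) -> bytes:
    """Prefix the multihash with the car code."""
    return varint(CAR) + multihash


def as_cid_v1(tagged: bytes) -> bytes:
    """Adding the version prefix yields the bytes of a CIDv1."""
    return varint(CID_V1) + tagged
```

The only difference between the tagged multihash and a CID in this sketch is the one-byte version prefix, which is why the two are easy to conflate.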

Gozala commented 2 years ago

I'm going to merge this given approvals and comments suggesting no blockers here. Happy to carry on related discussions at https://github.com/multiformats/multicodec/issues/239 instead