multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License
340 stars 205 forks source link

propose car-metadata multicodec #334

Closed willscott closed 1 year ago

willscott commented 1 year ago

When transporting car data, it can be useful to insert a block of data with key/value metadata signalling inline with the car data blocks.

This PR allocates a codec car-metadata to signal such data as blocks within the stream of car blocks.

willscott commented 1 year ago

cc @lidel @hannahhoward @bajtos

aschmahmann commented 1 year ago

This is a bad idea.

  1. You are using the codec as nominative typing rather than structural. While there's no definition or link to what the codec is, which is also standard here, AFAIK this is just nominative typing on some CBOR. See the discussions in https://github.com/multiformats/multicodec/issues/204 for more information about the types of IPLD codecs that have been accepted and rejected here historically.
  2. The particular thing this nominative type is being used for (again not defined here, but something I understand from reading some Notion docs and Slack messages) is fixing a deficiency in the CAR format by mixing the control plane and data plane which has received heavy pushback from stakeholders. Seems like this effectively is a change to the CAR format anyhow since it'll be expected across CAR implementations so why not actually change the format and go through the process of working with stakeholders of that format.
bajtos commented 1 year ago

Thank you @aschmahmann for a quick response and the pointer to the previous discussion about what qualifies as a codec.

The particular thing this nominative type is being used for (again not defined here, but something I understand from reading some Notion docs and Slack messages) is fixing a deficiency in the CAR format by mixing the control plane and data plane which has received heavy pushback from stakeholders. Seems like this effectively is a change to the CAR format anyhow since it'll be expected across CAR implementations so why not actually change the format and go through the process of working with stakeholders of that format.

In our use case, we want to append an extra metadata block at the end of the CARv1 file to provide a kind of checksum: byte length of the CAR file, hash of the CAR bytes, and signature of the previous two fields. Do you consider such metadata as a control plane too?

rvagg commented 1 year ago

Yeah, I'm not really a fan either, for the same reasons @aschmahmann outlined (and as per all the previous discussions on this stuff).

Would it not be enough to just put a dag-cbor block at the end with a strongly typed schema that you can check?

(as dag-json)

{
  "car-metadata/v1": {
    "properties": {
      "byte-length": 18390,
      "multihash": {
        "bytes": {
          "/": "..."
        }
      }
    }
    "signature": {
      "bytes": {
        "/": "..."
      }
    }
  },
}

(schema)

type CarMetadata union {
  Metadata "car-metadata/v1"
} representation keyed

type Metadata struct {
  properties CarProperties
  signature Signature
}

type Properties struct {
  byteLength Int (rename "byte-length")
  multihash Bytes
}

type Signature ...

This was one of the original design goals of IPLD schemas - they should be fast and efficient enough to do these kinds of checks.

There's also the "messaging" hack I prototyped for CARv2: https://github.com/ipld/go-car/pull/322 & https://github.com/ipld/js-car/pull/89 after a similar request for this kind of thing came up from the web3.storage team who wanted to put some metadata inside of their CARs. It was never used, and it fits in before the data payload, which may not be ideal - but since a CARv2 wraps a CARv1, given a 2-step process it might make even more sense than this since you're providing metadata about a thing within the thing itself, which is not so ideal since the metadata needs to be excluded from the thing somehow—and we'd have to assume that there'd be some layer that would strip this cleanly so it doesn't end up making a mess of other layers of the stack where such a codec code would make a mess of things.

rvagg commented 1 year ago

I'm sure @lidel can pull up some record of the discussions about putting a tombstone at the end of trustless gateway CAR responses, those are probably relevant here and maybe back on the table?

If this becomes part of the protocol, then having a new codec code in CIDs is going to cause a bit of havoc for users—doing a simple curl of the CAR will leave you with a CAR that has a block you can't decode without having this codec in your software's registry. A similar problem applies for a dag-cbor (or other common codec) tombstone block, but it's a bit more manageable - you can at least get your content out without your parser/checker borking at the block, you could easily import it into a Kubo node, parse it in your browser, without having to have this codec registered; you just have a weird block to deal with, which could be a problem in some scenarios.

But having said all that, if we do go with a tombstone, then a couple of extra thoughts:

willscott commented 1 year ago
willscott commented 1 year ago

I think the next step is to propose a spec+code change in go-car on the specific semantics for reading/writing a car stream against a network with support for metadata blocks.

I would appreciate pointers on other spec impact or parts of this project that should also be considered as we proceed

rvagg commented 1 year ago

Back to the basic request here of adding a new serialization/codec to the table - we can't do that where you're asking for a duplicate of an existing serialization method. We've got ample history of rejecting such additions where someone just wants a code to represent a codec+schema, which is what this is. So as long as this is, at base, cbor, json, or some other existing unschema'd codec, then it can't go in. Schema signalling via codec code is a door that we've tried hard to keep closed, pushing people to keep that level of signalling out of the CID. The demand for such signalling tells us that there is a hole in the stack that people want utilities for, but that's where discussions about CIDv2 (and other methods) come in; jamming it into CIDv1 is going to make things much harder to manage.

willscott commented 1 year ago
bajtos commented 1 year ago

In https://github.com/ipfs/specs/pull/431, we pivoted to a different solution that does not require any new multicodec.

I think this PR can be closed now.