
feat: add code for UCAN ipld codec #264

Closed Gozala closed 2 years ago

Gozala commented 2 years ago

Add a code for UCANs

lidel commented 2 years ago

Drive-by (ignorant) questions:

  • What is the format? Binary JWT without Base64 envelope?
  • Are the existing raw and dag-cbor not enough? What is the use case for a dedicated code?

Gozala commented 2 years ago

Drive-by (ignorant) questions:

Hey @lidel, I should have included more pointers myself; glad you asked.

  • What is the format? Binary JWT without Base64 envelope?

I'm posting links to more in-depth answers, but the short version is:

The codec uses a dual representation: CBOR as the primary and JWT bytes (base64 enveloped) as the secondary. Our library would always produce the CBOR variant, but will be able to interop with any valid UCAN by representing/parsing it in the secondary representation.

https://hackmd.io/@gozala/dag-ucan https://github.com/ipld/js-dag-ucan
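A minimal sketch of the decode side of such a dual-representation codec (the fallback heuristic here is an illustrative assumption, not the actual js-dag-ucan API):

```ts
import * as dagCBOR from '@ipld/dag-cbor'

// Sketch: try the primary (CBOR) representation first; if the bytes are not
// valid dag-cbor, treat them as the secondary representation, i.e. the
// original base64-enveloped JWT string.
export function decode(bytes: Uint8Array): unknown {
  try {
    return dagCBOR.decode(bytes) // primary representation
  } catch {
    return new TextDecoder().decode(bytes) // secondary: raw JWT envelope
  }
}
```

Note that the try/catch heuristic is exactly the ambiguity a dedicated code would remove, which is the point made below.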

  • Are the existing raw and dag-cbor not enough? What is the use case for a dedicated code?

@rvagg asked the same question, so I'm getting the impression that there is a hesitance to add codes for formats that can be represented by existing codecs. If that is the case, it would be good to document the rationale there, as there may very well be a compelling reason not to.

The primary reason for a dedicated code is that we want UCAN CIDs that can be distinguished from arbitrary CBOR without trying to decode them as such, and the flexibility to upgrade the representation as we learn more from using UCANs.

Other than that, the dag-ucan library could very well be another compound CBOR | RAW codec which would enforce a specific CBOR schema. However, the downside would be that we would not be able to tell which type of block we have without trying to parse it first.
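To illustrate the distinguishability argument (the 0x7801 value below is a placeholder, not a real multicodec table entry, and the shape check is an assumption):

```ts
import { CID } from 'multiformats/cid'
import * as dagCBOR from '@ipld/dag-cbor'

// Hypothetical code for illustration only; no entry was actually allocated.
const UCAN_CODE = 0x7801

// With a dedicated code, routing needs only the CID...
function isUcan(cid: CID): boolean {
  return cid.code === UCAN_CODE
}

// ...without one, every dag-cbor block must be decoded and shape-matched.
function looksLikeUcan(bytes: Uint8Array): boolean {
  try {
    const value = dagCBOR.decode(bytes) as Record<string, unknown>
    return value !== null && typeof value === 'object' &&
      'payload' in value && 'signature' in value // assumed UCAN-ish shape
  } catch {
    return false
  }
}
```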

vmx commented 2 years ago

@rvagg asked the same question, so I'm getting the impression that there is a hesitance to add codes for formats that can be represented by existing codecs. If that is the case, it would be good to document the rationale there, as there may very well be a compelling reason not to.

That comes up quite often in multicodec discussions. The question is: what should the "content identifier" be used for? For me it should give a hint on how to decode the data. It should not be about the semantic meaning, or where the data originated from. It's basically the information "how do I get links out of this blob?". There was a quite similar case where I responded in longer form: https://github.com/multiformats/multicodec/issues/204#issuecomment-765302334
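A toy version of that "get links out of this blob" framing: the codec code only picks the decoder, and link extraction is then fully generic (the example CID is a commonly used sample value):

```ts
import { CID } from 'multiformats/cid'
import * as dagCBOR from '@ipld/dag-cbor'

// Generic link extraction over the data model: no knowledge of what the
// data means, only how it decodes.
function* links(value: unknown): Generator<CID> {
  const cid = CID.asCID(value)
  if (cid) {
    yield cid
  } else if (value && typeof value === 'object' && !(value instanceof Uint8Array)) {
    for (const v of Object.values(value)) yield* links(v)
  }
}

// The 0x71 (dag-cbor) code in a CID tells a consumer only this much:
const bytes = dagCBOR.encode({
  prev: CID.parse('bafyreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy'),
})
console.log([...links(dagCBOR.decode(bytes))]) // -> [CID(bafyrei...)]
```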

mikeal commented 2 years ago

The test for a new codec vs dag-cbor should be “does the data roundtrip cleanly through the existing dag-cbor encoder/decoder?” Anything beyond that test gets into very subjective opinions about “what codecs are for.”

My understanding, based on the thread so far, is that dag-ucan does NOT roundtrip cleanly through the existing dag-cbor encoder/decoder because of the dual representation it has for compatibility with JWT based UCANs. So it should get a codec allocation (in a relatively high range).
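Expressed as code, that test might look like this sketch (byte-for-byte comparison is one reading of "cleanly", not a formalised rule):

```ts
import * as dagCBOR from '@ipld/dag-cbor'

// Sketch of the proposed test: bytes "roundtrip cleanly" if decoding and
// re-encoding reproduces them exactly.
function roundtripsCleanly(bytes: Uint8Array): boolean {
  try {
    const reencoded = dagCBOR.encode(dagCBOR.decode(bytes))
    return reencoded.length === bytes.length &&
      reencoded.every((byte, i) => byte === bytes[i])
  } catch {
    return false // not even decodable as dag-cbor, e.g. a raw JWT envelope
  }
}
```

A JWT-form UCAN fails this immediately (the bytes are not CBOR at all), and even the map form can fail if the signed bytes used a different key order or whitespace.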

rvagg commented 2 years ago

“does the data roundtrip cleanly through the existing dag-cbor encoder/decoder?”

I'm not sure that's something we've agreed to or formalised anywhere; and the main problem with this goes back to the long discussions about schemas and ADLs, where they're solving those kinds of issues at a layer (or three) above the codec. Transformations of data shouldn't really be a codec concern in general. Because we've punted on all of that in the JS stack we don't have very good tools to deal with some of these things, but they're starting to mature and be used in production in the Go stack.

I think the discussions we keep on having here about codecs (particularly in the context of the multicodec in CIDs) is more about trying to push back against the use of the codec code as a signalling mechanism for anything other than what function to pass the bytes through to yield data model. Like if Filecoin wanted their own codec entry to say that "these bytes are dag-cbor, but they're specifically used within the Filecoin system". So, in the context of UCAN that might apply if this code is being requested as a mechanism to signal that "these bytes are dag-cbor but will yield a UCAN shaped data structure in the data model". That's not really what CIDs (at least in the limited CIDv1 form) are supposed to do. That kind of signalling is a separate problem that should be solved by other mechanisms within a system (usually that context simply comes from where it's used, e.g. "data linked by this property is always a UCAN" - and schemas help formalise this too).
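As a sketch of that "context comes from where it's used" alternative (the type and field names are invented for illustration):

```ts
import type { CID } from 'multiformats/cid'

// Hypothetical application schema: the knowledge that `proof` always links
// to a UCAN lives in the schema/field, so the linked block can stay plain
// dag-cbor with no dedicated codec code in its CID.
interface Invocation {
  task: CID  // links to arbitrary dag-cbor data
  proof: CID // by convention of this schema, always a UCAN
}
```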

...back to the old discussion: do we want a proliferation of codecs because everyone wants a dedicated code to signal that data belongs to their system even though it's all dag-cbor (or whatever), or are we interested in providing solutions above the codec layer? Opening the door to using the codec code in a CID to signal the specific data structure and use of the data, rather than the basic decoder to be used, is going to lead to a lot more codec requests. Perhaps that's OK, but we're going to have to be OK with solving the set of problems that comes with it, like how we get all those codecs working in our various technologies like go-ipfs, Filecoin and friends (WASM? codec alias lookups? ...?). One of the main drivers behind Schemas (and ADLs) was to shift this problem up a layer.

Gozala commented 2 years ago

I think the discussions we keep on having here about codecs (particularly in the context of the multicodec in CIDs) is more about trying to push back against the use of the codec code as a signalling mechanism for anything other than what function to pass the bytes through to yield data model. Like if Filecoin wanted their own codec entry to say that "these bytes are dag-cbor, but they're specifically used within the Filecoin system". So, in the context of UCAN that might apply if this code is being requested as a mechanism to signal that "these bytes are dag-cbor but will yield a UCAN shaped data structure in the data model". That's not really what CIDs (at least in the limited CIDv1 form) are supposed to do. That kind of signalling is a separate problem that should be solved by other mechanisms within a system (usually that context simply comes from where it's used, e.g. "data linked by this property is always a UCAN" - and schemas help formalise this too)

If I'm understanding correctly, what you're proposing here is that codecs in CIDv1 are basically there to signal the intermediate representation (IR) of the block. Signaling the final representation (at least in JS) will be solved by schemas someday in the future.

This is a reasonable position; however, as far as I can tell it does not address the case where multiple underlying IRs could be used under the hood of the same final representation, which is exactly the case for the dag-ucan library.

I am also somewhat doubtful of the proposition that "context simply comes from where it's used". Our context is that we get CAR files from folks with arbitrary blocks; we could decode every single one and then try to match it against a known set of shapes, but it seems that a cheaper tagging mechanism would be a better option.

For what it's worth, I was tempted to tag the CBOR-encoded bytes with a multicodec prefix instead to get that second layer of signalling, but that would make them non-CBOR. Maybe there could be that kind of second-layer signalling on top of the IR?
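For concreteness, the tempting-but-rejected idea looks like this sketch (an inline varint helper is included so the snippet stands alone; the code value is hypothetical):

```ts
import * as dagCBOR from '@ipld/dag-cbor'

// Unsigned LEB128 varint, as used by multicodec prefixes.
function varintEncode(n: number): Uint8Array {
  const out: number[] = []
  while (n >= 0x80) {
    out.push((n & 0x7f) | 0x80)
    n >>>= 7
  }
  out.push(n)
  return Uint8Array.from(out)
}

// Hypothetical second-layer tag; NOT an allocated multicodec entry.
const UCAN_TAG = 0x7801

// Prefixing the CBOR bytes with a multicodec varint gives in-band
// signalling, but the result is no longer valid CBOR: generic dag-cbor
// tooling can no longer decode the block.
function tag(bytes: Uint8Array): Uint8Array {
  const prefix = varintEncode(UCAN_TAG)
  const out = new Uint8Array(prefix.length + bytes.length)
  out.set(prefix, 0)
  out.set(bytes, prefix.length)
  return out
}

const tagged = tag(dagCBOR.encode({ hello: 'world' }))
```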

aschmahmann commented 2 years ago

Sorry for the long post (and the related comment in https://github.com/multiformats/multicodec/issues/204#issuecomment-1104404093); I hope it's helpful/clarifying. @Gozala my comments and questions are advisory and meant to be of use to you and your project, not to block the allocation of a high-range code. Just trying to help you see the potential landmines along the way and avoid them if easy enough 😄.

If I'm understanding correctly, what you're proposing here is that codecs in CIDv1 are basically there to signal the intermediate representation (IR) of the block. Signaling the final representation (at least in JS) will be solved by schemas someday in the future.

Correct. Quoting from the IPLD specs, https://github.com/ipld/ipld/blob/master/docs/codecs/index.md?plain=1#L10-L11:

"IPLD codecs are functions that transform IPLD Data Model into serialized bytes so you can send and share data, and transform serialized bytes back into IPLD Data Model so you can work with it."

The codec uses a dual representation: CBOR as the primary and JWT bytes (base64 enveloped) as the secondary.

@Gozala perhaps a stupid question: why not propose some codec like jwt-b64 (or jwt-ucan) that allows you to represent the JWT bytes in the IPLD data model, to make it easier to work with both forms? You'd then have two codecs for your data, jwt-ucan and dag-cbor, which look the same to the data model and basically identical to any IPLD data model tooling.
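A sketch of what that suggested jwt-ucan codec could look like (the name, the 0x7801 code, and the returned shape are all assumptions; nothing here was ever allocated or specified):

```ts
// Sketch of a hypothetical `jwt-ucan` codec: it lifts the raw JWT envelope
// into data model form without re-serializing anything, so signatures are
// untouched. Uses Node's Buffer for base64url handling.
export const name = 'jwt-ucan'
export const code = 0x7801 // hypothetical placeholder

export function decode(bytes: Uint8Array) {
  const jwt = Buffer.from(bytes).toString()
  const [header, payload, signature] = jwt.split('.')
  return {
    header: JSON.parse(Buffer.from(header, 'base64url').toString()),
    payload: JSON.parse(Buffer.from(payload, 'base64url').toString()),
    signature: new Uint8Array(Buffer.from(signature, 'base64url')),
    jwt, // keep the exact envelope so encode() can reproduce the bytes
  }
}

export function encode(value: { jwt: string }): Uint8Array {
  return new Uint8Array(Buffer.from(value.jwt)) // exact byte roundtrip
}
```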

Describing your setup as having two IRs where one of the "IRs" is just base64-encoded bytes feels wrong; it's not really an IR at all, but the base serialized representation. It seems like doing this and then performing validation on top (e.g. using schemas, but whatever works for you) would be straightforward.

This is reasonable position however as far as I can tell it does not address case where multiple underlying IRs could be used under the hood of the same final representation, which is exactly the case for dag-ucan library.

Correct, there are other slots in the IPLD stack where such layering could be appropriate. Some examples include:

Note: if I understood how UCANs work correctly, then having a jwt-ucan codec is still better than ADLs, because the raw serialized bytes are not a useful data model form to work with.

Gozala commented 2 years ago

@Gozala perhaps a stupid question: why not propose some codec like jwt-b64 (or jwt-ucan) that allows you to represent the JWT bytes in the IPLD data model, to make it easier to work with both forms? You'd then have two codecs for your data, jwt-ucan and dag-cbor, which look the same to the data model and basically identical to any IPLD data model tooling.

That is more or less what the implementation does; it is effectively two codecs composed. However, not every UCAN can be represented in dag-cbor transparently, as key order and even whitespace affect signatures. That is why the implementation uses dag-cbor by default and falls back to the JWT version if a specific UCAN can't be represented in dag-cbor (doing so would result in a wrong signature).

That is to say, when you do UCAN.parse(jwt_string), the data model coming out will be different depending on how jwt_string was formatted. If the JSON keys were ordered and no whitespace was present, the model will contain only the actual values; if that was not the case, the model is basically raw bytes.
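A sketch of that formatting sensitivity (a simplification: the real library compares against the signed bytes, while this only tests whether parse/stringify reproduces a segment):

```ts
// A JSON segment survives transparent re-serialization only if
// stringify(parse(segment)) reproduces it byte for byte; extra whitespace
// or reordered keys in the original break this, and with it the signature.
function survivesReserialization(segmentB64url: string): boolean {
  const json = Buffer.from(segmentB64url, 'base64url').toString()
  return JSON.stringify(JSON.parse(json)) === json
}

// Compact payload: can be stored as structured CBOR.
const compact = Buffer.from('{"iss":"did:example:alice"}').toString('base64url')
// Same claims with whitespace: must be kept as raw bytes instead.
const spaced = Buffer.from('{ "iss": "did:example:alice" }').toString('base64url')

console.log(survivesReserialization(compact)) // true
console.log(survivesReserialization(spaced))  // false
```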

Gozala commented 2 years ago

Describing your setup as having two IRs where one of the "IRs" is just base64-encoded bytes feels wrong; it's not really an IR at all, but the base serialized representation.

I am not sure what would be a more accurate way to describe this, but broadly speaking there are two representations: one that retains whitespace, key order, quote types, etc., and another that does not. How those two are manifested in practice is probably less important, although I'm happy to learn better ways.

Gozala commented 2 years ago

if you go the ADL route please report back whatever you decide because it may serve as a good case study for others 🙏.

After discussing this yesterday, I went back and changed the implementation to make it an ADL that:

  1. Uses CBOR encoding with a specific schema when possible.
  2. Falls back to RAW encoding when CBOR encoding would lead to signature changes.

The library still provides a codec interface, but it will encode/decode into one or the other representation. Additionally, it provides a specialized CBOR codec that basically enforces the schema, and a RAW codec which mostly just provides a UCAN-specific view of the underlying byte array.
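A sketch of that compound behaviour (the `Ucan` shape and `canonical` flag are stand-ins for whatever the implementation actually tracks):

```ts
import * as dagCBOR from '@ipld/dag-cbor'
import * as raw from 'multiformats/codecs/raw'

// Stand-in shape for a parsed UCAN; not the actual js-dag-ucan types.
interface Ucan {
  model: Record<string, unknown> // structured header/payload/signature
  jwt: Uint8Array                // original JWT envelope bytes
  canonical: boolean             // true if CBOR re-encoding preserves the signature
}

// Encode through dag-cbor when possible, fall back to raw JWT bytes when
// re-encoding would change what was signed. The block's CID code then
// reflects which representation was used.
function encode(ucan: Ucan): { code: number; bytes: Uint8Array } {
  return ucan.canonical
    ? { code: dagCBOR.code, bytes: dagCBOR.encode(ucan.model) }
    : { code: raw.code, bytes: raw.encode(ucan.jwt) }
}
```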

Overall I think this ended up being an improvement, but here are the pros and cons as I see them.

Gozala commented 2 years ago
  • IIUC (perhaps very incorrectly) this is what was described here https://hackmd.io/@gozala/dag-ucan since codes are returned in the last slide as either dag-ucan or raw but processed through a single wrapper library.

I actually went back and forth on this, so at some point in time I used the same code and at other times different ones. This is actually what swayed me toward trying the ADL route.

Gozala commented 2 years ago

When doing this you may want to ask yourself some questions like: am I OK with making an ADL called ucan that is able to process data from some arbitrary codec (e.g. DAG-JSON) as long as it complies with my data model, or do I want to restrict ucan to only work with DAG-CBOR and Raw? Both are doable and you'll have to decide what's best for your use case.

This is an interesting point. I think there is no real reason why the current implementation needs to be tied to CBOR; it could use DAG-JSON just the same.
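A sketch of loosening that coupling (the factory name is hypothetical):

```ts
import * as dagCBOR from '@ipld/dag-cbor'
import * as dagJSON from '@ipld/dag-json'

// Minimal slice of a data-model codec that the UCAN layer actually needs.
interface DataModelCodec {
  code: number
  encode(value: any): Uint8Array
  decode(bytes: Uint8Array): any
}

// Hypothetical factory: the UCAN logic is written once against the data
// model, and the concrete codec is injected.
function createUcanCodec(base: DataModelCodec) {
  return {
    code: base.code,
    encode: (model: Record<string, unknown>) => base.encode(model),
    decode: (bytes: Uint8Array) => base.decode(bytes),
  }
}

const overCbor = createUcanCodec(dagCBOR)
const overJson = createUcanCodec(dagJSON)
```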

Gozala commented 2 years ago

Closing since I ended up going with the ADL route instead.

oed commented 2 years ago

@Gozala Do you have an example of what the ADL approach for this could look like?

Gozala commented 2 years ago

@Gozala Do you have an example of what the ADL approach for this could look like?

@oed the README at https://github.com/ipld/js-dag-ucan attempts to describe it, although I'm not sure if this is an ADL in classical terms (which I think are fairly loosely defined). This is how I've described it elsewhere:

What I ended up with is probably not an ADL. It's more of a view or a lens over the block, so kind of like a lazy codec + schema thing. I have a RAW JWT UCAN view and a codec-agnostic UCAN view, both implementing the same interface but differently.

I said they're probably not ADLs in classical terms because they can't be made codec agnostic; here is some context on why: https://github.com/ipld/js-dag-ucan/issues/23
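A rough sketch of that "two views, one interface" idea (all names are hypothetical, and the fields are trimmed to two for brevity):

```ts
// One read interface, two backings: a view over the parsed JWT envelope
// and a view over the structured data model.
interface UcanView {
  issuer(): string
  audience(): string
}

// Lazy view over a raw JWT block: claims come from the decoded payload.
function jwtView(payload: { iss: string; aud: string }): UcanView {
  return { issuer: () => payload.iss, audience: () => payload.aud }
}

// View over a structured (e.g. CBOR) block: same interface, different shape.
function cborView(model: { issuer: string; audience: string }): UcanView {
  return { issuer: () => model.issuer, audience: () => model.audience }
}
```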