ipld / specs

Content-addressed, authenticated, immutable data structures
Other
592 stars 108 forks source link

feat: add dag-jose format #269

Closed oed closed 3 years ago

oed commented 4 years ago

This adds a specification for dag-jose which is a format for signed and encrypted JSON.

The dag-jose codec will allow JWS and JWE data structures to represent IPLD dag nodes. JWS stores the payload encoded as base64url which makes it impossible to follow links in signed data without first decoding the payload. The implementation of this codec should automatically decode the payload of JWS object so that links can be followed. JWE objects can't be read before they are encrypted, however the decrypted ciphertext should still be able to contain links. The implementation of this codec should include a decryption function that outputs an object with valid IPLD links.

Please let me know if you have any questions or if anything could be clarified!

oed commented 4 years ago

@mikeal In the case of the protected header it would totally make sense to decode the base64url encoded string however. This would make it easier for developers to interact with the data.

Since the implementation needs to bundle decryption and verification functionality anyway it might make sense to decode all things to binary anyway. I think the most important thing to support is that any existing JOSE object should be possible to serialize using the codec, but that seems separate.

Will think some more about this and make another update!

mikeal commented 4 years ago

ordering

It might be easier to just inherit the ordering rules from dag-json rather than calling out a specific ordering of each property in the format. Is the ordering here part of the JOSE spec or was this just for dag-jose?

mikeal commented 4 years ago

Since this codec requires data to be encoded in a very specific data model representation I think we need an IPLD Schema for it, similar to what we have for dag-pb.

mikeal commented 4 years ago

I think the most important thing to support is that any existing JOSE object should be possible to serialize using the codec, but that seems separate.

I should clarify a bit what needs to be in the spec vs what is an implementation decision.

I take “any existing JOSE object” to mean “objects returned from a JOSE signing/encryption library.” Those objects are going to vary a bit in terms of structure between libraries and especially between languages. Codec implementations are free to decide what they do with these inputs and that exact behavior doesn’t need to be part of this spec.

All that we really need specified are the rules for writing out the binary representation of the Block and how that Block data is then decoded into IPLD Data Model. Codecs are free to decide how to take any given input and turn it into the format specified here.

For instance, the IPLD Schema for this codec should specify Bytes kinds for all the base64url usage. It should then specify that the proper representation of Bytes in this codec is a base64url string encoding. In fact, you may want to call out that dag-jose DOES NOT use the dag-json binary representation of { ‘/‘: { bytes: multibaseString } }.

oed commented 4 years ago

It might be easier to just inherit the ordering rules from dag-json rather than calling out a specific ordering of each property in the format. Is the ordering here part of the JOSE spec or was this just for dag-jose?

Good point. I'll make this change.

Since this codec requires data to be encoded in a very specific data model representation I think we need an IPLD Schema for it, similar to what we have for dag-pb.

Makes sense. I'm not super familiar with IPLD Schema. Any pointers except for dag-pb?

I take “any existing JOSE object” to mean “objects returned from a JOSE signing/encryption library.” Those objects are going to vary a bit in terms of structure between libraries and especially between languages. Codec implementations are free to decide what they do with these inputs and that exact behavior doesn’t need to be part of this spec.

Ah, this clarifies your point about having the payload be a separate object and link to it. Let me think about it a bit more!

vmx commented 4 years ago

I'm not super familiar with IPLD Schema. Any pointers except for dag-pb?

https://specs.ipld.io/schemas/

rvagg commented 4 years ago

There's an authoring guide for schemas if you follow @vmx' link but there's also a smattering of examples in this repo. Do a git grep -- ipldsch and look outside of the schemas/ directory (that content's at that URL). There's also some being used in my WIP bitcoin docs to describe the shape of data https://github.com/ipld/specs/pull/270. What you're describing here should be simple enough that you shouldn't need to reach for very complex descriptors. It's just really one big top-level struct containing primitives isn't it? The payload is nested, obviously (I think maybe that should be an Any, which is not strictly valid Schema language yet but it might have to do for now), but aside from that I think you just have simple properties hanging off a a single container.

oed commented 4 years ago

Just updated the spec to be much more clear and well defined. Hopefully it should be much easier to interpret now. Also added IPLD schemas to describe the data structures. Not sure I'm using union correctly however.

rvagg commented 4 years ago

A couple of more general points of feedback:

warpfork commented 4 years ago

Mostly on a similar page to @rvagg but less worried about a couple things:

oed commented 4 years ago

A question before I go though and update the spec based on our discussion @mikeal @vmx @rvagg

Right now the data that is put into the block is just serialized by stringifying the json data. Does it make sense to actually serialize this as CBOR instead? This would be a bit more space efficient and would not affect the input data of the encode function or the output of the decode function.

carsonfarmer commented 4 years ago

This seems like a win win to me.

JonasKruckenberg commented 3 years ago

I've spent a bit of time working on a javascript implementation ( I did not publish anything yet though ) and I have thought on the payload thingy:

I think we should be sticking as closely to the RFC as possible, without defining "weird" ipld quirks and edge cases, especially when it comes to the payload, I'd say implementations should be free to support specific "cty" mime types such as "DAG-JSON" and apply the correct encoding and decoding as they see fit, for example, an implementation might see a JWE/JWS with the "cty" header "application/DAG-JSON" and decode it for the user ( i.e. decoding json-ified CIDs etc. ). I another case they might see a "cty" header with the value "application/mp4" or a missing "cty" header, here they should just return the raw representation + the content type so the application can handle the rest. We should ideally not impose any restrictions on the payload field, and just keeping it as base64url encoded bytes IMO.

Right now the data that is put into the block is just serialized by stringifying the json data. Does it make sense to actually serialize this as CBOR instead? This would be a bit more space efficient and would not affect the input data of the encode function or the output of the decode function.

My quick benchmark showed that for the given JWE/JWS structure CBOR reduces the size by only 20 Bytes while being up to 5x slower than JSON ( I used borc and fast-json-stringify I don't think it's worth it for the overall dag node, but we could ( and should?) recommend the use of CBOR when the payload is an arbitrary object. ( And then setting the "cty" header to "application/cbor" of course )

oed commented 3 years ago

Hey @JonasKruckenberg! We've already got a working javascript implementation here: https://github.com/ceramicnetwork/js-dag-jose

Unfortunately this spec is out of date (we will be updating it shortly). Main thing to note is that we've limited the payload to only be allowed to be a CID.

Also, 20 bytes can become quite a lot if you have a lot of objects.

JonasKruckenberg commented 3 years ago

Hey! Thanks for the quick reply, it's not working for me sadly, that's why I'm working on my own. Maybe I can look into dag-jose at some point to figure out whats wrong and create an issue.

Also I don't think it should be limited to just CIDs. To give you a bit of context: Im currently working on my own append-only CRDT on ipfs similar to OrbitDBs ipfs-log. I have Dag nodes that point to arbitrary data( just like the approach you settled on ) but also point to the previous node in the merkle clock. Having to put this information into it's own block and only wrapping its CID in the JWE would make any append action take 3 or 4 ipld requests which would be brutal. Im sure this problem would be shared by almost anyone implementing more complex datatypes via ipld. Or am I missing something? 🤔

JonasKruckenberg commented 3 years ago

Also, 20 bytes can become quite a lot if you have a lot of objects.

You're right about this one yeah, I didn't think about having a lot yet. Maybe the speed problem should be addressed at the library level

oed commented 3 years ago

@JonasKruckenberg Yeah, we are doing the same on Ceramic. The syncing issue can be mitigated in multiple ways, including DagSync in the future. A hack you could do to get around the only CID "issue" is to use an identity multihash (and thus store the whole content in the CID), this is what we will do for encryption for example.

Please file an issue on the js-dag-jose repo if you are having trouble with it!

Also note that we probably gain more than 20 bytes since we encode the base64urls to Buffers before we pass the object to borc.

alexjg commented 3 years ago

Hey folks, I've been working on the Go implementation of this. I've updated the spec to match current implementations. Specifically I've made changes that:

rvagg commented 3 years ago

Use Bytes instead of String {String:Any} for authenticated headers

Maybe vaguely relates to a conversation @warpfork and @willscott have been having about some funky Bytes<->String map keys that Filecoin has been doing. https://github.com/alexjg/go-dag-jose/blob/4ef144c0cad71149125cd2b5393b70cf6f35db7c/dagjose/dagjose.go#L88

I don't really know how this moves forward, but the fact that you have a go-ipld-prime implementation based on this suggests to me that it's most likely doing the "correct" things since it should be much more unforgiving than what you could do in JavaScript.

oed commented 3 years ago

Use Bytes instead of String {String:Any} for authenticated headers

I'm a little confused by this statement now. Protected headers (authenticated) were encoded as base64url strings before, now we convert that to bytes for the encoded format. Not sure what {String:Any} does here? That's what the unprotected headers use right?

alexjg commented 3 years ago

There is a typo in that comment. It should read "use Bytes instead of {String:Any} for authenticated headers"

To clarify, the original schema looked like this (just including the signature, but it's the same for anything with an authenticated header):

type EncodedSignature struct {
   header optional {String:Any}
   protected optional {String:Any}
   signature Bytes
 }

And I've changed that to

type EncodedSignature struct {
  header optional {String:Any}
  protected optional Bytes
  signature Bytes
}

So you can see that the protected header is now just Bytes rather than {String:Any}.

oed commented 3 years ago

Makes sense, thanks for clarifying!

oed commented 3 years ago

Hey all, anyone had a chance to look at this updated spec?

mikeal commented 3 years ago

LGTM except for one comment I’d like @warpfork to weigh in on.

warpfork commented 3 years ago

I left one more comment which I'm afraid is fairly blocking -- but other than that, the rest of this looks good to me.

Wanna disclaim that I don't have a deep understanding of JOSE itself (and I'm definitely assuming any cryptographic semantics have been reviewed by the JOSE specifiers, and I'm not checking that in the slightest), but what I can do is review for whether this spec seems to explain it to me well enough to work with, and I think the answer to that (modulo comment) is... "yes". So that's good enough for me!

oed commented 3 years ago

Pushed an update that removes the union @warpfork 🙏 Seems like the way to distinguish the two different objects was already described:

" ... implementors should follow section 9 of the JWE spec and distinguish between these two branches ... "

warpfork commented 3 years ago

Ah, I missed that. Mea culpa, thanks for the knock upside the head.

...(after reading)... Uuf, I have to confess I don't actually like the JWE spec here.

All these methods will yield the same result for all legal input values; they may yield different results for malformed inputs.

(Emphasis mine.)

I'm not a fan of that kind of ambiguity, because ISTM it's a pretty serious bug-breeder-reactor. But... if that's the IETF RFC, hokay. I'm not gonna argue.

vmx commented 3 years ago

I had a good chat with @oed about inline CIDs in regardst to this spec. To me the usage of inline CIDs should be discouraged (TODO a proper write up why). With IPLD Schemas we have other ways to solve the some of the problems that inline CIDs solve. We talked about JWS as well es JWE, so I split this comment into two parts as although have very similar solutions are still distinct problems.

JWS

The IPLD Schema currently contains:

type DecodedJWS struct {
  payload String
  signatures [DecodedSignature]
  link: &Any
}

And if you want to inline data, you could use an inline CID. I propose instead using a kinded union for that:

type DecodedJWS struct {
  payload String
  signatures [DecodedSignature]
  link: CidOrInline
}
type CidOrInline union {
  | Cid &Any
  | Inline map
} representation kinded

type Inline struct {
  // The IPLD Codec code
  int Codec
  data bytes
} representation tuple

An inline CID gives you an indication (via the IPLD Codec of the CID).how the inline data should be decoded. We don't want to lose that functionality. Would an IPLD Path traversing this structure still be the same as with an inline CID?

Is that better than in inline CID ? Or does an inline CID exactly serve this purpose (which is really very similar, except that it contains 2 varints more that are not strictly needed)? Or should we have an DIC (direct inline content) instead which is specified as <ipld codec><data-length><data> which supports transparent traversals?

JWE

JWE is encrypting data which might be a CID or inlined. You want to be able to pad the data with arbitrary bytes for cryptographic purpose. You could specify the Inline part the same way as above for JWS.

Your decoder (let's expect a binary encoding that is using varints for now) would need to look at the first varint to determine whether the encrypted data is a CID or whether it was inlined. It leads to the same questions as for JWS, whether this is better than an inline CID.

Identified problems

Inline CIDs support transparent pathing even over different IPLD codecs. Can we provide something similar with the current Schemas?

oed commented 3 years ago

Thanks @vmx and everyone involved. It's great having this merged 🎉

JonasKruckenberg commented 3 years ago

Great news! Just a quick thought however, since this codec standard has potentially far reaching consequences if IPFS is to actually take off and be used by the masses we should put A LOT of care into getting this right. So my suggestion for a future revision of this and any other specification where it is applicable ( e.g. dag-cose ) is to require a Security Considerations section roughly following RFC 3552. Each specification should outline what implementers should watch out for and guidance to create secure implementations. Again great work @oed 👍 but as this will ( hopefully ) become the core foundation of future data privacy we have to be extremely careful with everything that we do. As for specifics with dag-jose we can discuss this in a new issue, just wanted point this out! Cheers!