ipld / specs

Content-addressed, authenticated, immutable data structures

New codecs for JOSE and COSE #251

Open oed opened 4 years ago

oed commented 4 years ago

At 3Box we've been looking into how best to encode signed objects in IPLD. In the DID community, which we are quite familiar with, one direction with a lot of weight is using JOSE and COSE. These standards propose a way of encoding both signed and encrypted objects. These formats can be used together with IPLD to create signed and/or encrypted dag objects.

The purpose of this issue is to get feedback on the direction and to check interest in something like this.

I propose that we introduce two new IPLD codecs: dag-jose and dag-cose.

Below I'll describe how this would work for dag-jose, but it should work quite similarly for dag-cose.

dag-jose

JOSE stands for JSON Object Signing and Encryption and is a family of standards that includes JWS, JWT, JWE, etc. It encodes the payload along with a header and signature, or a header and encrypted payload, in a JSON-based format. Usually a JWT is encoded in the compact format: eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJKb2huIERvZSJ9.mnBcKK-9setCco03NtYws-RMlYXP3LGlDu2RUB7vetQ but it can also be encoded as "flattened" JSON:

{
  payload: 'eyJzdWIiOiJKb2huIERvZSJ9',
  protected: 'eyJhbGciOiJIUzI1NiJ9',
  signature: 'mnBcKK-9setCco03NtYws-RMlYXP3LGlDu2RUB7vetQ'
}
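As a rough illustration of how the two encodings relate, here is a minimal Node.js sketch that splits the compact token above into the flattened form and decodes the header and payload. The function names compactToFlattened and decodeJsonPart are made up for this example, and it assumes a Node version whose Buffer supports the 'base64url' encoding.

```javascript
// Split a compact JWS ("header.payload.signature") into the flattened
// JSON serialization. The three parts stay base64url-encoded.
function compactToFlattened (compact) {
  const parts = compact.split('.')
  if (parts.length !== 3) {
    throw new Error('expected three dot-separated parts')
  }
  const [protectedHeader, payload, signature] = parts
  return { payload, protected: protectedHeader, signature }
}

// Decode a base64url string that holds UTF-8 encoded JSON.
function decodeJsonPart (b64url) {
  return JSON.parse(Buffer.from(b64url, 'base64url').toString('utf8'))
}

const jws = compactToFlattened(
  'eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJKb2huIERvZSJ9.mnBcKK-9setCco03NtYws-RMlYXP3LGlDu2RUB7vetQ'
)
console.log(decodeJsonPart(jws.protected)) // { alg: 'HS256' }
console.log(decodeJsonPart(jws.payload)) // { sub: 'John Doe' }
```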

This flattened JSON structure could just be stored in dag-cbor and decoded and verified later. However, this makes it impossible to automatically follow any IPLD links that might exist within the payload, since the IPLD resolver would have no way to interpret the Base64 encoded data and understand that there is a link within it. Because of this a dag-jose codec is needed. This would allow the IPLD resolver to include decoding functionality and automatically follow links within the signed payload, which is helpful when following a linked list of signed messages or when you want to retrieve an object that a signed object refers to.
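To make the link problem concrete, here is a small sketch. It assumes, purely for illustration, that the signed payload writes links the way dag-json does ({ "/": "<cid>" }); findLinks is a hypothetical helper and the CID string is a placeholder, not a real CID.

```javascript
// Collect values shaped like dag-json links: { "/": "<cid string>" }.
function findLinks (value, path = '', found = []) {
  if (value === null || typeof value !== 'object') return found
  const keys = Object.keys(value)
  if (!Array.isArray(value) && keys.length === 1 && typeof value['/'] === 'string') {
    found.push({ path, cid: value['/'] })
    return found
  }
  for (const key of keys) findLinks(value[key], `${path}/${key}`, found)
  return found
}

const toB64url = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url')

// What a generic codec sees: three opaque strings.
const stored = {
  payload: toB64url({ msg: 'hello', prev: { '/': 'bafy-placeholder-cid' } }),
  protected: toB64url({ alg: 'EdDSA' }),
  signature: 'placeholder'
}

// No links are visible until the payload itself is decoded.
const opaqueLinks = findLinks(stored)
const decodedPayload = JSON.parse(Buffer.from(stored.payload, 'base64url').toString('utf8'))
const visibleLinks = findLinks(decodedPayload)

console.log(opaqueLinks.length) // 0
console.log(visibleLinks) // [ { path: '/prev', cid: 'bafy-placeholder-cid' } ]
```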

In js-ipld this could be implemented something like this:

const signedObj = await ipfs.dag.get(cid)

Here the signedObj would be an instance of the JoseNode class. This class has a method for accessing the decoded payload, and one for verifying the signature by passing a public key.

We could also resolve linked nodes like this:

const result = await ipfs.dag.get(cid + '/path/to/data')

For JWEs, encrypted objects, it wouldn't necessarily be possible to automatically decrypt objects if the ipld resolver doesn't know the private key needed to do this. Instead we would have to call a method with a decryption key in order to get the payload of the node.
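A minimal sketch of what that explicit decryption step could look like, using only Node's built-in crypto module. It assumes the simplest JWE setup (alg "dir" with enc "A256GCM", a shared 32-byte key) rather than the algorithms discussed further down, and the function names are invented for the example.

```javascript
const { createCipheriv, createDecipheriv, randomBytes } = require('node:crypto')

const b64url = (data) => Buffer.from(data).toString('base64url')

// Encrypt with a shared 32-byte key ("dir" + A256GCM). Per RFC 7516 the
// additional authenticated data is the ASCII form of the protected header.
function encryptFlattened (plaintext, key) {
  const protectedHeader = b64url(JSON.stringify({ alg: 'dir', enc: 'A256GCM' }))
  const iv = randomBytes(12)
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  cipher.setAAD(Buffer.from(protectedHeader, 'ascii'))
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()])
  return {
    protected: protectedHeader,
    iv: b64url(iv),
    ciphertext: b64url(ciphertext),
    tag: b64url(cipher.getAuthTag())
  }
}

// Without the key the node stays opaque; with it we get the payload bytes.
function decryptFlattened (jwe, key) {
  const decipher = createDecipheriv('aes-256-gcm', key, Buffer.from(jwe.iv, 'base64url'))
  decipher.setAAD(Buffer.from(jwe.protected, 'ascii'))
  decipher.setAuthTag(Buffer.from(jwe.tag, 'base64url'))
  return Buffer.concat([
    decipher.update(Buffer.from(jwe.ciphertext, 'base64url')),
    decipher.final() // throws if the key or tag does not match
  ])
}

const sharedKey = randomBytes(32)
const jwe = encryptFlattened(Buffer.from('{"sub":"John Doe"}'), sharedKey)
console.log(decryptFlattened(jwe, sharedKey).toString('utf8'))
```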

JoseNode

The JoseNode class exposes different methods depending on if the object is a JWS or JWE.

JWS

joseNode.verify(publicKey)

JWE

joseNode.decrypt(secretKey)

This class would also have a few different properties.

joseNode.payload

joseNode.protected

joseNode.header

joseNode.isEncrypted // to check if we need to call "decrypt"
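Put together, a JoseNode could look roughly like the sketch below. This is not an existing js-ipld class. It only handles JWS objects signed with EdDSA (Ed25519) via Node's crypto module, which is an assumption chosen to keep the example self-contained, and decrypt is left out.

```javascript
const { generateKeyPairSync, sign, verify: verifySignature } = require('node:crypto')

const decodeJson = (b64url) => JSON.parse(Buffer.from(b64url, 'base64url').toString('utf8'))

// Hypothetical JoseNode wrapping a flattened JWS or JWE object.
class JoseNode {
  constructor (jose) {
    this.jose = jose
  }

  // A flattened JWE carries a ciphertext member; a JWS carries a payload.
  get isEncrypted () { return typeof this.jose.ciphertext === 'string' }

  get protected () { return decodeJson(this.jose.protected) }

  get header () { return this.jose.header }

  get payload () {
    if (this.isEncrypted) throw new Error('encrypted node: decrypt it first')
    return decodeJson(this.jose.payload)
  }

  // The JWS signing input is ASCII(protected) + '.' + ASCII(payload).
  verify (publicKey) {
    if (this.isEncrypted) throw new Error('not a JWS')
    if (this.protected.alg !== 'EdDSA') throw new Error('unsupported alg in this sketch')
    const input = Buffer.from(`${this.jose.protected}.${this.jose.payload}`, 'ascii')
    return verifySignature(null, input, publicKey, Buffer.from(this.jose.signature, 'base64url'))
  }
}

// Build a signed object to exercise the class.
const { publicKey, privateKey } = generateKeyPairSync('ed25519')
const encode = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url')
const protectedPart = encode({ alg: 'EdDSA' })
const payloadPart = encode({ sub: 'John Doe' })
const signature = sign(null, Buffer.from(`${protectedPart}.${payloadPart}`, 'ascii'), privateKey)
  .toString('base64url')

const joseNode = new JoseNode({ payload: payloadPart, protected: protectedPart, signature })
console.log(joseNode.payload, joseNode.verify(publicKey))
```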

Implementation

For JavaScript the most prominent implementation is https://github.com/panva/jose/, which supports all types of JOSE objects. For Go I've found https://github.com/square/go-jose but there might be better libraries out there.

Algorithm support

For 3Box use cases we are looking to support secp256k1 for signed objects (JWS, JWT), and ed25519 using XChaCha20-Poly1305 for encryption. Currently XChaCha20-Poly1305 is not supported by the JavaScript jose library linked above, so a modified implementation needs to be created, possibly as an extension to this library.

A note on encrypted objects

For now I've left out the consideration of having dag objects being automatically encrypted. I know that there have been discussions about this within the IPLD community and that there might be some clever ways to do this. Any input on this would be very much appreciated.

mikeal commented 4 years ago

Is the serialized form valid JSON or is it a new format?

If it’s valid JSON then these would not be new codecs.

rvagg commented 4 years ago

I imagine though that the desire would be for the IPLD data model form of decoded data to not just be the ciphertext but the actual decrypted data that's within the container? So "just use dag-json" is probably not an ideal UX. It seems to me though that this kind of encoding would be a perfect use-case for advanced layouts, so it's handled above the encoding layer but presents to the user as the decoded form. We've talked about this specific use-case a number of times and maybe this container format is a good one to start with?

mikeal commented 4 years ago

I imagine though that the desire would be for the IPLD data model form of decoded data to not just be the ciphertext but the actual decrypted data that's within the container? So "just use dag-json" is probably not an ideal UX. It seems to me though that this kind of encoding would be a perfect use-case for advanced layouts, so it's handled above the encoding layer but presents to the user as the decoded form. We've talked about this specific use-case a number of times and maybe this container format is a good one to start with?

I see the question of “is this a codec?” a bit more basic than all of this though.

I’m going to split the signing and encryption use cases apart because they have very different requirements and touch different parts of our stack, but before I do there’s some really basic stuff we should get more explicit about.

Right now, the codec is a signal for only one thing. That may change, we may want it to change in order to do this stuff well, but I don’t think it’s debatable that today the codec only signals one thing. The codec signals how to translate bytes into Data Model and Data Model into bytes. That’s it.

Signing

For signing, it’s unclear to me what a new codec would give someone beyond what the existing codecs, including the non-dag codecs (json and cbor), already offer. When you are working with signed data you need to first know if the data includes a signature or not, and other than validating the signature matches there isn’t much generic work that can be done, it has to be handed off to some kind of application. In other words, after IPLD we might say “this is signed by this public key and it validates” there still needs to be an application layer that decides what to do about that. Are we meant to reject certain data from certain public keys? Are we meant to look up this identity and provide some kind of additional information? Does this associate with a permission of some kind?

None of this really belongs in IPLD, if it’s just data then IPLD should decode the data and hand it to the application in order to contextualize it just like any other information. This also means that, unlike the encryption case below, you can’t provide a level of transparent traversal through the nodes because the signing info you’d be shedding is actually needed by the application layer.

This may sound like I’m putting distance between IPLD and this ecosystem but this does quite the opposite. There are surely all kinds of tools for validating these signatures and providing APIs for identity and permission to application developers, and what they all will accept as input is the raw data, which we pass off and don’t mess with or try to hide. It ensures maximal compatibility by just staying out of the way.

Encryption

Encryption is a bit different. We’ve been building an ecosystem of tools for IPLD and all of them need to read data. If the data is encrypted then you can’t use any of these tools, and that’s a big problem.

In order to solve this we’ll need some sort of signal to say “this data is encrypted with this public key.” Our tools will then need to incorporate some sort of decryption service that can retrieve the necessary private keys and decrypt the data.

I’m not convinced that the signal for this should be the codec. I’m open to it, but I haven’t seen a great argument for it given some of the problems. One reason I’m not that sold is that something like JOSE/COSE wouldn’t actually be usable.

Here’s the problem. The codec is the only signal IPLD has for looking up a codec to decode the data. If we say “this is encoded as JOSE” then IPLD can decode the JSON and pass off the information inside of it to a decryption service. Once it decrypts the data it now has a bunch of decrypted bytes, what does it do with them? We actually lost the codec signal for the decrypted bytes so we don’t know what codec can turn the decrypted bytes into Data Model.

So now we’re going to have to add some non-standard IPLD specific property to JOSE, in which case what was the point of using the existing standard just to add non-standard properties to it? And we’re going to do this for every encryption envelope format?

And yet, if we don’t have the codec as a signal, what do we have?

One option, which I’m not a fan of, is to just require explicit signaling by the user. So, if you’ve got a selector that traverses through some encrypted blocks, you tell it which are encrypted and the selector engine knows how to work with it. I’m not a fan of this because the usability is horrendous and it puts a huge cost on encrypting data, something we should be trying to decrease as much as possible.

Another option is to extend CID, which would probably mean a CIDv2. There’s a few approaches we can take here.

If we want to rely on our own encryption standard, we could write it in IPLD Schema and when you see a CIDv2 you know “this data is encrypted with the IPLD Encryption Schema.” We talked about doing something like this for composites a while back but it was all shelved until WASM is more mature.

If we want to leverage existing standards like JOSE/COSE we actually would use a multiformat codec, but we’d extend CID to have two codecs, one for the encryption envelope and the other for the decrypted data. Only issue I see is that the codec would be visible in plaintext and there may be a security reason why we’d want to hide that. Still, you could do the scuttlebutt thing and hide that behind a replication key with another envelope.

And finally, if we don’t want to extend CID we could use the first few bytes of any decrypted data for the multiformat codec. This would hide the codec from plain view but this inserts highly predictable bytes to the front of any encrypted data, which is, ya, not great.
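For that last option, a minimal sketch of what tagging the plaintext could look like: the multicodec code is written as an unsigned varint in front of the bytes before encryption and read back after decryption. The codes are taken from the multicodec table; tagPlaintext and untagPlaintext are hypothetical names, and the encryption step itself is omitted.

```javascript
// Multicodec codes for a few codecs (from the multicodec table).
const CODECS = { raw: 0x55, 'dag-cbor': 0x71, 'dag-json': 0x0129 }

// Unsigned varint (LEB128), the integer encoding multiformats use.
function encodeVarint (n) {
  const bytes = []
  while (n >= 0x80) {
    bytes.push((n & 0x7f) | 0x80)
    n >>>= 7
  }
  bytes.push(n)
  return Buffer.from(bytes)
}

function decodeVarint (buf) {
  let value = 0
  let shift = 0
  for (let i = 0; i < buf.length; i++) {
    value += (buf[i] & 0x7f) * 2 ** shift
    if ((buf[i] & 0x80) === 0) return { value, length: i + 1 }
    shift += 7
  }
  throw new Error('truncated varint')
}

// Applied to the plaintext before it is encrypted.
function tagPlaintext (codec, bytes) {
  if (!(codec in CODECS)) throw new Error(`unknown codec ${codec}`)
  return Buffer.concat([encodeVarint(CODECS[codec]), bytes])
}

// Applied after decryption to recover the codec for the remaining bytes.
function untagPlaintext (buf) {
  const { value, length } = decodeVarint(buf)
  const codec = Object.keys(CODECS).find((name) => CODECS[name] === value)
  if (!codec) throw new Error(`unknown codec code 0x${value.toString(16)}`)
  return { codec, bytes: buf.subarray(length) }
}

const tagged = tagPlaintext('dag-json', Buffer.from('{"hello":"world"}'))
console.log(tagged.subarray(0, 2)) // <Buffer a9 02>
console.log(untagPlaintext(tagged).codec) // dag-json
```

The first two bytes here are the same for every dag-json block, which is exactly the predictability concern raised above.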

Lots of options, I’m not sure where the best tradeoffs are, but I don’t think adding a codec alone is going to get anyone what they really want here.

oed commented 4 years ago

Thanks for the detailed replies!

Is the serialized form valid JSON or is it a new format?

Forgot to add above:

JOSE: JSON Object Signing and Encryption

COSE: CBOR Object Signing and Encryption

So these are standards for signing and encrypting JSON and CBOR objects. In the case of JOSE the payload is a base64 encoded JSON object when signed.


It seems to me though that this kind of encoding would be a perfect use-case for advanced layouts

I'm not familiar with these advanced layouts. Are there any examples of something using that right now?

Signing

This also means that, unlike the encryption case below, you can’t provide a level of transparent traversal through the nodes because the signing info you’d be shedding is actually needed by the application layer.

I think that's fine. One use case for this is that I might want to do a graphsync of a particular DAG, and then once I have all of that data the application can validate the signatures.

My main concern here is the ability to follow links that are encoded in the signed data. It would be possible to simply store the payload as JSON (decode the base64 string that was generated from the JWS encoding). Then, when verifying a signature, encode it back to proper JOSE encoding.
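One thing to watch with that approach: the signature covers the exact base64url string, so re-encoding only works if serializing the stored JSON reproduces the original bytes. A quick sketch of a check for this, where survivesJsonRoundTrip is a made-up helper and JSON.stringify stands in for whatever serializer a codec would use:

```javascript
const toB64url = (text) => Buffer.from(text, 'utf8').toString('base64url')

// True when decode -> JSON.parse -> JSON.stringify -> encode reproduces
// the exact payload string that was signed.
function survivesJsonRoundTrip (payloadB64url) {
  const text = Buffer.from(payloadB64url, 'base64url').toString('utf8')
  return toB64url(JSON.stringify(JSON.parse(text))) === payloadB64url
}

// The payload from the example token was serialized without whitespace.
console.log(survivesJsonRoundTrip('eyJzdWIiOiJKb2huIERvZSJ9')) // true

// The same data serialized with extra spaces would no longer verify.
console.log(survivesJsonRoundTrip(toB64url('{ "sub": "John Doe" }'))) // false
```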

Encryption

In order to solve this we’ll need some sort of signal to say “this data is encrypted with this public key.” Our tools will then need to incorporate some sort of decryption service that can retrieve the necessary private keys and decrypt the data.

Note that data may be encrypted with a public/private key pair, but may also be encrypted using a symmetric key.

Once it decrypts the data it now has a bunch of decrypted bytes, what does it do with them?

Both JOSE and COSE are specific about how you convert these bytes back to JSON and CBOR respectively. Maybe I'm misunderstanding, but I don't see how this would be a problem.

Only issue I see is that the codec would be visible in plaintext and there may be a security reason why we’d want to hide that.

What would that be? I imagine if you really wanted to know the codec you could just look at the code that generated the data.

mikeal commented 4 years ago

Both JOSE and COSE are specific about how you convert these bytes back to JSON and CBOR respectively. Maybe I'm misunderstanding, but I don't see how this would be a problem.

This means that it’s not all that suitable for IPLD because we need to encrypt graphs of arbitrary formats. The most obvious example is raw blocks.

We also still have this issue within just json/cbor because you could do valid encodes of either dag-json or json and there isn’t a good way to differentiate ;(

So these are standards for signing and encrypting JSON and CBOR objects. In the case of JOSE the payload is a base64 encoded JSON object when signed.

Does this mean that a signed object doesn’t have the payload as JSON but as a string inside JSON that is the base64 encoded payload? I expected something like this for encrypted data but didn’t anticipate this would be the case for signed data as well. If this is the case, we are going to need new codecs even for the basic stuff because we’ll want to decode that signed data to make it traversable.

JSON’s lack of binary support is forcing some incredibly bad performance issues here. It’s unavoidable, we already deal with this in dag-json (links and binary are base64 encoded) but as we also explore some of the nasty corners of CBOR I feel like we’re being pushed more and more towards authoring a new block format that is ideally suited to what we’re doing here. Anyway, just some thoughts, none of this should block us working on good support for these standards.

Only issue I see is that the codec would be visible in plaintext and there may be a security reason why we’d want to hide that.

What would that be? I imagine if you really wanted to know the codec you could just look at the code that generated the data.

Anything you expose about encrypted data is a vector of attack. Knowing something about the payload gives you information about what characters are likely to occur in certain places which can aid in cracking vulnerable algorithms.

The scuttlebutt community has spent more time thinking about this than most, and they find that even the shape of a graph and the size of the branches allows you to make assumptions about what is in the data you may not want visible, which is why they hide the links behind another layer of encryption with a replication key. For instance, just the size of a graph, tracked over time, gives you knowledge about how active the user has been in adding data, and if you track changes to the graphs being published you even have some idea of when they are adding data.

When thinking specifically about exposing the codec, you know a lot about a user if you can see enough esoteric codecs for encrypted data. Remember, there are codecs for ethereum, bitcoin and git that are not “generic” the way dag-json and dag-cbor are. Seeing those codecs gives you insight into the applications and use cases that user is engaging in even if the data is encrypted.

oed commented 4 years ago

This means that it’s not all that suitable for IPLD because we need to encrypt graphs of arbitrary formats. The most obvious example is raw blocks.

I get that it might not be suitable for the standard way of encrypting data in IPLD. However I still feel it would be very beneficial to support as a specific way since it already has a lot of momentum within other communities.

If this is the case, we are going to need new codecs even for the basic stuff because we’ll want to decode that signed data to make it traversable.

Yes, this was what I was trying to convey with my initial post :)

JSON’s lack of binary support is forcing some incredibly bad performance issues here. It’s unavoidable

Yep, that's understandable. We ideally want to just use COSE / CBOR but the maturity of that standard is not at a stage where we are comfortable using it yet. Hopefully we can migrate to that easily in the future.

Anything you expose about encrypted data is a vector of attack.

For sure, exposing information about algo etc. seems like common practice though. Should be secure if the cipher is secure. Regarding your point about the shape of dag structures. Yes it will for sure leak information and the replication key makes a lot of sense (I believe this is the same thing that Textile does?). To me this seems like a matter of trade offs and different applications will have different needs.

I get the point about esoteric codecs. I don't believe that COSE/JOSE will be that :)

vmx commented 4 years ago

I get that it might not be suitable for the standard way of encrypting data in IPLD. However I still feel it would be very beneficial to support as a specific way since it already has a lot of momentum within other communities.

This is also how I understood the initial post. This is not about having the ideal signing/encryption standard, but trying to see if existing standards can work when using IPLD as an intermediate layer. I think IPLD should support such use cases.

mikeal commented 4 years ago

I get that it might not be suitable for the standard way of encrypting data in IPLD. However I still feel it would be very beneficial to support as a specific way since it already has a lot of momentum within other communities.

This is also how I understood the initial post. This is not about having the ideal signing/encryption standard, but trying to see if existing standards can work when using IPLD as an intermediate layer. I think IPLD should support such use cases.

Let me clarify. My comment about suitability was more in response to Rod’s comment about potentially adopting these as the starting point for a more generic encryption standard for all of IPLD.

mikeal commented 4 years ago

Ok, let’s try to move this forward.

The low hanging fruit here is adding codecs for signed data. Those are doable today without any spec or stack changes in IPLD. Encryption support will require changes elsewhere but we can do the work to fully decode signed data today at the block level.

There’s two approaches we could take.

One is to call it dag-jose-signed? Again, we really can’t do much right now at the codec layer for encrypted data anyway. Also note that you could still encode encrypted data with dag-jose-signed, it just wouldn’t be able to decode anything at the block layer and would pass everything up like it was regular dag-json.

Another option is to call it dag-jose and when we hit encrypted data we could at least do the work of decoding the base64 on the json side so that we hand back a proper buffer. This might make a unified decryption layer in readers easier, since this should make the structure between encrypted JOSE and COSE identical.
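As a sketch of that second option, here is roughly what decoding the base64url members of a flattened JWE into bytes could look like. The member names come from the JWE JSON serialization (RFC 7516); decodeJweMembers and encodeJweMembers are hypothetical names, and the sample values are arbitrary bytes, not a real encrypted object.

```javascript
// Members of a flattened JWE that hold base64url-encoded bytes.
const BINARY_MEMBERS = ['protected', 'encrypted_key', 'iv', 'ciphertext', 'tag', 'aad']

// Decode: hand back Buffers for binary members, leave JSON members
// such as "header" and "unprotected" untouched.
function decodeJweMembers (jwe) {
  const node = {}
  for (const [key, value] of Object.entries(jwe)) {
    node[key] = BINARY_MEMBERS.includes(key) ? Buffer.from(value, 'base64url') : value
  }
  return node
}

// Encode: the reverse mapping, back to the JSON form.
function encodeJweMembers (node) {
  const jwe = {}
  for (const [key, value] of Object.entries(node)) {
    jwe[key] = Buffer.isBuffer(value) ? value.toString('base64url') : value
  }
  return jwe
}

const sampleJwe = {
  protected: Buffer.from('{"alg":"dir","enc":"A256GCM"}').toString('base64url'),
  iv: Buffer.from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]).toString('base64url'),
  ciphertext: Buffer.from([0xde, 0xad, 0xbe, 0xef]).toString('base64url'),
  tag: Buffer.alloc(16, 7).toString('base64url')
}

const sampleNode = decodeJweMembers(sampleJwe)
console.log(sampleNode.ciphertext) // <Buffer de ad be ef>
```

Round-tripping through encodeJweMembers gives back the same strings as long as the input was unpadded base64url, which matters because the protected member is part of what gets authenticated.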

oed commented 4 years ago

That sounds great!

Agree that it makes sense to start with signed data. I do think it makes sense to follow closely with encrypted data using your second option. This gives us a great way of dealing with encryption at the application layer. Hopefully in the future it may be handled by IPLD, but I understand that this is not feasible any time soon.

carsonfarmer commented 4 years ago

Glad to see this conversation progressing. I think @mikeal's second proposed approach is preferable from our perspective as well, so happy to support and contribute to that work!

mikeal commented 4 years ago

I was talking to @vmx today and realized that we would really benefit from having a call about these codecs.

We don’t know all the use cases for codecs, and there seem to be some difficulties here in reconciling the IPLD Data Model with the codec. That shouldn’t really be a blocker, since multiformat codecs aren’t bound by the Data Model; the Data Model is just there to find the right representation. I think there’s something here for us to learn, so I’d like to have a call about it so we can really dig in.

mikeal commented 4 years ago

@jonnycrunch ^^

carsonfarmer commented 4 years ago

The timing is good for this call as well, because we are getting close to starting work on this (type of) thing. What's the best way to coordinate/schedule something like this? Sooner rather than later would be good from our (@oed and @carsonfarmer) perspective.

jonnycrunch commented 4 years ago

@mikeal thanks! let me know.

oed commented 4 years ago

Happy to jump on a call @mikeal. And sorry I've been meaning to get back to the spec PR, but things have piled up. If possible, a meeting this or next week would be great!

mikeal commented 4 years ago

looking at next week, can ya’ll DM me your email addresses for this calendar invite :)

burdiyan commented 4 years ago

For signing data we've ended up implementing something similar to how Perkeep signs JSON, but with CBOR. I think the way Perkeep goes about signing JSON is a fantastic approach which is a lot simpler than trying to implement JOSE, which indeed would require implementing a new codec to make any sense of the signed data besides it being opaque bytes.

Here's a great explanation and rationale from Perkeep: https://perkeep.org/doc/json-signing/

For CBOR it's a bit more complex because you'd need to modify the CBOR headers and so on, so what we ended up doing is having a wrapper struct with data and signature fields. So we serialize the CBOR data, sign it, and then "embed" the valid CBOR map inside the data "manually". So in the end we don't do double serialisation, and we also keep it as a valid CBOR with data being valid structured CBOR as well, not opaque bytes.

oed commented 4 years ago

@burdiyan Main benefit of using JOSE/COSE is that it's a well used standard (IETF) with lots of tools available and a large existing community. The main benefit with these codecs is interoperability.

burdiyan commented 4 years ago

@oed I totally get your point. I think I haven't made myself clear: this was just my attempt to share some ideas about what users currently have to deal with due to the lack of official codecs and implementations for signing IPLD data. Coz getting COSE and even JOSE right is a hard thing to do. And the lack of codecs would break the interoperability.