paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0

Add remote signing to substrate client #4689

Open wpank opened 4 years ago

wpank commented 4 years ago

The following proposes the addition of remote signing functionality within the substrate client.

Context

Security of Proof of Stake networks lies in the hands of validators: without the security these entities provide, the whole system falls apart. A validator is responsible for operating its nodes in a stable, reliable, consistent, and secure manner. This responsibility also includes managing their signing keys, the keys that let the network know they were the ones that verified that the activity they put on the network is non-byzantine.

As a validator, the current paradigm of storing hot session keys in the client leaves much to be desired in terms of security. Although session keys cannot lead to direct access of funds, a compromise of the validator host (and the session keys within it) can lead to a complete loss of funds for a validator and for those nominating them. Furthermore, there is a greedy incentive to compromise these keys, as up to 10% of the slash can be rewarded to those who report it. While key rotation helps mitigate this to an extent, a more elegant solution for key storage and signing will be required in the long run.

Separating out the storage and signing interface of session keys from the validator host client would allow validators to create more robust and flexible operations, while providing additional layers of defense against possible attack vectors. A full compromise of the validator host shouldn't enable conditions where the validator can be slashed. Separating out the storage of session keys would mean adding the ability to have a remote signing interface, which gives a flexible means of having a remote signing server - one which ideally has double signing protection and HSM, TEE, Ledger, and TSS support. This addition increases the cost of compromising validator operations, something that creates a more resilient and secure network in the long run.

Remote Signing Server

The following proposes one approach to a remote signing server, although the interfaces exposed by the substrate client should allow for multiple implementations to exist. The signing server proposed here would live as a rust module in a separate repository; these considerations are for reference and context.

A remote signing server should be flexible to account for a diversity of key management approaches, including TEE, HSM, cloud HSM, Ledger, and encrypted software based key storage. Additionally, the remote signing server should be able to support multiple substrate based chains. This essentially acts as a single API for all key management and signing.

Approach

The signing server should run as a separate process on a physical on-premise host. Cloud-based deployment should be considered as well, though it is less preferred.

An approach would be for the remote server to use an inverse connection: the remote signer makes an outbound encrypted connection to the validator host, which listens at the multiaddr URL specified by the substrate client cli flag --keystore-server <URL>. The remote signer would not be open to any inbound traffic, reducing its attack surface. It is then the signer's responsibility to keep the connection to the substrate client open. After making the initial connection, the remote signing server listens for RPC requests from the validator host, handles them by creating the appropriate signature or payload, and sends the response back to the validator host.

RPC API Spec

Requests and responses from the substrate client to the signing server should be tagged appropriately to differentiate how and what to sign. These would be specific to the module that is requesting them, such as GRANDPA or BABE.

One could imagine the following types of RPC requests/responses:

The specifics of these should be discussed, with an eye to minimizing the changes needed in the substrate client.
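As a rough illustration of what "tagged" requests and responses could look like, here is a minimal Rust sketch. Every name in it (`Module`, `SignRequest`, `SignResponse`, `ToySigner`, `handle`) is hypothetical and not part of Substrate, and the "signature" is a placeholder concatenation rather than real cryptography.

```rust
/// Hypothetical tag naming the consensus module that wants a signature.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Module {
    Babe,
    Grandpa,
}

/// Hypothetical request sent from the substrate client to the signing server.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SignRequest {
    /// Ask which public keys the server holds for a module.
    PublicKeys { module: Module },
    /// Ask for a signature over `payload` with the key identified by `public`.
    Sign { module: Module, public: Vec<u8>, payload: Vec<u8> },
}

/// Hypothetical response sent back to the substrate client.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SignResponse {
    PublicKeys(Vec<Vec<u8>>),
    Signature(Vec<u8>),
    Refused(String),
}

/// Toy server holding one "key" per module (assumption for illustration).
pub struct ToySigner {
    pub babe_key: Vec<u8>,
    pub grandpa_key: Vec<u8>,
}

impl ToySigner {
    fn key_for(&self, module: Module) -> &[u8] {
        match module {
            Module::Babe => self.babe_key.as_slice(),
            Module::Grandpa => self.grandpa_key.as_slice(),
        }
    }

    /// Dispatches on the request tag and answers with a matching response.
    pub fn handle(&self, request: SignRequest) -> SignResponse {
        match request {
            SignRequest::PublicKeys { module } => {
                SignResponse::PublicKeys(vec![self.key_for(module).to_vec()])
            }
            SignRequest::Sign { module, public, payload } => {
                if public.as_slice() != self.key_for(module) {
                    return SignResponse::Refused("unknown key".to_string());
                }
                // Placeholder "signature" (public || payload); NOT real cryptography.
                let mut signature = public;
                signature.extend_from_slice(&payload);
                SignResponse::Signature(signature)
            }
        }
    }
}
```

Because the module tag travels with every request, a server built along these lines could apply module-specific checks before signing.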

Configuration

Configuration of the signing server can be done via a config file that gets loaded upon starting the remote signing server. As one design goal is to have flexible ways of storing keys, this will be used for specifying the key provider (what is storing the keys), type of key, validator host, and so forth.

The following is a non-exhaustive list of some possible configuration parameters:

CLI

The remote signing server would likely have a cli interface for setup, debugging, and deployment.

One could imagine the following possible commands:

Key Providers

The following describes some key providers and some benefits and trade offs they may provide.

HSMs

HSMs, or hardware security modules, allow you to store keys in a secure manner within hardware. They use tamper proof secure elements that prevent key extraction and allow payloads to be signed without ever exposing the private keys to the host. Since the generated keys never leave the device, even if the validator host is compromised, an attacker would not be able to access these keys.

One issue with most HSMs, however, is that they are dumb signing oracles: they will sign whatever they receive without verifying it. Thus an HSM alone doesn't provide much security over soft signing in terms of equivocation. If the validator host is compromised, an attacker can still request a signature, even though they cannot extract the keys themselves. This approach is thus most useful with a remote signing server that also has double signing protection.

TEE

A remote signer operating within a TEE such as SGX or TrustZone gives increased security compared to filestore based storage.

Here's one approach as to how this can be used in this type of situation.

Ledger

Ledgers work very well amidst HSM-like solutions, as they are programmable (and thus double signing protection can be built into the software). They are also cheap, highly available, and easily accessible. In production datacenters, these can work surprisingly well.

Substrate Client

One would need to modify the Substrate client to account for fetching keys and signatures externally.

The first thing to do is implement an RPC server for sending and fetching requests. This would involve either creating a new module, keystore-server, or modifying the keystore module to include this.

The RPC server would start when an additional cli flag is given to a substrate client, --keystore-server <CONNECTION_SECRET>. When this flag is given, the RPC server will begin to listen for a request from the remote signing server to initiate a handshake. CONNECTION_SECRET will be needed to start the handshake, and from an operator's perspective, this should be handled with a secrets management service like HashiCorp Vault. Additionally, another flag, --keystore-server-url <URL>, could specify the url or port that the RPC server listens on.

If the substrate node is started with the --keystore-server flag enabled, it would wait until a handshake is made before it starts producing and finalizing blocks.

Additionally, the substrate client would need to change how keys are fetched and signatures are created. One approach would be to modify the keystore in the client to abstract over whether signing happens in the client or is delegated to the remote server. This abstraction would define the interface that both the client signer (perhaps within the keystore) and the external signer implement. Either a new keystore-server or the modified existing keystore will be responsible for generating the requests sent to the external signing server. The consensus modules will need changes to delegate the creation of those requests to keystore/keystore-server.

Double Signing Protection

Although adding a remote signer can add a layer of security compared to the current status quo, if the validator host were to be compromised, the attacker can still initiate a double sign by invoking the remote signing server. In order to mitigate this, double signing protection should eventually get built into the remote signing server. If the substrate client is compromised, the signing server should be able to prevent equivocation, or anything that ends in the corresponding extreme level of slashing for the validator.

In order to do this, the remote signing server would need to keep track of state so that it cannot be made to produce or finalize conflicting blocks.

In Tezos, double signing protection is done by keeping track of a high watermark for endorsements and block headers. The high watermark is the highest level to have been baked so far and no block header or endorsement will be signed at a lower block level than the previous block or endorsement.

In Cosmos, this is done by keeping track of the last Height, Round, Step (HRS). When trying to sign a new block, it will only sign any that have a higher HRS.
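The watermark idea can be sketched in a few lines of Rust. This follows the Cosmos-style rule described above (sign only at a strictly higher Height, Round, Step); the type and method names (`Hrs`, `Watermark`, `check_and_advance`) are hypothetical, and a real signer would also need to persist the watermark durably before releasing a signature.

```rust
/// (height, round, step), compared lexicographically in that order
/// because derived `Ord` follows field declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hrs {
    pub height: u64,
    pub round: u32,
    pub step: u8,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SignError {
    /// The request is not strictly above the last signed position.
    NotAboveWatermark { last: Hrs, requested: Hrs },
}

/// In-memory high watermark (hypothetical; persistence is out of scope here).
#[derive(Debug)]
pub struct Watermark {
    last: Option<Hrs>,
}

impl Watermark {
    pub fn new() -> Self {
        Watermark { last: None }
    }

    /// Allows signing only if `requested` is strictly higher than anything
    /// signed before, and records it as the new watermark on success.
    pub fn check_and_advance(&mut self, requested: Hrs) -> Result<(), SignError> {
        if let Some(last) = self.last {
            if requested <= last {
                return Err(SignError::NotAboveWatermark { last, requested });
            }
        }
        self.last = Some(requested);
        Ok(())
    }
}
```

A signing server would call `check_and_advance` before every signature and refuse the request on `Err`, so a compromised validator host cannot obtain two signatures for the same position.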

Thus, the following will need to be constructed individually:

High Availability

Having both remote signing and double signing protection opens the way to high availability (active/active) setups that would increase the resiliency of the network and validator operations. One possibility this unlocks is an MPC HA keystore server with m-of-n threshold signatures required to produce the signature for the validator host. This depends on #11, but ultimately creates an extremely robust setup where the cost of compromising a validator becomes substantially higher, and the opportunity to do so substantially lower, than under the current status quo.

v1

A first version of this would have minimal functionality, likely using session keys as they are now, but isolated within a remote signing server. HSM interfaces as well as double signing protection should be next steps.

Discussion

brenzi commented 4 years ago

We have proposed a solution for this based on Intel SGX TEEs: https://github.com/w3f/Web3-collaboration/pull/234

Remarks to your OP:

burdges commented 4 years ago

We're about to do a new VRF, likely called VRedJubJub, that we'll need to support as well, and of course BLS signatures, but they do not add as much complexity here. Of course, Ledger devices cannot produce SNARKs and maybe cannot do BLS signatures.

Noc2 commented 4 years ago

Just for your information: Zondax is receiving a grant from us to work on a flexible TrustZone-based HSM stack

gnunicorn commented 4 years ago

Changing this in substrate will be very involved, as it introduces a completely different pattern of what the keystore is and how it works. The way it works right now is that the keystore is a single entity in the system (either in memory or saved on disk), holding different types of keys for different tasks. When a component needs to sign something, it asks the keystore for the appropriate keys and uses them to sign the data. This means it is a direct, non-blocking API, and the key holds all information needed for signing directly in memory; though discouraged, you can keep the key around and reuse it.

This, however, proposes a completely different approach to how signing works. Rather than the keystore holding the keys, you'd have to submit something you'd like to have signed to it and wait for that to return, making it an async and indirect API. While not impossible, a range of crates depend on the keystore directly and a range of others imply this pattern (e.g. GRANDPA). Switching these is a pretty large task, touching a lot of code, much of which is sync right now and would become async as a result, with (probably) a big tail of things to change in response to that ;) .

burdges commented 4 years ago

We'd prefer doing this by features, not adding some new Signer trait to every substrate crate, right? We've no roadmap for async fns in traits of course, but this holds even if async fns in traits worked, right?

bkchr commented 4 years ago

I don't think that it will be that involved on the Substrate side of signing. It is right that we need some changes here and there. However, aura, grandpa and babe are already async. The trait can just return a Future as result and we wait for the signing. By default with no remote signing the api would be blocking and return directly the signed data.

Offchain signing (imonline) also shouldn't be that hard: we need to call into the host anyway and use block_on to wait for the future, like we do for http requests.

As everything uses the Keystore behind a trait already, it should really be not that hard to integrate.
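For illustration, the pattern described in this comment (a signing method that returns a Future, resolved immediately in the local case and awaited with `block_on` from synchronous code) could be sketched as follows. The trait `AsyncSigner`, the type `LocalSigner`, and the tiny `block_on` are hypothetical stand-ins built only on the standard library; they are not the Substrate keystore trait, and the "signature" is a placeholder.

```rust
use std::future::{self, Future};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Boxed future yielding a signature or an error message.
pub type SignFuture = Pin<Box<dyn Future<Output = Result<Vec<u8>, String>> + Send>>;

/// Hypothetical, simplified signing interface that returns a future.
pub trait AsyncSigner {
    fn sign_with(&self, key: &[u8], msg: &[u8]) -> SignFuture;
}

/// Local signer: the result is available at once, so the future is already resolved.
pub struct LocalSigner;

impl AsyncSigner for LocalSigner {
    fn sign_with(&self, key: &[u8], msg: &[u8]) -> SignFuture {
        // Placeholder "signature" (key || msg); NOT real cryptography.
        let mut signature = key.to_vec();
        signature.extend_from_slice(msg);
        Box::pin(future::ready(Ok(signature)))
    }
}

/// Waker that does nothing; enough for the busy-polling loop below.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling `block_on`, for this sketch only. Real code would use
/// an executor, for example `futures::executor::block_on`.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(output) = fut.as_mut().poll(&mut cx) {
            return output;
        }
        std::thread::yield_now();
    }
}
```

A remote implementation of the same trait would return a future that completes when the signing server responds, while synchronous callers keep working through `block_on`.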

burdges commented 4 years ago

I dunno if https://github.com/iqlusioninc/armistice is relevant, but maybe good to track if you're working on this stuff

Demi-Marie commented 4 years ago

> Changing this in substrate will be very involved, as it introduces a completely different pattern of what the keystore is and how it works. The way it works right now is that the keystore is a single entity in the system (either in memory or saved on disk), holding different types of keys for different tasks. When a component needs to sign something, it asks the keystore for the appropriate keys and uses them to sign the data. This means it is a direct, non-blocking API, and the key holds all information needed for signing directly in memory; though discouraged, you can keep the key around and reuse it.
>
> This, however, proposes a completely different approach to how signing works. Rather than the keystore holding the keys, you'd have to submit something you'd like to have signed to it and wait for that to return, making it an async and indirect API. While not impossible, a range of crates depend on the keystore directly and a range of others imply this pattern (e.g. GRANDPA). Switching these is a pretty large task, touching a lot of code, much of which is sync right now and would become async as a result, with (probably) a big tail of things to change in response to that ;) .

Some implementations might actually be synchronous, such as those based on an on-chip TEE.

jleni commented 4 years ago

I think keystore and signer should be two different independent entities. Actually the concept of a software-based keystore may not always be required.. Substrate should ideally deal with a signer only. This signer may later rely on a keystore or not.

My recommendation is to aim for an asynchronous design to cope with latency issues. Even in the case of fast TEEs, it can affect performance if signing operations require context switches, syscalls, etc.

With respect to the work we did at Zondax in Tendermint, yes we used the HSM in Ledger devices (deserializing votes, checking with a monotonic counter, etc.). Latency in these devices is in the order of tens of milliseconds so an asynchronous approach was very important regardless of running in-process or remotely.

We are now working on a completely new design for Kusama/Polkadot/Substrate with a very much hardened datacenter-quality external device, running in a TEE plus in some models we even have access to an integrated HSM. While running a "lean" node would be possible, it means adding a bigger attack surface that we strongly would like to avoid.

Anyway, I am not sure if this discussion is still active.. though having seen the changes here https://github.com/paritytech/substrate/pull/4925/files, I think a good and quick step forward would be to:

1- Decouple the keystore implementation using two traits (i.e. signer, and keystore)
2- Move the current keystore implementation to another crate

This way interested parties can provide clean alternative implementations.

3- Ideally make sign_with async.. however, I understand that it may require substantial work and you prefer to avoid it for now.

jleni commented 4 years ago

IMO, once the signer/keystore have been fully decoupled and made async.. third-party implementations can define their own API, comm protocol, in-process vs remote approach, etc. I think this is the most flexible approach.

There is still one more complex but important issue. At the moment, signers operate on blobs, so they cannot really know what is being signed. In some cases, signers may even receive hashes of the actual content. This severely limits how smart a signer can be.. meaning, it is not possible to track and design adequate double signing protection schemes.

I would need to dig more into the current substrate implementation, but I wonder if there are a few convenient places that could be extended to provide more information at the moment of signing or this is at the moment scattered all over the code.

Otherwise, I can already see that, at least from my project's perspective, the keystore is actually not the point where we need to plug in, but rather just before GRANDPA/BABE/etc. decide to sign and still have a structured object.

brenzi commented 4 years ago

@jleni: Knowing and understanding what you sign is not enough. You need to be able to verify the payload. In the case of signing a block, you need to verify that the block is fresh and legit (builds upon HEAD and executes correctly) and you should prevent double signing. This means you need to know a lot of context before signing. We have previously elaborated on this.

jleni commented 4 years ago

Well, yes.. I understand this. I think we are talking about the same thing here. It can be enough in some specific cases like double sign tracking.

What we initially did in Tendermint (already a few years ago) is what you call Replay Attack Mitigation in your document. We used a Ledger device with a custom app that could deserialize votes and keep a relatively small state. We kept block verification on the validator node.

I was actually trying to ask for a small incremental step related to this particular issue. If the scope is block verification, things are different and you need a very good implementation for a light client.. I think this is out of scope in this issue.

High availability is something more complex and there are many options. I would initially discourage using JSONRPC, but that is part of another discussion. What I would definitely recommend is again, decoupling the signer/keystore, move it to another crate and allow external implementations to override this. There are multiple solutions, more advanced transport options, etc.

The best option is to create a simple and separate reference crate with the code that already exists and leave this open to further third-party improvements.

Demi-Marie commented 4 years ago

@brenzi @jleni this seems like a perfect use-case for a formally-verified microkernel, such as seL4. The microkernel could provide software-based isolation between untrusted components, such as the network stack, and trusted components, such as the signer implementation.

One major caveat is that the main framework that I know of for using seL4, CAmkES, only supports systems where all resources are statically allocated. Ideally, the trusted code should not use dynamic memory allocation, but I am not sure if this is practical.

brenzi commented 4 years ago

@DemiMarie-parity Very interesting! But wouldn't this require self-hosted signer HW? Even if cloud services would offer SeL4 VPS, why would you trust them? They still have access to all memory. Am I missing something?

Demi-Marie commented 4 years ago

> @DemiMarie-parity Very interesting! But wouldn't this require self-hosted signer HW? Even if cloud services would offer SeL4 VPS, why would you trust them? They still have access to all memory. Am I missing something?

@brenzi You are not. That is one reason why self-hosted signer hardware should be preferred. The biggest caveat is that not everyone can provide the level of physical security required, and most cannot provide the needed protection against DDoS attacks. Could @kirushik chip in?

Using seL4 has a few caveats:

Demi-Marie commented 4 years ago

To elaborate: From my perspective, the only advantage of a TEE and/or HSM is protection against attackers with physical access. I believe that equally important, if not more important, is privilege separation a la QubesOS. While Substrate is a substantial attack surface, we can remove much of the rest.

Demi-Marie commented 4 years ago

Working with QubesOS and Redox might be a good idea as well.

jleni commented 4 years ago

> To elaborate: From my perspective, the only advantage of a TEE and/or HSM is protection against attackers with physical access. I believe that equally important, if not more important, is privilege separation a la QubesOS. While Substrate is a substantial attack surface, we can remove much of the rest.

I disagree with this, TEEs do not have much to do with physical access. Both TEEs and HSMs can provide different (better?) guarantees than QubesOS (basically a Xen hypervisor without ASLR or NX).

I will not write extensively here, to avoid going off-topic, given this issue is mostly about providing an API for teams to provide their preferred security solution. Happy to organize or a Riot channel about this though!

Nevertheless, as there are MANY valid alternatives and approaches, I would strongly suggest to make the architecture as flexible as possible so different solutions can be integrated over time.

rakanalh commented 4 years ago

To advance this a bit further, especially after merging #4925, here's my line of thinking when it comes to implementing client support for remote signing:

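```rust
// Import paths are assumed from sp-core around the time of #4925
// and may differ in other versions.
use sp_core::crypto::{CryptoTypePublicPair, KeyTypeId};
use sp_core::traits::BareCryptoStoreError;

/// Signing interface proposed for both local and remote signers.
pub trait Signer {
    /// Lists the keys of type `id` that this signer can sign with.
    fn supported_keys(
        &self,
        id: KeyTypeId,
    ) -> Result<Vec<CryptoTypePublicPair>, BareCryptoStoreError>;

    /// Signs `msg` with `key`; `at_blockhash` names the block at which
    /// signing should happen.
    fn sign_with(
        &self,
        id: KeyTypeId,
        key: &CryptoTypePublicPair,
        msg: &[u8],
        at_blockhash: &[u8],
    ) -> Result<Vec<u8>, BareCryptoStoreError>;
}

/// Type of the client signer.
#[derive(Clone, Debug)]
pub enum SignerType {
    Local,
    RemoteClient,
    RemoteServer,
}
```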

- `RemoteClient` type of signer is where the substrate node dispatches signing requests towards a specific host/port over a specific endpoint for signing, be it an HTTP(s) call or a gRPC or potentially other protocols.
- `RemoteServer`, on the other hand, tells the substrate node that it should open a port and listen for "secure" connections where the node can send signing requests over this connection.

- We could also abstract the protocol implementation into its own trait so that additional protocols can be implemented on top of this where incoming / outgoing payloads can be encoded / decoded.

- To enable double-signing protection to be implemented by the server, it is suggested that the `sign_with` interface also adds `at_blockhash` parameter where the signing requests explicitly define the block hash at which signing should happen. This enables the server to query certain blockchain information such as block height or other parameters required.

- The interface is defined to be "sync" here but could use `block_on` to perform async operations. That is, until async support is implemented in certain parts of the substrate codebase.

I would like to get some feedback on the above to move this forward.

bkchr commented 4 years ago

Why do you want to introduce a new trait? The KeyStore trait is exactly meant for this, as abstraction over the key store.

RemoteServer should be an extra application and should not be included into the Substrate node!

You don't need to pass at_blockhash to the sign function. Based on the key type, you can decode the opaque blob that should be signed and this blob already contains all the information you need to prevent double signing.

rakanalh commented 4 years ago

> Why do you want to introduce a new trait? The KeyStore trait is exactly meant for this, as abstraction over the key store.

You're absolutely right. After working on the code for a bit, it is apparent to me that the separation of Signer and Keystore doesn't make sense. I am reverting the work I did by keeping Keystore as-is and going to introduce RemoteKeystore which handles remote key management and signing.

rakanalh commented 4 years ago

> You don't need to pass at_blockhash to the sign function. Based on the key type, you can decode the opaque blob that should be signed and this blob already contains all the information you need to prevent double signing.

Could you expand on this a bit please? How would the key type be relevant to the blob sent for signing?

bkchr commented 4 years ago

If you see the KeyType that is used by Babe, you can just decode the blob to the Babe specific structure. The same goes for Grandpa. Every key type makes it possible to identify the encoded blob to decode it.
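A minimal sketch of this idea, assuming a signer that picks a decoder from the four-byte key type: the identifiers `b"babe"` and `b"gran"` match Substrate's key type IDs for BABE and GRANDPA, but the local `KeyTypeId` struct, the `Decoded` payload layouts, and `decode_for_key_type` are invented for illustration. Real BABE and GRANDPA payloads are SCALE-encoded structures with different fields.

```rust
/// Four-byte key type identifier, mirroring the shape of Substrate's `KeyTypeId`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KeyTypeId(pub [u8; 4]);

pub const BABE: KeyTypeId = KeyTypeId(*b"babe");
pub const GRANDPA: KeyTypeId = KeyTypeId(*b"gran");

/// Hypothetical decoded payloads (layouts invented for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Decoded {
    BabeSlot { slot: u64 },
    GrandpaVote { round: u64, target_number: u32 },
}

#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    UnknownKeyType,
    BadLength,
}

/// Reads a little-endian u64 from the first 8 bytes, if present.
fn read_u64(bytes: &[u8]) -> Option<u64> {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes.get(..8)?);
    Some(u64::from_le_bytes(buf))
}

/// Chooses how to interpret the opaque blob based on the key type.
pub fn decode_for_key_type(id: KeyTypeId, blob: &[u8]) -> Result<Decoded, DecodeError> {
    if id == BABE {
        let slot = read_u64(blob).ok_or(DecodeError::BadLength)?;
        Ok(Decoded::BabeSlot { slot })
    } else if id == GRANDPA {
        let round = read_u64(blob).ok_or(DecodeError::BadLength)?;
        let rest = blob.get(8..12).ok_or(DecodeError::BadLength)?;
        let mut buf = [0u8; 4];
        buf.copy_from_slice(rest);
        Ok(Decoded::GrandpaVote { round, target_number: u32::from_le_bytes(buf) })
    } else {
        Err(DecodeError::UnknownKeyType)
    }
}
```

Once the blob is decoded into a structured value, a signer can compare it against what it has already signed instead of treating it as opaque bytes.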

burdges commented 4 years ago

I've increasingly realized that block seals should probably use the extra arguments to VRFs in https://github.com/w3f/schnorrkel/blob/master/src/vrf.rs not a separate signature, but not worth the effort required to change this since it'd only save 64 bytes per block.

Demi-Marie commented 4 years ago

@burdges I would love to see that change be made sooner rather than later, but I am not sure if it is practical right now. We can always make it at the next hard fork.

nicolasochem commented 1 year ago

Hi, when can we have a remote signer for substrate session keys?

I just re-read the description and everything is still very relevant. Let's prioritize this?

burdges commented 1 year ago

We've several major projects that shall further change the session key crypto: beefy, including optimized signing, sassafras, including ring VRFs and ephemeral block signing keys, new session certificates for slashing reform, post-quantum options, and equivocation prevention.

All development is path dependent.. It's possible if complex to implement remote signers for these after they're working, but it's impossible to implement & deploy these once everyone expects a remote signer.

FlorianFranzen commented 1 year ago

It should also be mentioned that Zondax has a working external signer that allows the management of session keys inside ARM's TrustZone; however, it seems Parity has no interest in supporting this officially yet (see #10423 for details).