tkhq / qos

QuorumOS is a computation layer for running applications inside Trusted Execution Environments (TEEs)
GNU Affero General Public License v3.0

Investigate: Completely and programmatically reboot enclaves after sensitive operations #126

Open emostov opened 2 years ago

emostov commented 2 years ago

Once daisy-chain-boot is implemented, an enclave could be rebooted programmatically. For example, once an enclave completes a sensitive action, it can reboot itself and make a daisy-chain-boot request to another enclave to get the QuorumKey.
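To make the proposed flow concrete, here is a toy Python model of "act, reboot, re-provision from a peer". It is only a sketch: the `Enclave` class and its methods are hypothetical names for illustration, not QOS APIs, and the real daisy-chain-boot steps (attestation, encrypting the key to a fresh ephemeral key) are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Enclave:
    """Toy enclave model. All names are hypothetical, not QOS APIs."""
    name: str
    quorum_key: Optional[bytes] = None
    scratch: dict = field(default_factory=dict)
    boots: int = 0

    def reboot(self) -> None:
        # A reboot discards all in-memory state, including the key.
        self.quorum_key = None
        self.scratch.clear()
        self.boots += 1

    def daisy_chain_boot(self, peer: "Enclave") -> None:
        # Stand-in for the real request: attestation and re-encryption
        # of the key to an ephemeral key are omitted in this sketch.
        if peer.quorum_key is None:
            raise RuntimeError("peer is not provisioned")
        self.quorum_key = peer.quorum_key

    def sensitive_action(self, payload: bytes, peer: "Enclave") -> str:
        if self.quorum_key is None:
            raise RuntimeError("enclave is not provisioned")
        self.scratch["last_payload"] = payload
        result = f"signed:{payload.hex()}"  # placeholder, not a real signature
        # Wipe everything, then fetch the key again from the peer.
        self.reboot()
        self.daisy_chain_boot(peer)
        return result
```

After `sensitive_action` returns, the enclave holds the key again but none of the state touched during the action, which is the property the reboot is meant to buy.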

The upside here is that "losing access to a compromised VM after every action because it reboots while discarding all changed state and memory looks pretty daunting when it comes to remote code execution attacks that need to achieve some persistence." (@cr-tk).

However, the downsides of a full daisy-chain reboot after every action are numerous:

1) computationally costly calls to constantly re-encrypt the requested Quorum Key to a new ephemeral key

2) potentially costly calls to NSM for attestation

3) latency, especially if most enclaves are rebooting at the same time

4) perhaps the biggest concern: "a system that constantly encrypts and decrypts and signs with important keys and over important keys (multiple times per second in the whole cluster, perhaps) offers a lot more opportunities for attackers to listen for side channels or mount some attack that only succeeds every 10⁶ times or so (example number)." (@cr-tk)
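A back-of-the-envelope sketch of the last point: if an attack succeeds with probability p per key operation, the chance of at least one success over n operations is 1 - (1 - p)^n. The code below assumes attempts are independent, which is a simplification, and uses the 10⁶ example figure from the quote; the function names are made up for illustration.

```python
import math


def p_at_least_one_success(p_per_op: float, n_ops: int) -> float:
    """P(at least one success in n independent attempts) = 1 - (1 - p)^n."""
    # log1p/expm1 keep precision when p_per_op is very small.
    return -math.expm1(n_ops * math.log1p(-p_per_op))


def ops_until(p_per_op: float, target: float) -> int:
    """Smallest n for which the cumulative success probability reaches target."""
    return math.ceil(math.log1p(-target) / math.log1p(-p_per_op))


# With p = 1e-6, one million operations give the attacker roughly a 63% chance.
print(p_at_least_one_success(1e-6, 1_000_000))
# Number of operations until the attacker has an even chance of one success.
print(ops_until(1e-6, 0.5))
```

The more key operations the cluster performs per second, the faster n grows, so every extra re-encryption caused by a reboot shortens the time until such a rare attack is expected to land.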

Generally speaking, "we probably want to keep the relevant operations low if we can, because most CPUs are pretty bad at hiding what they're doing at the electrical level, or ensuring that all cryptographic computation steps are actually correct." ... "a practical setup could include some intentional tradeoffs. In theory, we could leave it up to the client if they want high signing-operation-per-second numbers or live with the limitations, clear after action X but not after Y, clear anyway every Z hours, and so on. However, I think it'll complicate the system design properties to have this flexibility. An enclave reboot or wipe operation will result in latency spikes, and the resulting cluster may e.g. have an unexpected temporary dip in signing capacity if most signers happen to reboot or wipe at the same time." (@cr-tk)
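The "clear after action X but not after Y, clear anyway every Z hours" tradeoff could be expressed as a small policy object. This is a hedged sketch, not a proposed design: `RebootPolicy`, its fields, and the action names are all hypothetical, and the jitter is just one possible way to keep signers from rebooting at the same moment.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RebootPolicy:
    """Hypothetical per-client policy; none of these names exist in QOS."""
    reboot_after: frozenset       # actions that always trigger a reboot ("X")
    max_uptime_s: float           # reboot anyway after this long ("Z hours")
    jitter_fraction: float = 0.2  # share of max_uptime_s used for spreading

    def deadline_s(self, rng: random.Random) -> float:
        # Pick a per-enclave deadline slightly before max_uptime_s so that
        # time-based reboots do not all happen together.
        spread = self.max_uptime_s * self.jitter_fraction
        return self.max_uptime_s - rng.uniform(0.0, spread)

    def should_reboot(self, action: str, uptime_s: float, deadline_s: float) -> bool:
        return action in self.reboot_after or uptime_s >= deadline_s


policy = RebootPolicy(reboot_after=frozenset({"sign"}), max_uptime_s=3600.0)
deadline = policy.deadline_s(random.Random())
print(policy.should_reboot("sign", 5.0, deadline))          # action-triggered
print(policy.should_reboot("health_check", 5.0, deadline))  # neither rule fires
```

Jitter only addresses the time-based reboots; action-triggered reboots can still cluster under load, which is the capacity dip described above.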

ref: https://github.com/tkhq/qos/issues/122

cr-tk commented 2 years ago

A note on the remote code execution scenario: modern security systems and secure boot-related mechanisms make it hard for attackers to achieve persistence of their code execution, for example across the reboot cycle of an iOS-based smartphone. That is useful from a defense perspective since it limits the blast radius and gives end users a realistic way to recover back into a trustworthy system. For example, a compromise of a QOS instance does not survive a reboot if it happened during operation rather than through a malicious initial software image that was booted, assuming the instance has not been exploited again since.

In the case of QOS, it's necessary to ask how much the blast radius is actually reduced in practice if a QOS instance is fully compromised for the duration of one operation. In the above-described scenario, a rebooted instance that goes back to a non-compromised state will no longer be able to manipulate what is going to be signed. In other words, it reverts to a trustworthy state. However, if one compromised instance and one signing-based exfiltration (or general socket-based communication with a malicious coordinator component) are enough to leak the main QuorumKey, this blast radius reduction and restoration of a trusted software state may not be worth as much, because the external attacker can now start generating arbitrary signatures (!) with the key that was leaked.

So I think we should have a good understanding of why we're taking on certain tradeoffs like costly reboots and keep in mind scenarios where this doesn't help.