keylime / meetings

Keylime meeting notes

Keylime New Architecture Design Meeting (TDB) #64

Closed THS-on closed 10 months ago

THS-on commented 1 year ago

Attendees

Topic

This meeting is intended to discuss how we implement the architecture changes in Keylime.

stefanberger commented 1 year ago

Since I wasn't able to attend the last meeting I (or maybe still 'we') would need to first understand what the other attestation protocols are doing and what they have in common, if anything, so that we can build an architecture around it.

THS-on commented 1 year ago

@stefanberger yes agree. The current idea is to start with TPM and SEV attestation report and build it from there. I'll try to add more details to this issue soon.

stefanberger commented 1 year ago

@THS-on Is this page a proper description of SEV Remote Attestation? https://enarx.dev/docs/technical/amd-sev-attestation

THS-on commented 1 year ago

@stefanberger the general flow should be correct, but the page seems to contain some Enarx-specific parts.

I found the sev-guest example generally helpful: https://github.com/AMDESE/sev-guest/blob/main/docs/ssh-key-exchange.md

For the proof-of-concept implementation of a vTPM inside SEV-SNP (https://github.com/svsm-vtpm/linux-svsm/blob/svsm-vtpm-preview/README-vtpm.md), the flow glued on top of Keylime is described here: https://arxiv.org/pdf/2303.16463.pdf

maugustosilva commented 1 year ago

Typing up here a few thoughts about the new architecture. Not fully fleshed out, but I do consider that each of these items could be worked on independently, unless stated otherwise:

1) A registrar can automatically add nodes to a verifier
  1.1.) A "quick and dirty" (yet useful) way of implementing it would be to have a parameter that allows the registrar to call the tenant code at the end of registration
  1.2.) A more permanent possibility would be to rework the registrar to perform most of the tenant functions directly, maybe eliminating the concept of the tenant altogether (although I am not entirely sure of it). Please consider possible interplay with item 11

2) Add the ability to "attest only a certain number of times and then stop". After the attestation limit is reached, the agent is put in the TERMINATED state instead of FAILED
  2.1.) We have an interesting use case where a customer wants to boot with the agent running on a RAMDISK, and then, during the initial adding to a verifier, a tenant sends a payload containing a key to decrypt the root filesystem on disk. In this scenario, the attestation can stop after this first delivery is done.

3) Separate the actual validation of artifacts sent by an agent from the evaluation of their contents:
  3.1) The main verifier code path should only validate:
    3.1.1) execute tpm2_checkquote
    3.1.2) "replay" the Measured Boot (MB) log and compare values for PCRs [0-9] and [11-14]
    3.1.3) "replay" the Integrity Measurement Architecture (IMA) log and compare values for PCR 10
  3.2) Evaluate the TPM policy by checking the values of other PCRs (15-23)
    3.2.1) Make the TPM policy more flexible, including not only a list of values for specific PCRs but also information from the TPM clock (e.g., number of reboots)
    3.2.2) In case "extended" attributes are sent by the agent, such as a "Confidential Compute Attestation Report", there must be a plugin architecture - very much like Durable Attestation (DA) "backends" - that can evaluate their contents
  3.3) The MB policy engine becomes a plugin, again with DA "backends" as the model here.
    3.3.1) Apply policy to the (already validated) MB log
    3.3.2) The currently existing policy engine mechanism becomes the "default implementation"
    3.3.3) External policy engines (e.g., seedwing, OPA) can be called with this mechanism
    3.3.4) Important debugging feature: a keylime_boot_policy_run that takes a policy plugin and an MB log as arguments and returns an error.
  3.4) The IMA policy engine also becomes a plugin (same architecture as the other plugins).
    3.4.1) Apply policy to the (already validated) IMA log
    3.4.2) The currently existing policy engine mechanism becomes the "default implementation"
    3.4.3) External policy engines (e.g., seedwing, OPA) can be called with this mechanism
    3.4.4) Important debugging feature: a keylime_runtime_policy_run that takes a policy plugin and an IMA log as arguments and returns an error.

4) Also add a plugin architecture to deal with "Confidential Computing" attestation reports. There is some work done by other teams on "Attestation as a Service" for "confidential containers", and we (@maugustosilva @galmasi) have a very good working relationship with members of the team developing this. Their "Key Broker Service" has a well-defined API that we could consider using (https://github.com/confidential-containers/kbs/blob/main/docs/kbs_attestation_protocol.md)

5) Also add a plugin architecture to deal with attestation failures: the verifier can invoke these plugins for specific agents in case of failures.

6) "Push model": an agent talks to a registrar to get its EK/AK into a database, is automatically added to a verifier (see item 1) and receives the contact information for that verifier. From this point on, it can periodically send attestation artifacts to be verified.
  6.1.) The contact information can be pulled by the agent after it has been successfully added to a verifier, or could be sent by either the verifier or the registrar
  6.2.) Please note that there has not been much work on the previous proposals https://github.com/keylime/enhancements/issues/60 and https://github.com/keylime/enhancements/issues/57.

7) Support for different kinds of signing and encryption keys (in addition to EK/AK), such as IDevID, IAK and SRK.
  7.1.) HPE plans to open a PR for IDevID, IAK

9) Remove the use of all tpm2-tools command line invocations from the registrar and verifier code.
  9.1.) We need to replace the use of tpm2_makecredential on the registrar
  9.2.) We need to replace the use of tpm2_eventlog on the verifier. There is an ongoing effort to do so at https://github.com/keylime/python3-uefi-eventlog
  9.3.) We need to replace the use of tpm2_checkquote on both the verifier and the tenant
  9.4.) We need to replace the use of tpm2_print on the verifier

10) Redesign of main event loop on verifier: goal here is to make it more scalable.

11) Currently, registrar does not use the tornado framework. Consider using it.
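
The plugin idea in item 3 (a policy plugin applied to an already validated log, with a debugging entry point that returns an error) could look roughly like the sketch below. This is only an illustration: `PolicyPlugin`, `AllowedKernelPlugin`, `register`, `boot_policy_run` and the log/policy field names are all hypothetical and are not existing Keylime APIs.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


class PolicyPlugin(Protocol):
    """Hypothetical plugin interface (not an existing Keylime class)."""

    name: str

    def evaluate(self, validated_log: dict, policy: dict) -> Optional[str]:
        """Return None if the log satisfies the policy, else an error message."""
        ...


@dataclass
class AllowedKernelPlugin:
    """Toy stand-in for a "default implementation" of an MB policy engine."""

    name: str = "default"

    def evaluate(self, validated_log: dict, policy: dict) -> Optional[str]:
        # The field names below are assumptions made for this sketch only.
        digest = validated_log.get("kernel_digest")
        if digest not in policy.get("allowed_kernel_digests", []):
            return f"kernel digest {digest!r} is not allowed"
        return None


# Registry of available plugins, keyed by plugin name.
PLUGINS: Dict[str, PolicyPlugin] = {}


def register(plugin: PolicyPlugin) -> None:
    PLUGINS[plugin.name] = plugin


def boot_policy_run(plugin_name: str, mb_log: dict, policy: dict) -> Optional[str]:
    """Debug helper: apply one plugin to an MB log and return the error, if any."""
    plugin = PLUGINS.get(plugin_name)
    if plugin is None:
        return f"unknown policy plugin {plugin_name!r}"
    return plugin.evaluate(mb_log, policy)
```

An external engine (e.g., OPA) would be one more class with the same `evaluate` method, registered under its own name; the verifier core would only ever call `evaluate` on a log it has already replayed and validated.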

stefanberger commented 1 year ago
  • A registrar can automatically add nodes to a verifier 1.1.) A "quick and dirty" (yet useful) way of implementing it would be to have a parameter that allows the registrar to call the tenant code at the end of registration 1.2.) A more permanent possibility would be to just rework the registrar to perform most of the tenant functions directly. Maybe eliminating the concept of the tenant altogether (although I am not entirely sure of it). Please consider possible interplay with item 11

What is tenant here? Our keylime_tenant tool that has all these commands and many options with IP addresses, runtime policy etc.?

  • Add the ability to "attest only a certain number of times and then stop". After the attestation limit is reached, the agent is put in the TERMINATED state instead of FAILED 2.1.) We have an interesting use case where a customer wants to boot with the agent running on a RAMDISK, and then, during the initial adding to a verifier, a tenant sends a payload containing a key to decrypt the root filesystem on disk. In this scenario, the attestation can stop after this first delivery is done.

What is 'a tenant' in this context?

edwards-n commented 1 year ago
  1. 1.2.) A more permanent possibility would be to just rework the registrar to perform most of the tenant functions directly. Maybe eliminating the concept of the tenant altogether (although I am not entirely sure of it). Please consider possible interplay with item 11

@maugustosilva An interesting suggestion. Simplification is good if it doesn't break important use cases. Did you intend to include an item 11? I don't see it.

maugustosilva commented 1 year ago
  • A registrar can automatically add nodes to a verifier 1.1.) A "quick and dirty" (yet useful) way of implementing it would be to have a parameter that allows the registrar to call the tenant code at the end of registration 1.2.) A more permanent possibility would be to just rework the registrar to perform most of the tenant functions directly. Maybe eliminating the concept of the tenant altogether (although I am not entirely sure of it). Please consider possible interplay with item 11

What is tenant here? Our keylime_tenant tool that has all these commands and many options with IP addresses, runtime policy etc.?

  • Add the ability to "attest only a certain number of times and then stop". After the attestation limit is reached, the agent is put in the TERMINATED state instead of FAILED 2.1.) We have an interesting use case where a customer wants to boot with the agent running on a RAMDISK, and then, during the initial adding to a verifier, a tenant sends a payload containing a key to decrypt the root filesystem on disk. In this scenario, the attestation can stop after this first delivery is done.

What is 'a tenant' in this context?

Right, in Keylime's original architecture, the tenant was a fourth component (in addition to registrar, verifier and agent) in charge of "closing the loop": it initiates an attestation by requiring the explicit pairing of an agent with a verifier. It was done this way because Keylime was envisioned to be used in a "shared attestation infrastructure", where multiple "tenants" without any mutual trust had to co-exist on the same registrar and verifier.

So, yes, this "component" is the one which executes keylime_tenant commands.

maugustosilva commented 1 year ago
  1. 1.2.) A more permanent possibility would be to just rework the registrar to perform most of the tenant functions directly. Maybe eliminating the concept of the tenant altogether (although I am not entirely sure of it). Please consider possible interplay with item 11

@maugustosilva An interesting suggestion. Simplification is good if it doesn't break important use cases. Did you intend to include an item 11? I don't see it.

Very good point! It is precisely why I was considering the (IMO) simple case where we allow the registrar to behave as an "in loco tenant" and call the same code path. I am not proposing that we formally remove the tenant from the architecture, since there are important use cases where it will be needed.

edwards-n commented 1 year ago

The "in loco tenant" model is better for our use cases. If I understand it correctly we wouldn't have to run the tenant.

maugustosilva commented 1 year ago

The "in loco tenant" model is better for our use cases. If I understand it correctly we wouldn't have to run the tenant.

You did understand correctly. The registrar would automatically add a new agent to a verifier, and an admin wouldn't even have to know about the tenant anymore.
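
A minimal sketch of this "in loco tenant" behaviour, assuming the registrar is given an optional callback that stands in for the code path `keylime_tenant -c add` uses today. The `Registrar` class, its fields and the callback signature are hypothetical and only illustrate the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Registrar:
    """Toy registrar, not Keylime's implementation."""

    # Hypothetical hook standing in for the tenant's "add agent" code path.
    add_to_verifier: Optional[Callable[[str, dict], None]] = None
    agents: Dict[str, dict] = field(default_factory=dict)

    def register_agent(self, agent_id: str, record: dict) -> None:
        # Step 1: store the registration data, as the registrar does today.
        self.agents[agent_id] = record
        # Step 2 (new): if auto-add is configured, act "in loco tenant".
        if self.add_to_verifier is not None:
            self.add_to_verifier(agent_id, record)
```

Leaving the callback unset keeps the current behaviour, where a separate tenant invocation is still required, so deployments that rely on an explicit tenant are not affected.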

stefanberger commented 1 year ago

The "in loco tenant" model is better for our use cases. If I understand it correctly we wouldn't have to run the tenant.

You did understand correctly. The registrar would automatically add a new agent to a verifier and an admin wouldn't have to even know about tenant anymore.

... choosing which runtime policy for the monitored machine?

ansasaki commented 1 year ago

Since we are discussing the architecture redesign, I would like to propose some ideas for consideration:

1 - To support multiple attestation mechanisms/roots of trust other than only the TPM, we could change the registration step so that the agent declares the list of quote types it can generate, and also the supported mode of operation (push or pull). If the list is not provided, it is assumed the agent can only generate the current quote format. This allows backwards compatibility, and also lets heterogeneous agents run side by side.

2 - Based on the capabilities declared by the agent, the verifier could select which types of quote it wants from the agent by setting new parameters in the GET request. If nothing is set, then it is assumed the current format of quote is requested, again keeping backwards compatibility.

3 - The message format for the quote should be attestation mechanism agnostic, able to contain any kind of data. The current quote message format could be one of the supported types, something like:

{
    "type": "TPM",
    "quote": <base64 encoded JSON string containing the current quote message format>
}

The agent response could contain a list of requested quotes. If not, then it is assumed the current quote format is expected, keeping backwards compatibility.

4 - We should consider redesigning the keylime protocol to be post-quantum crypto ready and FIPS compliant. Of course PQ ready TPM hardware will take years (decades?) to be available, but I'm suggesting we make keylime flexible where we can (e.g. keys used for TLS).

5 - The exchange of U and V keys encrypted by an RSA key is currently mandatory. When mTLS is enabled, the data is transmitted through a trusted encrypted channel with mutual authentication. It shouldn't be necessary to encrypt U and V in this case.

6 - mTLS is difficult to configure because of the complexity of certificate distribution. We should consider supporting other authentication mechanisms that scale better and simplify the deployment.
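
Items 1 to 3 above can be sketched together: the agent declares capabilities, the verifier picks quote types, and each quote travels in a mechanism-agnostic envelope. The function names and the "SEV-SNP" type label are assumptions made for illustration; only the `type`/`quote` envelope fields come from the proposal.

```python
import base64
import json
from typing import List, Optional, Tuple

# Type assumed for agents that declare no capabilities (backwards compatibility).
LEGACY_TYPE = "TPM"


def select_quote_types(agent_caps: Optional[List[str]],
                       verifier_prefs: List[str]) -> List[str]:
    """Return the quote types both sides support, in verifier preference order."""
    caps = agent_caps or [LEGACY_TYPE]
    return [t for t in verifier_prefs if t in caps]


def wrap_quote(quote_type: str, payload: dict) -> dict:
    """Put a mechanism-specific payload into the generic envelope."""
    encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return {"type": quote_type, "quote": encoded}


def unwrap_quote(envelope: dict) -> Tuple[str, dict]:
    """Inverse of wrap_quote: return the type and the decoded payload."""
    payload = json.loads(base64.b64decode(envelope["quote"]))
    return envelope["type"], payload
```

With this shape, the verifier can dispatch on `type` before it looks inside `quote`, so adding a new attestation mechanism does not change the outer message format.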

edwards-n commented 1 year ago

The "in loco tenant" model is better for our use cases. If I understand it correctly we wouldn't have to run the tenant.

You did understand correctly. The registrar would automatically add a new agent to a verifier and an admin wouldn't have to even know about tenant anymore.

... choosing which runtime policy for the monitored machine?

Good question, there are options: having a default in the verifier, allowing the agent to specify a policy as part of registration, and letting the tenant override the default.
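
One way to combine these three options is a fixed precedence order. The sketch below assumes tenant override first, then the agent's request, then the verifier default; that ordering, the function name and the policy names are assumptions, not something agreed in this thread.

```python
from typing import Optional


def resolve_runtime_policy(verifier_default: str,
                           agent_requested: Optional[str] = None,
                           tenant_override: Optional[str] = None) -> str:
    """Pick the runtime policy name for a newly added agent."""
    # An explicit tenant choice always wins (assumed precedence).
    if tenant_override is not None:
        return tenant_override
    # Otherwise honour what the agent asked for at registration time.
    if agent_requested is not None:
        return agent_requested
    # Fall back to the verifier-wide default.
    return verifier_default
```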

stefanberger commented 1 year ago

Good question there are options. Having a default in the verifier, allowing the agent specify as part of registration and the tenant overriding the default.

How is the initial registration, which is typically done via keylime_tenant -c add, to be kicked off in this scenario?

edwards-n commented 1 year ago

How is the initial registration, which is typically done via keylime_tenant -c add, to be kicked off in this scenario?

I thought the idea was that the registrar would do it automatically.

stefanberger commented 1 year ago

How is the initial registration, which is typically done via keylime_tenant -c add, to be kicked off in this scenario?

I thought the idea was that the registrar would do it automatically.

The registrar would have to know the IP address of host that's going to be monitored. How does it get that?

Regarding a default policy: if all the systems were the same, there could be a default policy for monitoring the TPM and IMA logs, but I think most environments won't have a collection of homogeneous systems. They will be different, and then each system will need its own policy. I don't think a default policy of monitoring 'nothing', which would apply across non-homogeneous systems, would help much.

maugustosilva commented 1 year ago

These are good questions @stefanberger so, allow me to offer my experiences deploying Keylime in prod (the mileage will most certainly vary, but it is at least one data point from a Cloud production environment).

1) Even in our environment with highly heterogeneous nodes, we end up with a single measured boot (MB) policy and a single runtime (IMA) policy
2) The registrar component has access to /etc/keylime/tenant.conf, and three important pieces of information are present in this file:
  a) Verifier IP/Port
  b) "default" MB policy
  c) "default" IMA policy
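
As a rough sketch of how a registrar could pick up those three pieces of information, the snippet below parses an INI-style file with Python's `configparser`. The section name, the option names and the policy paths are assumptions made for this example and may not match the real tenant.conf.

```python
import configparser

# Sample contents; option names here are illustrative assumptions.
SAMPLE_TENANT_CONF = """
[tenant]
verifier_ip = 127.0.0.1
verifier_port = 8881
default_mb_policy = /etc/keylime/mb_policy.json
default_ima_policy = /etc/keylime/runtime_policy.json
"""


def load_auto_add_defaults(text: str) -> dict:
    """Extract the verifier address and default policies from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["tenant"]
    return {
        "verifier": (section["verifier_ip"], section.getint("verifier_port")),
        # .get() returns None when an option is absent, i.e. "no policy".
        "mb_policy": section.get("default_mb_policy"),
        "ima_policy": section.get("default_ima_policy"),
    }
```

A missing policy option maps to `None`, which corresponds to the "node identity only, no MB, no IMA" auto-add case.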

Finally, as you said yourself, even if we resort to "auto-add is just for node identity: no MB, no IMA", it would already be useful (at the very least it is a "pipe cleaner" to detect verifier <-> agent communication problems).

An additional point: I currently don't see a lot of heterogeneity in TPM policy (after all, in "full attestation", PCRs 0-9 and 11-14 are already "taken"), but I am not saying that this will never be a problem.

stefanberger commented 1 year ago

I guess it depends on how strict/custom-tailored your policies are for those highly heterogeneous nodes; a common-denominator policy may or may not exist.

stefanberger commented 1 year ago

Finally, as you said yourself, even if we resort to an "auto-add is just for "node identity", no MB, no IMA" it would be already useful (it is at the very least a "pipe cleaner" to detect verifier <-> agent communication problems).

Where do errors then go to? To log files?

THS-on commented 10 months ago

Closing because the new push model proposal is now being implemented.