Since I wasn't able to attend the last meeting, I (or maybe still 'we') would need to first understand what the other attestation protocols are doing and what they have in common, if anything, so that we can build an architecture around it.
@stefanberger yes agree. The current idea is to start with TPM and SEV attestation report and build it from there. I'll try to add more details to this issue soon.
@THS-on Is this page a proper description of SEV Remote Attestation? https://enarx.dev/docs/technical/amd-sev-attestation
@stefanberger the general flow should be, but it seems there are some enarx-specific parts in there.
I found the sev-guest example generally helpful: https://github.com/AMDESE/sev-guest/blob/main/docs/ssh-key-exchange.md
For the proof-of-concept vTPM-inside-SEV-SNP implementation (https://github.com/svsm-vtpm/linux-svsm/blob/svsm-vtpm-preview/README-vtpm.md), the flow glued on top of Keylime can be found here: https://arxiv.org/pdf/2303.16463.pdf
Typing up a few thoughts here about the new architecture. Not fully fleshed out, but I do consider that each of these items could be worked on independently, unless stated otherwise:
1) A `registrar` can automatically add nodes to a `verifier`.
1.1) A "quick and dirty" (yet useful) way of implementing it would be to have a parameter that allows the `registrar` to call the `tenant` code at the end of registration.
1.2) A more permanent possibility would be to rework the `registrar` to perform most of the `tenant` functions directly, maybe eliminating the concept of the `tenant` altogether (although I am not entirely sure of that). Please consider the possible interplay with item 11.
2) Add the ability to "attest only a certain number of times and then stop". After the attestation limit is reached, the `agent` is put in TERMINATED state instead of FAILED.
2.1) We have an interesting use case where a customer wants to boot with the `agent` running on a RAMDISK; then, during the initial adding to a `verifier`, a `tenant` sends a payload containing a key to decrypt the root filesystem on disk. In this scenario, attestation can stop after this first delivery is done.
3) Separate the actual validation of artifacts sent by an `agent` from the evaluation of their contents:
3.1) The main `verifier` code path should only:
3.1.1) execute `tpm2_checkquote`
3.1.2) "replay" the Measured Boot (MB) log and compare values for PCRs [0-9] and [11-14]
3.1.3) "replay" the Integrity Measurement Architecture (IMA) log and compare values for PCR 10
3.2) Evaluate the TPM policy by checking the values of the other PCRs (15-23).
3.2.1) Make the TPM policy more flexible, including not only a list of values for specific PCRs but also information from the TPM clock (e.g., number of reboots).
3.2.2) In case "extended" attributes are sent by the `agent`, such as a "Confidential Compute Attestation Report", there must be a plugin architecture - very much like the Durable Attestation (DA) "backends" - that can evaluate their contents.
3.3) The MB policy engine becomes a plugin, again with the DA "backends" as the model here (see the sketch after this list).
3.3.1) Apply the policy to the (already validated) MB log.
3.3.2) The currently existing policy engine mechanism becomes the "default implementation".
3.3.3) External policy engines (e.g., seedwing, OPA) can be called through this mechanism.
3.3.4) Important debugging feature: a `keylime_boot_policy_run` tool that takes a policy plugin and an MB log as arguments and returns any policy evaluation errors.
3.4) The IMA policy engine also becomes a plugin (same architecture as the other plugins).
3.4.1) Apply the policy to the (already validated) IMA log.
3.4.2) The currently existing policy engine mechanism becomes the "default implementation".
3.4.3) External policy engines (e.g., seedwing, OPA) can be called through this mechanism.
3.4.4) Important debugging feature: a `keylime_runtime_policy_run` tool that takes a policy plugin and an IMA log as arguments and returns any policy evaluation errors.
4) Also add a plugin architecture to deal with "Confidential Computing" attestation reports. There is some work done by other teams on "Attestation as a Service" for "confidential containers", and we (@maugustosilva @galmasi) have a very good working relationship with members of the team developing this. Their "Key Broker Service" has a well-defined API that we could consider using (https://github.com/confidential-containers/kbs/blob/main/docs/kbs_attestation_protocol.md).
5) Also add a plugin architecture to deal with attestation failures: the `verifier` can invoke these plugins for specific agents in case of failures.
6) "Push model": an agent
talks to a registrar
to get its EK/AK on a database, it is automatically added to a verifier
(see item 1) and receives the contact information from such verifier
. From this point on, it can periodically send attestation artifacts to be verified.
6.1.) The contact information can be pulled by the agent
after successful adding to a verifier
, or could be sent by either verifier
or registrar
6.2.) Please note that there has not been much work on previous proposal https://github.com/keylime/enhancements/issues/60 and https://github.com/keylime/enhancements/issues/57.
7) Support for different kinds of signing and encryption keys (in addition to EK/AK), such as IDevID, IAK and SRK.
7.1) HPE plans to open a PR for IDevID and IAK.
9) Remove all `tpm2-tools` command line invocations from the `registrar` and `verifier` code.
9.1) We need to replace the use of `tpm2_makecredential` in the `registrar`.
9.2) We need to replace the use of `tpm2_eventlog` in the `verifier`. There is an ongoing effort to do so at https://github.com/keylime/python3-uefi-eventlog.
9.3) We need to replace the use of `tpm2_checkquote` in both the `verifier` and the `tenant`.
9.4) We need to replace the use of `tpm2_print` in the `verifier`.
10) Redesign the main event loop in the `verifier`: the goal here is to make it more scalable.
11) Currently, the `registrar` does not use the tornado framework. Consider using it.
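To make the plugin idea in items 3.3/3.4 a bit more concrete, here is a minimal sketch of what such a policy-engine plugin interface could look like, loosely modeled on the DA "backends". Every name in it (`PolicyEngine`, `evaluate`, `load_policy_engine`, the policy keys) is hypothetical, not an existing Keylime API:

```python
# Hypothetical plugin interface for the MB/IMA policy engines (items 3.3/3.4).
# None of these names exist in Keylime today; this is only a sketch.
import importlib
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class PolicyEngine(ABC):
    """A pluggable policy engine applied to an already-validated log."""

    @abstractmethod
    def evaluate(self, policy: Dict[str, Any], log: List[Dict[str, Any]]) -> List[str]:
        """Return a list of policy violations (an empty list means success)."""


class DefaultMBPolicyEngine(PolicyEngine):
    """Stand-in for the currently existing engine (the "default implementation")."""

    def evaluate(self, policy, log):
        failures = []
        for event in log:
            # Illustrative check only: flag digests the policy does not allow.
            pcr, digest = event.get("pcr"), event.get("digest")
            if pcr in policy.get("checked_pcrs", []) and digest not in policy.get("allowed_digests", []):
                failures.append(f"unexpected digest in PCR {pcr}")
        return failures


def load_policy_engine(module_name: str, class_name: str) -> PolicyEngine:
    """Load an external engine (e.g., an OPA or seedwing adapter) by dotted path."""
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()
```

The debugging tools in items 3.3.4/3.4.4 would then just instantiate one engine, feed it a policy plus a saved log, and print whatever `evaluate` returns.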
> A `registrar` can automatically add nodes to a `verifier`
> 1.1) A "quick and dirty" (yet useful) way of implementing it would be to have a parameter that allows the `registrar` to call the `tenant` code at the end of registration
> 1.2) A more permanent possibility would be to rework the `registrar` to perform most of the `tenant` functions directly, maybe eliminating the concept of the `tenant` altogether (although I am not entirely sure of that). Please consider the possible interplay with item 11

What is `tenant` here? Our `keylime_tenant` tool that has all these commands and many options with IP addresses, runtime policy, etc.?
> Add the ability to "attest only a certain number of times and then stop". After the attestation limit is reached, the `agent` is put in TERMINATED state instead of FAILED
> 2.1) We have an interesting use case where a customer wants to boot with the `agent` running on a RAMDISK; then, during the initial adding to a `verifier`, a `tenant` sends a payload containing a key to decrypt the root filesystem on disk. In this scenario, attestation can stop after this first delivery is done.

What is 'a `tenant`' in this context?
> 1.2) A more permanent possibility would be to rework the `registrar` to perform most of the `tenant` functions directly, maybe eliminating the concept of the `tenant` altogether (although I am not entirely sure of that). Please consider the possible interplay with item 11

@maugustosilva An interesting suggestion. Simplification is good if it doesn't break important use cases. Did you intend to include an item 11? I don't see it.
> What is `tenant` here? Our `keylime_tenant` tool that has all these commands and many options with IP addresses, runtime policy, etc.?
> What is 'a `tenant`' in this context?
Right, in the original Keylime architecture, the `tenant` was a fourth component (in addition to the `registrar`, `verifier` and `agent`) which was in charge of "closing the loop": it initiated an attestation by requiring the explicit pairing of an `agent` to a `verifier`. It was done this way because Keylime was envisioned to be used in a "shared attestation infrastructure", where multiple "tenants" without any mutual trust had to co-exist on the same `registrar` and `verifier`.
So, yes, this "component" is the one which executes `keylime_tenant` commands.
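(For reference, the explicit pairing performed by this component looks roughly like `keylime_tenant -c add -v <verifier_ip> -t <agent_ip> -u <agent_uuid>`; the exact options vary by version, so treat this invocation as indicative only.)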
> @maugustosilva An interesting suggestion. Simplification is good if it doesn't break important use cases. Did you intend to include an item 11? I don't see it.
Very good point! It is precisely why I was considering the (IMO) simple case where we allow the `registrar` to simply behave as an "in loco `tenant`" and call the same code path. What I am not proposing here is that we formally remove the `tenant` from the architecture, since there are important use cases where it will be needed.
The "in loco tenant" model is better for our use cases. If I understand it correctly we wouldn't have to run the tenant.
The "in loco tenant" model is better for our use cases. If I understand it correctly we wouldn't have to run the tenant.
You did understand correctly. The `registrar` would automatically add a new `agent` to a `verifier`, and an admin wouldn't even have to know about the `tenant` anymore.
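A minimal sketch of what this "in loco tenant" could look like inside the `registrar`; the hook name, config keys, and endpoint below are all hypothetical, not existing Keylime code:

```python
# Hypothetical post-registration hook inside the registrar ("in loco tenant").
# All names below (the hook, config keys, endpoint) are illustrative only.
import requests


def on_registration_complete(agent_uuid: str, agent_ip: str, config: dict) -> None:
    """If auto-add is enabled, reuse the tenant code path to pair the freshly
    registered agent with the configured verifier."""
    if not config.get("auto_add_to_verifier", False):
        return
    verifier_url = f"https://{config['verifier_ip']}:{config['verifier_port']}"
    payload = {
        "cloudagent_ip": agent_ip,
        "runtime_policy": config.get("default_runtime_policy", ""),
        "mb_policy": config.get("default_mb_policy", ""),
    }
    # A real implementation would go through the same code the keylime_tenant
    # tool uses (mTLS, payload encryption, etc.), not a bare POST.
    resp = requests.post(f"{verifier_url}/agents/{agent_uuid}", json=payload, timeout=30)
    resp.raise_for_status()
```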
The "in loco tenant" model is better for our use cases. If I understand it correctly we wouldn't have to run the tenant.
You did understand correctly. The
registrar
would automatically add a newagent
to averifier
and an admin wouldn't have to even know abouttenant
anymore.
... choosing which runtime policy for the monitored machine?
Since we are discussing the architecture redesign, I would like to propose some ideas for consideration:
1 - To support multiple attestation mechanisms/roots of trust other than just the TPM, we could change the registration step so that the agent informs the list of quote types it can generate, and also its supported mode of operation (push or pull). If the list is not provided, it is assumed the agent can only generate the current quote format. This allows backwards compatibility, and also heterogeneous agents running side by side.
2 - Based on the capabilities declared by the agent, the verifier could select which types of quote it wants from the agent by setting new parameters in the GET request. If nothing is set, then the current quote format is assumed, again keeping backwards compatibility.
3 - The message format for the quote should be attestation-mechanism agnostic, able to contain any kind of data. The current quote message format could be one of the supported types, something like:
```json
{
  "type": "TPM",
  "quote": "<base64-encoded JSON string containing the current quote message format>"
}
```
The agent response could contain a list of the requested quotes. If not, then the current quote format is assumed, keeping backwards compatibility. (See also the sketch after this list.)
4 - We should consider redesigning the Keylime protocol to be post-quantum crypto ready and FIPS compliant. Of course, PQ-ready TPM hardware will take years (decades?) to be available, but I'm suggesting we make Keylime flexible where we can (e.g., the keys used for TLS).
5 - The exchange of the `U` and `V` keys encrypted with an RSA key is currently mandatory. When mTLS is enabled, the data is transmitted through a trusted, mutually authenticated encrypted channel; it shouldn't be necessary to encrypt `U` and `V` in this case.
6 - mTLS is difficult to configure due to the complexity of certificate distribution. We should consider supporting other authentication mechanisms that scale better and simplify deployment.
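To illustrate items 1-3 above, here is a minimal sketch of how a verifier might negotiate quote types and unwrap such an envelope. The field names ("type", "quote") follow the JSON example above; everything else (function names, the capability list, the parser registry) is an assumption for illustration, not an existing Keylime API:

```python
# Hypothetical handling of the mechanism-agnostic quote envelope (items 1-3).
import base64
import json
from typing import Callable, Dict, List

# One parser per attestation mechanism; "TPM" wraps the current quote format.
QUOTE_PARSERS: Dict[str, Callable[[bytes], dict]] = {
    "TPM": lambda raw: json.loads(raw),  # current format, base64-encoded JSON
    # "SEV-SNP": parse_snp_report,       # would be added by a future plugin
}


def select_quote_types(agent_capabilities: List[str]) -> List[str]:
    """Pick the quote types the verifier wants from what the agent declared
    at registration; an empty declaration means the legacy TPM-only format."""
    if not agent_capabilities:
        return ["TPM"]
    return [t for t in agent_capabilities if t in QUOTE_PARSERS]


def unwrap_envelope(envelope: dict) -> dict:
    """Decode one {"type": ..., "quote": <base64>} envelope into the
    mechanism-specific quote structure."""
    parser = QUOTE_PARSERS[envelope["type"]]
    return parser(base64.b64decode(envelope["quote"]))
```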
The "in loco tenant" model is better for our use cases. If I understand it correctly we wouldn't have to run the tenant.
You did understand correctly. The
registrar
would automatically add a newagent
to averifier
and an admin wouldn't have to even know abouttenant
anymore.... choosing which runtime policy for the monitored machine?
Good question; there are options: having a default in the verifier, allowing the agent to specify one as part of registration, and the tenant overriding the default.
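In other words, a simple precedence chain could apply; a purely illustrative sketch (none of these names exist in Keylime):

```python
# Illustrative policy-selection precedence (not existing Keylime code):
# explicit tenant override > policy declared at registration > verifier default.
def select_runtime_policy(tenant_override, agent_registered, verifier_default):
    return tenant_override or agent_registered or verifier_default
```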
> Good question; there are options: having a default in the verifier, allowing the agent to specify one as part of registration, and the tenant overriding the default.

How is the initial registration, which is typically done via `keylime_tenant -c add`, to be kicked off in this scenario?
> How is the initial registration, which is typically done via `keylime_tenant -c add`, to be kicked off in this scenario?

I thought the idea was that the registrar would do it automatically.
> I thought the idea was that the registrar would do it automatically.

The registrar would have to know the IP address of the host that's going to be monitored. How does it get that?
Regarding a default policy: if all the systems were the same, there could be a default policy for monitoring the TPM and IMA logs, but I think most environments won't have a collection of homogeneous systems; they will differ, and then each system will need its own policy. I don't think a default policy of monitoring 'nothing', which would apply across non-homogeneous systems, would help much.
These are good questions, @stefanberger, so allow me to offer my experience deploying Keylime in prod (the mileage will most certainly vary, but it is at least one data point from a Cloud production environment).
1) Even in our environment with highly heterogeneous nodes, we ended up with a single measured boot (MB) policy and a single runtime (IMA) policy.
2) The `registrar` component has access to /etc/keylime/tenant.conf, and in this file three important pieces of information are present (see the sketch at the end of this comment):
a) `verifier` IP/port
b) the "default" MB policy
c) the "default" IMA policy
Finally, as you said yourself, even if we resort to "auto-add is just for node identity, no MB, no IMA", it would already be useful (it is at the very least a "pipe cleaner" to detect `verifier` <-> `agent` communication problems).
An additional point: I currently don't see a lot of heterogeneity in the TPM policy (after all, in "full attestation", PCRs 0-9 and 11-14 are already "taken"), but I'm not saying this will never be a problem.
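For illustration, the three items from /etc/keylime/tenant.conf mentioned above could be read by the `registrar` roughly like this. The `verifier_ip`/`verifier_port` options do exist in tenant.conf, but the policy option names below are assumptions, since the exact keys depend on the Keylime version:

```python
# Sketch: a registrar reading the tenant-side defaults it needs for auto-add.
# The policy option names are indicative; check your version's tenant.conf.
import configparser

config = configparser.ConfigParser()
config.read("/etc/keylime/tenant.conf")

verifier_ip = config.get("tenant", "verifier_ip")
verifier_port = config.get("tenant", "verifier_port")
# Hypothetical keys for the "default" MB and IMA policies mentioned above:
default_mb_policy = config.get("tenant", "mb_policy", fallback="")
default_ima_policy = config.get("tenant", "runtime_policy", fallback="")
```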
I guess it depends on how strict/custom-tailored your policies are for those highly heterogeneous nodes whether a common-denominator policy exists or not.
> Finally, as you said yourself, even if we resort to "auto-add is just for node identity, no MB, no IMA", it would already be useful (it is at the very least a "pipe cleaner" to detect `verifier` <-> `agent` communication problems).

Where do the errors go then? To log files?
Closing because the new push model proposal is now being implemented.
Topic: This meeting is intended to discuss how we implement the architecture changes in Keylime.