spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0

[RFC] Certificate Transparency support #1858

Closed Ruide closed 11 months ago

Ruide commented 3 years ago

Co-authored by @elinesterov.

Background

Certificate Transparency (CT) was first proposed to mitigate structural defects of the internet PKI [1]. CT allows the detection of certificates misissued by a Certificate Authority (CA) by providing an open framework for monitoring and auditing the certificates a CA issues. In this way, all issued certificates become openly auditable, and a domain owner or CA can determine whether certificates have been mistakenly or maliciously issued. This protects domain users (as much as possible) from being duped by such certificates. CT is widely adopted by Web PKI and modern browsers.

Currently, the SPIRE server is responsible for the issuance of all SVIDs. SVIDs are conceptually the certificates which CT protects in the SPIRE scenario and the SPIRE server could be conceptually mapped to CA. By incorporating CT to SPIRE, all issued SVIDs become openly auditable to any interested party (internal or external auditor and monitor). Thus, an interested party could determine whether SVIDs have been mistakenly or maliciously issued. And this mitigates the threat of a compromised SPIRE server or stolen SPIRE server identity.

Proposal

To integrate CT with SPIRE, we detail four design variants in [2]. We choose to implement the x509v3 extension design variant in this proposal. The x509v3 extension design variant requires the SPIRE server to send a poisoned precertificate to the CT server, and the CT server returns a signed certificate timestamp (SCT) to the SPIRE server. The SPIRE server then embeds the fetched SCT in the SVID's SCT field. Any issued SVIDs without valid SCT field are untrusted.
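
For readers unfamiliar with the log API, the precertificate submission could look roughly like the sketch below. It assumes the log speaks the RFC 6962 HTTP API (`add-pre-chain`); the function names are hypothetical and are not part of SPIRE or of the plugin described in this proposal. No network call is made, the sketch only builds the request body and decodes a response.

```python
import base64
import json

# OIDs defined by RFC 6962.
CT_POISON_OID = "1.3.6.1.4.1.11129.2.4.3"  # critical poison extension carried by a precertificate
SCT_LIST_OID = "1.3.6.1.4.1.11129.2.4.2"   # SCT list embedded in the final certificate


def build_add_pre_chain_body(precert_der: bytes, issuer_chain_der: list) -> str:
    """Build the JSON body for POST <log URL>/ct/v1/add-pre-chain.

    The first chain element is the precertificate, followed by its issuers.
    """
    chain = [precert_der, *issuer_chain_der]
    return json.dumps({"chain": [base64.b64encode(c).decode("ascii") for c in chain]})


def parse_sct_response(body: str) -> dict:
    """Decode the SCT fields that the log returns as JSON."""
    raw = json.loads(body)
    return {
        "version": raw["sct_version"],
        "log_id": base64.b64decode(raw["id"]),          # SHA-256 of the log's public key
        "timestamp_ms": raw["timestamp"],               # milliseconds since the epoch
        "extensions": base64.b64decode(raw["extensions"]),
        "signature": base64.b64decode(raw["signature"]),
    }
```

The decoded fields are what the server would serialize into the SCT list extension of the final SVID.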

Sample implementation

The following is a description of a sample implementation of the proposed design variant, including the changes needed in the SPIRE server to fetch SCT and issue SVIDs with the SCT field, and the CT components required.

SPIRE

Add a new plugin type to perform the fetching SCT action in the SPIRE server. Have a new plugin for each CT implementation. For example, we provide an internal plugin to support Google's CT implementation. Other implementation details can be found in [2].

CT

The SPIRE server needs to call the API exposed by the CT server to log SVIDs and fetch the SCT needed to mint an SVID with the SCT field. This proposal uses an open-source implementation from Google [3]. The CT server exposes an HTTP port to the SPIRE server. The demo setup of a CT server can be found here [4]. Note that, in the demo setup, the CT server only accepts certificates signed by a fake CA. For the SPIRE use case, an interested user needs to set the path to their CA public key in the roots_pem_file field of the demo-script.cfg file.

Request for Comments

This proposal tries to lay out the changes needed in the SPIRE server and a possible implementation to integrate Certificate Transparency. Any feedback on the general direction of this proposal, any missing points, suggestions, or thoughts in general are greatly appreciated.

evan2645 commented 3 years ago

Thank you very much for opening this @Ruide and @elinesterov! This feature has been requested in the past, and it's great to see a potential contribution of this nature. I'm excited to see SPIRE grow support for this :)

Here is a direct link to the referenced doc in case others missed the hyperlink: https://docs.google.com/document/d/e/2PACX-1vTbhJvLyCzjZhCbYRZqVa_OiTWr7XpwPZSv71hg66rLoLzg2q_rln2fdV698vmlDvMEkmAI0iA5fpdN/pub

SVIDs are conceptually the certificates which CT protects in the SPIRE scenario and the SPIRE server could be conceptually mapped to CA. By incorporating CT to SPIRE, all issued SVIDs become openly auditable to any interested party (internal or external auditor and monitor). Thus, an interested party could determine whether SVIDs have been mistakenly or maliciously issued. And this mitigates the threat of a compromised SPIRE server or stolen SPIRE server identity.

It's important to note that X.509 is only one kind of SVID - there are also JWT-based SVIDs that would not benefit from the implementation of CT. Is that a concern? Seems like it would be... is there any analog available for JWT? Alternatively, should we consider a SPIRE feature that can disable the use/issuance of JWT-SVIDs?

We choose to implement the x509v3 extension design variant in this proposal. The x509v3 extension design variant requires the SPIRE server to send a poisoned precertificate to the CT server, and the CT server returns a signed certificate timestamp (SCT) to the SPIRE server. The SPIRE server then embeds the fetched SCT in the SVID's SCT field.

This does seem to be the path of least friction, and fits well onto existing SPIFFE/SPIRE architecture.

Any issued SVIDs without valid SCT field are untrusted.

Untrusted by... SPIFFE-participating workloads? To be valuable, workloads must actually validate this field - are they guaranteed to? Is this X.509 extension intended to be marked as critical?

Validators will also need to somehow obtain the public key that can be used to verify the SCT. How does this work in practice? I do not see a reasonable way for SPIRE to deliver this information... further, even if we could, I think it would largely violate the threat model as a compromised SPIRE Server could simply inject an attacker-controlled key here. In implementing this feature, I think we need to document and/or demonstrate how users are expected to manage and distribute this key.

I also wonder how this is supposed to work with federation, especially if the extension is marked critical. How should workloads in foreign trust domains obtain the SCT validation key? This challenge is similar to the above observation as most frequently, it is SPIRE Server that is serving the federation api and publicizing its signing keys.

Add a new plugin type to perform the fetching SCT action in the SPIRE server. Have a new plugin for each CT implementation.

The API to be used by the CT Frontend and its consumers appears to be standardized by RFC 6962. What kind of implementation-specific differences do you envision that would warrant a plugin-based approach? IMO, the ideal implementation would build this in as a core feature that works with any supplied log server URL.

Ruide commented 3 years ago

Thank you very much for the great comments @evan2645 ! We put our thoughts in the following. Please let us know what you think!

It's important to note that X.509 is only one kind of SVID - there are also JWT-based SVIDs that would not benefit from the implementation of CT. Is that a concern? Seems like it would be... is there any analog available for JWT? Alternatively, should we consider a SPIRE feature that can disable the use/issuance of JWT-SVIDs?

The CT specification doesn't support JWT-SVIDs, so the current implementation does not consider them. We would need to dive deep into the JWT data structure and figure out how we could integrate it. In theory, we think CT for JWT-SVIDs is doable. However, it may require significant change on the generation of JWT. Yes, including a switch to turn on/off JWT-SVIDs would be helpful.

Any issued SVIDs without valid SCT field are untrusted.

Untrusted by... SPIFFE-participating workloads? To be valuable, workloads must actually validate this field - are they guaranteed to? Is this X.509 extension intended to be marked as critical?

Yes, untrusted by any entity that bases its trust on SVIDs, including SPIFFE workloads. For example, when two workloads want to establish mTLS, they only trust SVIDs with a valid SCT embedded. Yes, ultimately, workloads must actually validate this field. In the long run, this field is intended to be marked as critical. But for the current implementation, it is marked as non-critical so that we can have an easy rollback for error handling. We would like to wait until the CT infrastructure and SPIRE plugin become stabilized before we mark the SCT field critical.

Validators will also need to somehow obtain the public key that can be used to verify the SCT. How does this work in practice? I do not see a reasonable way for SPIRE to deliver this information... further, even if we could, I think it would largely violate the threat model as a compromised SPIRE Server could simply inject an attacker-controlled key here. In implementing this feature, I think we need to document and/or demonstrate how users are expected to manage and distribute this key.

In the four design variants, we have enforcement mode (x509v3 extension, TLS extension, OCSP stapling) and logging mode. For enforcement mode, you are absolutely right: the public key of CT should not be distributed by SPIRE. For logging mode, SPIRE server is trusted, so in this mode, the trust bundle can be distributed by SPIRE as usual. And for enforcement mode, instead of depending on SPIRE to distribute trust bundle, we recommend the users let agent or validator use a 3rd party channel to get public key/keys for validation. That could be done in the similar way as distributing initial trust bundle for spire agent - via URL or CM/deployment process.

The users may also choose to use one root certificate (upstream authority) to sign both the certificate of the SPIRE server and the certificate of CT. We assume the root certificate can be distributed through a 3rd party channel, so the SPIRE agent can fetch the trust bundle and the workload can validate SVIDs with SCTs using the certificate of CT (which chains to the upstream authority). Users who want to set up different roots for SPIRE and CT only need to add both root certificates to the trust bundle distributed by the aforementioned methods. And yes, we agree that we may want to provide an example setup document to guide potential users.

I also wonder how this is supposed to work with federation, especially if the extension is marked critical. How should workloads in foreign trust domains obtain the SCT validation key? This challenge is similar to the above observation as most frequently, it is SPIRE Server that is serving the federation api and publicizing its signing keys.

Yep. It is up to the validator to obtain the public key for SCT verification. We can definitely extend the SPIRE agent to be able to point to something like ct_trust_bundle that will contain all public keys. It could be on a file system or at a URL. Also, the Workload API could be extended to provide an API for SCT validity checks, though we feel this should be implemented by the workload to avoid additional latency.
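
To make the ct_trust_bundle idea a bit more concrete, here is a minimal sketch of how a validator might pick the right key for an SCT. It relies on the RFC 6962 definition of a log ID (the SHA-256 hash of the log's DER-encoded SubjectPublicKeyInfo). The bundle format (a list of base64-encoded keys) and all function names are assumptions for illustration only; nothing like this exists in SPIRE today.

```python
import base64
import hashlib


def log_id_for_key(spki_der: bytes) -> bytes:
    """RFC 6962 log ID: SHA-256 over the log's DER-encoded SubjectPublicKeyInfo."""
    return hashlib.sha256(spki_der).digest()


def load_ct_trust_bundle(entries: list) -> dict:
    """Index trusted log public keys by log ID.

    `entries` is a hypothetical bundle format: one base64 string per trusted
    log key, obtained from a file or URL that SPIRE does not control.
    """
    bundle = {}
    for b64_key in entries:
        key = base64.b64decode(b64_key)
        bundle[log_id_for_key(key)] = key
    return bundle


def key_for_sct(bundle: dict, sct_log_id: bytes):
    """Return the key that should verify an SCT, or None if the log is not trusted."""
    return bundle.get(sct_log_id)
```

A validator would call `key_for_sct` with the log ID found in each embedded SCT and treat a `None` result as "no trusted log vouches for this SVID".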

Add a new plugin type to perform the fetching SCT action in the SPIRE server. Have a new plugin for each CT implementation.

The API to be used by the CT Frontend and its consumers appears to be standardized by RFC 6962. What kind of implementation-specific differences do you envision that would warrant a plugin-based approach? IMO, the ideal implementation would build this in as a core feature that works with any supplied log server URL.

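We have 3 reasons to create a new plugin type:

1 - We have 4 design variants for integrating CT. It would be easy for users to implement their chosen design variant with the new plugin type. Our current implementation of a new internal plugin provides an example of x509v3 extension. However, the interfaces for implementing other design variants are also included in the current new plugin type.
2 - In the future, we may want CT to log extra Metadata. By exposing a plugin type, it would be easier to add interface or change existing interface to log generalized Metadata. So the core can be decoupled. And this design would be easier to extend.
3 - In the current implementation, we support multiple CTs. The SCTs of all CTs would be embedded in SVIDs. By using the new plugin type, it is easy to integrate new CT instance. We can simply add the CT information to server.conf file.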

evan2645 commented 3 years ago

TL;DR: There are still some things to think about here. If I had to guess, the fastest path to upstream feature is a core implementation of X.509v3 extension with no ability to mark the extension as critical. The downside there is that it's a little less useful. The ideal solution would solve for both JWT-SVID and for automatic configuration in the federation use case, however that would need to spend time with SIG-Spec... probably a lot of time.

More detailed responses below

In theory, we think CT for JWT-SVIDs is doable. However, it may require significant change on the generation of JWT. Yes, including a switch to turn on/off JWT-SVIDs would be helpful.

We should carefully consider if a feature to disable JWT-SVIDs should be required as part of this work. I think the assumption here is that, if someone is using CT, then they really do not want to be using SVIDs that do not have a cryptographically verifiable proof of audit... Leaving an SVID type readily available for use that cannot meet that requirement feels pretty dangerous.

I think the ideal situation is that we can solve CT for JWT-SVIDs... though I also realize that it is probably (very) over-ambitious :). If you see a low overhead path for this, I'd love to hear more.

Yes, ultimately, workloads must actually validate this field. In the long run, this field is intended to be marked as critical. But for the current implementation, it is marked as non-critical so that we can have an easy rollback for error handling. We would like to wait until the CT infrastructure and SPIRE plugin become stabilized before we mark the SCT field critical.

The value of this feature really depends on validators enforcing it. In practice, I've found this to be a pretty weak guarantee - JWT aud and associated vulnerabilities is a great example.

Marking the field as critical solves this problem, but also introduces others. Federation in this context is the largest challenge I think (more on that below).

What is the behavior of the most popular X.509/TLS libraries when they encounter a critical SCT extension? Does validation fail if validation of the SCT signature fails? How do those libraries typically obtain the SCT validation public key?

For logging mode, SPIRE server is trusted

From reading the Google Doc, it seemed like an explicit goal of this work is to mitigate SPIRE Server compromise - is that accurate? If so, I think that logging mode is a non-starter?

And for enforcement mode, instead of depending on SPIRE to distribute trust bundle, we recommend the users let agent or validator use a 3rd party channel to get public key/keys for validation. That could be done in the similar way as distributing initial trust bundle for spire agent - via URL or CM/deployment process.

Do popular libraries allow you to configure them with a URL for fetching this key? That would be easy for us to reason about documentation-wise.

I also wonder how this is supposed to work with federation, especially if the extension is marked critical. How should workloads in foreign trust domains obtain the SCT validation key? This challenge is similar to the above observation as most frequently, it is SPIRE Server that is serving the federation api and publicizing its signing keys.

Yep. It is up to the validator to obtain the public key for SCT verification. We can definitely extend the SPIRE agent to be able to point to something like ct_trust_bundle that will contain all public keys. It could be on a file system or at a URL. Also, the Workload API could be extended to provide an API for SCT validity checks, though we feel this should be implemented by the workload to avoid additional latency.

I worry about extending either the bundle or the workload API as both of those things are part of the SPIFFE spec.

Thinking about this further, it feels like marking the SCT extension as critical will severely devalue SPIFFE Federation. A very small percentage of TLS-enabled workloads are explicitly configured for handling SCT. Or perhaps I should say, I have personally seen zero (maybe some sample bias there). While a given company or organization may be able to align on SCT validation, what happens when you need to communicate outside of those boundaries? Normally, this can be seamlessly enabled by simply establishing a federation relationship at the control plane level. If the SCT extension were to be marked as critical, it would require the remote workload's TLS configuration to be modified for federation to work - a problem that SPIFFE is explicitly trying to solve.

Is it common for SCT keys to be served by an endpoint that can be validated using Web PKI?

We have 3 reasons to create a new plugin type: ... We have 4 design variants for integrating CT. It would be easy for users to implement their chosen design variant with the new plugin type. Our current implementation of a new internal plugin provides an example of x509v3 extension. However, the interfaces for implementing other design variants are also included in the current new plugin type.

It seems to me that only the X.509v3 and logging extension options are fully implementable as a server plugin. The other options all require agent modification?

In the future, we may want CT to log extra Metadata. By exposing a plugin type, it would be easier to add interface or change existing interface to log generalized Metadata. So the core can be decoupled. And this design would be easier to extend.

Code changes in core are required to extend the interface to pass additional metadata to the plugin. Since core changes are required anyway, why not just implement those reporting features in core?

In the current implementation, we support multiple CTs. The SCTs of all CTs would be embedded in SVIDs. By using the new plugin type, it is easy to integrate new CT instance. We can simply add the CT information to server.conf file.

This is an interesting point that I did not consider, though I still feel that we could implement in core the ability to configure more than one CT endpoint. You'd retain the ability to add CT information by editing server.conf file.
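
As a rough illustration of the core-based alternative, the server could loop over every configured log URL and gather one SCT per log. This is only a sketch: `collect_scts` and the `submit` callback are hypothetical names, not existing SPIRE APIs, and the all-or-nothing failure policy is an assumption rather than something decided in this thread.

```python
from typing import Callable, Dict, List


def collect_scts(log_urls: List[str], submit: Callable[[str], Dict]) -> List[Dict]:
    """Submit the precertificate to every configured log and return one SCT per log.

    `submit` is a hypothetical callback that sends the precertificate to a
    single log URL and returns the parsed SCT; it raises on failure.
    """
    scts = []
    for url in log_urls:
        try:
            scts.append(submit(url))
        except Exception as err:
            # Assumed policy: refuse to mint an SVID unless every log answered.
            raise RuntimeError(f"CT log {url} did not return an SCT: {err}") from err
    return scts
```

Adding another log would then only mean adding its URL to the configured list.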

Ruide commented 3 years ago

Hi @evan2645 , sorry for the delay, and thanks for the great feedback as usual!

TL;DR: There are still some things to think about here. If I had to guess, the fastest path to upstream feature is a core implementation of X.509v3 extension with no ability to mark the extension as critical. The downside there is that it's a little less useful. The ideal solution would solve for both JWT-SVID and for automatic configuration in the federation use case, however that would need to spend time with SIG-Spec... probably a lot of time.

More detailed responses below

In theory, we think CT for JWT-SVIDs is doable. However, it may require significant change on the generation of JWT. Yes, including a switch to turn on/off JWT-SVIDs would be helpful.

We should carefully consider if a feature to disable JWT-SVIDs should be required as part of this work. I think the assumption here is that, if someone is using CT, then they really do not want to be using SVIDs that do not have a cryptographically verifiable proof of audit... Leaving an SVID type readily available for use that cannot meet that requirement feels pretty dangerous.

I think the ideal situation is that we can solve CT for JWT-SVIDs... though I also realize that it is probably (very) over-ambitious :). If you see a low overhead path for this, I'd love to hear more.

In our current usage, we do not take advantage of JWT-SVIDs. So our first priority is to integrate with X509-SVIDs. That said, I think if we want to provide Transparency for JWT-SVID as well, the following would be a tentative route.

1 - We require the workload's validation of a JWT to verify not only the signature over the JWT, but also the Signed JWT Timestamp (SJT).
2 - We require an implementation of a JWT frontend based on the Trillian database. See https://github.com/google/trillian-examples.
3 - We need to extend the current SPIFFE-JWT with a Signed JWT Timestamp (SJT) to form SPIFFE-JWT-CT.

The overall layout is pictured as follows:

         +------------------------------+     +---------------------------+
         |SignedJWTTimestamp            |     |SPIFFE-JWT-CT              |
         |------------------------------|     |---------------------------|
         |SJTVersion:(StringOrURI)      |     +------------------------+  |
         |LogID:(StringOrURI)           |     | Header                 |  |
         |timestamp:(StringOrURI)       |     |------------------------|  |
         |reservedExtension             |     | alg                    |  |
         |signature(JWS)                |     +------------------------+  |
         +------------------------------+     |                           |
              |digitally-signed               +------------------------+  |
    CT privkey|over   +-----------------+     |Body                    |  |
              +------>|SPIFFE-JWT       |     |------------------------|  |
                      |-----------------|     |sub                     |  |
                      |---------------+ |     |aud                     |  |
                      |Header         | |     |exp                     |  |
                      |---------------| |     +---------------------+  |  |
       JWS            |alg            | |     |SignedJWTTimestamp   |  |  |
     PAYLOAD +------> |---------------+ |     +---------------------+--+  |
        +             |---------------+ |     |                           |
        |             |Body           | |     |                           |
        |             |---------------| |     +------------------------+  |
        v             |sub            | |     |Signature               |  |
    logged in         |aud            | |     |------------------------|  |
Transparency log      |exp            | |     |over(header+"."+payload)|  |
                      |---------------+ |     +------------------------+  |
                      +-----------------+     +---------------------------+
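
The layout above can be sketched in a few lines of Python. This is a toy model under loud assumptions: HMAC-SHA256 stands in for the asymmetric signatures a real issuer and log would use, the claim name `SignedJWTTimestamp` and the SJT field names are taken from the diagram, and `mint_jwt_ct` / `verify_jwt_ct` are hypothetical helpers. As in the diagram, the log signs over the plain SPIFFE-JWT header and body, and the issuer then signs over the header and the extended body.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_json(segment: str) -> dict:
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def sign(key: bytes, message: str) -> str:
    # Stand-in only: a real deployment would use asymmetric keys, not HMAC.
    return b64url(hmac.new(key, message.encode("ascii"), hashlib.sha256).digest())


def signing_input(header: dict, claims: dict) -> str:
    # Canonical encoding so the verifier can rebuild the exact bytes the log signed.
    enc = lambda obj: b64url(json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8"))
    return enc(header) + "." + enc(claims)


def mint_jwt_ct(claims: dict, issuer_key: bytes, log_key: bytes, log_id: str, timestamp: str) -> str:
    header = {"alg": "HS256"}
    # Step 1: the transparency log signs over the plain SPIFFE-JWT.
    sjt = {
        "SJTVersion": "1",
        "LogID": log_id,
        "timestamp": timestamp,
        "signature": sign(log_key, signing_input(header, claims)),
    }
    # Step 2: the issuer embeds the SJT and signs header + "." + payload.
    final_input = signing_input(header, dict(claims, SignedJWTTimestamp=sjt))
    return final_input + "." + sign(issuer_key, final_input)


def verify_jwt_ct(token: str, issuer_key: bytes, log_key: bytes) -> bool:
    head_b64, body_b64, sig = token.split(".")
    if not hmac.compare_digest(sig, sign(issuer_key, head_b64 + "." + body_b64)):
        return False
    header, body = b64url_json(head_b64), b64url_json(body_b64)
    sjt = body.pop("SignedJWTTimestamp", None)
    if sjt is None:
        return False
    # Rebuild the plain SPIFFE-JWT (claims without the SJT) and check the log signature.
    return hmac.compare_digest(sjt["signature"], sign(log_key, signing_input(header, body)))
```

A workload would accept the token only when both checks pass, which corresponds to requirement 1 in the list above.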

Yes, ultimately, workloads must actually validate this field. In the long run, this field is intended to be marked as critical. But for the current implementation, it is marked as non-critical so that we can have an easy rollback for error handling. We would like to wait until the CT infrastructure and SPIRE plugin become stabilized before we mark the SCT field critical.

The value of this feature really depends on validators enforcing it. In practice, I've found this to be a pretty weak guarantee - JWT aud and associated vulnerabilities is a great example.

That is true. What I can think of is to put it in the internal RPC SDK or Istio Envoy mTLS, so it would be secure by default.

Marking the field as critical solves this problem, but also introduces others. Federation in this context is the largest challenge I think (more on that below).

What is the behavior of the most popular X.509/TLS libraries when they encounter a critical SCT extension? Does validation fail if validation of the SCT signature fails? How do those libraries typically obtain the SCT validation public key?

For the web PKI case, from my discussion with Al from Google TrustFabric, the browser bakes the trusted CT certificates into its image (see the trusted CT log list here: https://www.gstatic.com/ct/log_list/v2/log_list.json). Libraries typically require a file path to a list of trusted CT log operators. For the current web PKI, the SCT List field is set to non-critical (you can check this in the certificate details for github.com; here is the code that mints the SCT list extension field, https://github.com/google/certificate-transparency-go/blob/7710282e49162cbd95c500777522f436fd5fc279/x509/x509.go#L2582, where you can see the critical tag field is not set). If the SCT List field is set to critical, I assume validation would fail if the library cannot parse this extension field, since that is the expected semantics of the critical field. But the validation relies entirely on application code (or Envoy mTLS code).
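
The critical-flag semantics mentioned here come from RFC 5280: a validator must reject a certificate that carries a critical extension it does not recognize, may ignore an unrecognized non-critical extension, and must process any extension it does recognize. A tiny sketch of that rule follows; `extension_verdict` is a hypothetical helper, not a function of any real TLS library.

```python
SCT_LIST_OID = "1.3.6.1.4.1.11129.2.4.2"  # embedded SCT list extension (RFC 6962)


def extension_verdict(oid: str, critical: bool, understood_oids: set) -> str:
    """Apply the RFC 5280 rule for a single certificate extension."""
    if oid in understood_oids:
        return "process"  # recognized extensions must be processed
    # Unrecognized: fatal if critical, otherwise safe to skip.
    return "reject" if critical else "ignore"
```

This is why a critical SCT list would break handshakes with peers that have no CT support, while a non-critical one is silently skipped by them.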

For logging mode, SPIRE server is trusted

From reading the Google Doc, it seemed like an explicit goal of this work is to mitigate SPIRE Server compromise - is that accurate? If so, I think that logging mode is a non-starter?

The intention of having a logging mode is to test the availability of the CT infrastructure and to try out the integration. But you are right, trying it with the SCT list field set to non-critical could probably serve the same purpose.

And for enforcement mode, instead of depending on SPIRE to distribute trust bundle, we recommend the users let agent or validator use a 3rd party channel to get public key/keys for validation. That could be done in the similar way as distributing initial trust bundle for spire agent - via URL or CM/deployment process.

Do popular libraries allow you to configure them with a URL for fetching this key? That would be easy for us to reason about documentation-wise.

I have not seen libraries that can be configured with a URL for fetching this key. They mostly rely on a local file for the trusted CT log list.

I also wonder how this is supposed to work with federation, especially if the extension is marked critical. How should workloads in foreign trust domains obtain the SCT validation key? This challenge is similar to the above observation as most frequently, it is SPIRE Server that is serving the federation api and publicizing its signing keys.

Yep. It is up to the validator to obtain the public key for SCT verification. We can definitely extend the SPIRE agent to be able to point to something like ct_trust_bundle that will contain all public keys. It could be on a file system or at a URL. Also, the Workload API could be extended to provide an API for SCT validity checks, though we feel this should be implemented by the workload to avoid additional latency.

I worry about extending either the bundle or the workload API as both of those things are part of the SPIFFE spec.

I think we may want to follow the design of web PKI, where we do not rely on SPIRE to distribute the CT log identity, but instead distribute it in the OS image, e.g. baked into /etc/ssl/certs. As you mentioned before, though, rotation and revocation may require careful treatment.

Thinking about this further, it feels like marking the SCT extension as critical will severely devalue SPIFFE Federation. A very small percentage of TLS-enabled workloads are explicitly configured for handling SCT. Or perhaps I should say, I have personally seen zero (maybe some sample bias there). While a given company or organization may be able to align on SCT validation, what happens when you need to communicate outside of those boundaries? Normally, this can be seamlessly enabled by simply establishing a federation relationship at the control plane level. If the SCT extension were to be marked as critical, it would require the remote workload's TLS configuration to be modified for federation to work - a problem that SPIFFE is explicitly trying to solve.

I think the best we can do is to mark it as non-critical and rely on the RPC SDK or Envoy proxy to enforce SCT validation. That would address the boundary issue you mention here, especially for the external organization case.

Is it common for SCT keys to be served by an endpoint that can be validated using Web PKI?

It is common for CT log certificates to be baked into the browser binary.

We have 3 reasons to create a new plugin type: ... We have 4 design variants for integrating CT. It would be easy for users to implement their chosen design variant with the new plugin type. Our current implementation of a new internal plugin provides an example of x509v3 extension. However, the interfaces for implementing other design variants are also included in the current new plugin type.

It seems to me that only the X.509v3 and logging extension options are fully implementable as a server plugin. The other options all require agent modification?

You are definitely correct. X.509v3 extension would be the recommended mode. We list the other modes here for completeness (in case people want to know why we chose this mode).

In the future, we may want CT to log extra Metadata. By exposing a plugin type, it would be easier to add interface or change existing interface to log generalized Metadata. So the core can be decoupled. And this design would be easier to extend.

Code changes in core are required to extend the interface to pass additional metadata to the plugin. Since core changes are required anyway, why not just implement those reporting features in core?

I think for the initial development, by separating it into the plugin framework, we can minimize the code we need to add to core, so we do not need to worry about it intruding too much and breaking core. In the future, we might want it in core. But I am probably not the best person to make a suggestion, since I have not read through all the code in SPIRE :)

In the current implementation, we support multiple CTs. The SCTs of all CTs would be embedded in SVIDs. By using the new plugin type, it is easy to integrate new CT instance. We can simply add the CT information to server.conf file.

This is an interesting point that I did not consider, though I still feel that we could implement in core the ability to configure more than one CT endpoint. You'd retain the ability to add CT information by editing server.conf file.

Yea, in terms of whether to put it in core or the plugin framework, you must have a better understanding than I do. In theory, both should work. My concern was mainly whether CT should be a default feature of SPIRE or an optional one. In my opinion, we might want it to be optional. Putting the whole CT codebase in the plugin framework makes it feel more like an optional plug-and-play feature, and that is also how it is currently implemented.

Again, thank you for your comments!