spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0

DNS/HTTP Node Attestor #4788

Open kfox1111 opened 6 months ago

kfox1111 commented 6 months ago

On bare metal nodes without TPMs, it would be very nice if initial attestation could be bootstrapped over HTTP/DNS, the way ACME does it, rather than needing to ssh in (and accept an untrusted key) and use a join token. It wouldn't need to be ACME itself, but something that functions similarly.

kfox1111 commented 5 months ago

My current thinking is to start with the x509pop plugin, copy it to a plugin named 'http', and then modify it as follows:

For the server plugin, change its Attest function, removing the X.509 certificate validation bits. Then change the challenge to generate a 'token' as per https://datatracker.ietf.org/doc/html/rfc8555#section-8.3. The token is returned to the agent.

The agent would then start a webserver on port 80 (the default) or any configured port. (If the port is not 80, something else on the host needs to proxy from port 80 to the chosen port.) The agent would serve only "/.well-known/acme-challenge/$token", as per the ACME RFC. The content would be the token.

Once the webserver is started, the agent would tell the server that it's ready and send its proposed DNS name. The server plugin would first validate the DNS name against a configured list of regexes describing the DNS names it is willing to test. If the name matches, it fetches the document from the agent's webserver and validates that the token matches. If everything passes, it generates a node identity with a selector matching the attested DNS name.
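The token handling described above can be sketched roughly as follows. This is only an illustration in Python, not the plugin's code (SPIRE itself is written in Go); the function names `new_token`, `challenge_path`, `challenge_url`, and `body_matches` are hypothetical, and the 32-byte token length is an assumption chosen to exceed the 128 bits of entropy that RFC 8555 asks for.

```python
import hmac
import secrets

# Path prefix taken from the ACME http-01 challenge (RFC 8555, section 8.3).
WELL_KNOWN_PREFIX = "/.well-known/acme-challenge/"


def new_token() -> str:
    """Server side: random base64url token without padding.

    Assumption: 32 random bytes (256 bits of entropy).
    """
    return secrets.token_urlsafe(32)


def challenge_path(token: str) -> str:
    """Agent side: the only path the temporary webserver would serve."""
    return WELL_KNOWN_PREFIX + token


def challenge_url(hostname: str, token: str, port: int = 80) -> str:
    """Server side: URL to fetch from the agent's proposed DNS name."""
    host = hostname if port == 80 else f"{hostname}:{port}"
    return f"http://{host}{challenge_path(token)}"


def body_matches(expected_token: str, body: bytes) -> bool:
    """Server side: constant-time comparison of the fetched document."""
    return hmac.compare_digest(expected_token.encode("ascii"), body.strip())
```

For example, `challenge_url("foo.example.org", token)` yields the URL the server would fetch, and `body_matches(token, fetched_body)` decides whether attestation may proceed.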

aaomidi commented 5 months ago

So the concern I have with this flow is:

Some assumptions:

The ideal scenario is:

Now imagine this scenario:

At this point the bad actor has successfully hijacked the issued identity.

Note: I may be making some wrong assumptions here, that may make this not really a possible attack.

Let me think a bit more about how this would work securely.

aaomidi commented 5 months ago

So, I think this can be made a bit safer, without a ton of changes to the flow, by using a self-signed mTLS identity on the client side.

The server would need to be configured to trust any client certificate when establishing an mTLS session for the attestation endpoint.

With that in place, this DNS/HTTP auth plugin can be designed so that, for its entire lifetime, a challenge is scoped to that specific client certificate. E.g. a client presenting a different certificate cannot interject itself into another challenge-response flow.
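One way to picture scoping a challenge to a client certificate is a registry keyed by the certificate's fingerprint. The sketch below is purely illustrative: `ChallengeRegistry` and its methods are hypothetical names, and using the SHA-256 digest of the DER-encoded certificate as the key is an assumption, not something specified in this thread.

```python
import hashlib
import hmac
import secrets


class ChallengeRegistry:
    """Tracks one pending challenge per client certificate (hypothetical)."""

    def __init__(self) -> None:
        self._pending = {}  # certificate fingerprint -> token

    @staticmethod
    def fingerprint(cert_der: bytes) -> str:
        # Assumption: SHA-256 over the DER bytes identifies the certificate.
        return hashlib.sha256(cert_der).hexdigest()

    def issue(self, cert_der: bytes) -> str:
        """Create a token bound to the presenting certificate."""
        token = secrets.token_urlsafe(32)
        self._pending[self.fingerprint(cert_der)] = token
        return token

    def redeem(self, cert_der: bytes, token: str) -> bool:
        """Accept the token only from the certificate it was issued to.

        The pending entry is removed on lookup, so a token is single-use.
        """
        expected = self._pending.pop(self.fingerprint(cert_der), None)
        if expected is None:
            return False
        return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

A token issued to one certificate is rejected when presented under another, which is the property described above.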

kfox1111 commented 5 months ago

Ah. I see... I'll think some more on this too. Thanks. :)

kfox1111 commented 5 months ago

Looking at the plugin code, the plugin will only respond to the request from the same client TCP stream (?), so I'm not sure a bad actor can man-in-the-middle that process.

If they can, it looks like there is a piece in the ACME protocol meant to handle that: https://datatracker.ietf.org/doc/html/rfc8555#section-8.3

The initial client request is made with a JWK key pair, and the client is expected to put its public key fingerprint at the HTTP token URL as well. If we did the same, it would also close the loop, I think?
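For reference, ACME's http-01 challenge serves a "key authorization": the token, a dot, and the base64url-encoded SHA-256 JWK thumbprint of the account key (RFC 8555 section 8.1, RFC 7638). A minimal Python sketch follows; the function names are hypothetical, and only EC and RSA keys are handled here for brevity.

```python
import base64
import hashlib
import json

# Required JWK members per key type, as listed in RFC 7638 section 3.2.
_REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "RSA": ("e", "kty", "n"),
}


def b64url(data: bytes) -> str:
    """base64url encoding without trailing padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 thumbprint over the canonical JSON of the required members."""
    members = _REQUIRED_MEMBERS[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        sort_keys=True,           # lexicographic member order
        separators=(",", ":"),    # no whitespace
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    """Content an ACME client serves at the challenge URL."""
    return f"{token}.{jwk_thumbprint(jwk)}"
```

If this plugin borrowed the same idea, the server would compare the fetched document against `key_authorization(token, agent_public_jwk)` instead of the bare token, binding the challenge to the key that opened the request.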

aaomidi commented 5 months ago

> The initial client request is made with a JWK key pair, and the client is expected to put its public key fingerprint at the HTTP token URL as well. If we did the same, it would also close the loop, I think?

Yes, ACME gets around this by using the ACME account. I didn't know if you wanted to build an ACME account model here.

> Looking at the plugin code, the plugin will only respond to the request from the same client TCP stream (?), so I'm not sure a bad actor can man-in-the-middle that process.

I think it should be fine as long as the bad actor isn't a layer 7 proxy that you connect through to reach the server.

Note that the layer 7 proxy would also have to be the one terminating TLS, so it would come down to whether you trust the server's TLS certificate at that point.

kfox1111 commented 5 months ago

Ah, gotcha.

Not sure we need to adopt all of ACME, but there may be some advantage to reusing the bits of their protocol that work, to solve all the same problems? I could go either way though.

aaomidi commented 5 months ago

Honestly, I think if this is scoped to a single TCP connection, and new TCP connections would have to fully restart the flow, you'd solve the majority of my concerns with this.

The only other stipulation being that the server MUST be protected by TLS for this to work properly.

kfox1111 commented 5 months ago

> Honestly, I think if this is scoped to a single TCP connection, and new TCP connections would have to fully restart the flow, you'd solve the majority of my concerns with this.

I think that is currently true with SPIRE's current plugin model? Can anyone double-check that assumption?

> The only other stipulation being that the server MUST be protected by TLS for this to work properly.

Just to double-check, you're referring here to the server plugin hosted out of SPIRE, which is TLS protected?

The temporary webserver for the handshake can be HTTP-only?

aaomidi commented 5 months ago

> I think that is currently true with SPIRE's current plugin model? Can anyone double-check that assumption?

If so, then I think the initial proposal wouldn't create any concerns.

> Just to double-check, you're referring here to the server plugin hosted out of SPIRE, which is TLS protected? The temporary webserver for the handshake can be HTTP-only?

Yes & Yes

amartinezfayo commented 5 months ago

Thank you @kfox1111 for bringing this up, and thank you @aaomidi for your feedback. We discussed this in the last maintainers' call, and we think that the absence of an attestor for bare metal nodes without TPMs is a real problem that we want to address in the project. Any solution to this problem will involve trusting a third party; in the proposed solution, that would be DNS. We haven't explored whether there are better options, so we are open to other solutions as well.

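@kfox1111 If you think that a DNS/HTTP node attestor is the best option, and in the absence of other proposals, it would be great to make progress on scoping the work that needs to be done for the proposed solution, including some more details about the implementation, configuration and the mechanics of the attestation. Some of the important aspects that we need to figure out in order to have a clear scope are:

- Configuration of the plugin in the server and in the agent.
- Challenge/Response flow diagram.
- How multiple hosts talking behind the same DNS record is handled.
- DNSSEC support.
- Shape of the SPIFFE ID of agents attested by the attestor.
- Selectors produced by the server plugin.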

I'm sure there are other things to figure out also, but finding answers to those items will help a lot to have this scoped.

Thanks again @kfox1111 for bringing this to our attention!

evan2645 commented 5 months ago

> I think that is currently true with SPIRE's current plugin model? Can anyone double-check that assumption?

Yes, that is correct. SPIRE server/agent node attestation is a bi-directional gRPC stream. It remains open until node attestation is complete. In the case we're discussing, the server will initiate the challenge check while the agent is blocked on it, and the server will unblock the agent after success. So the whole process is covered by a single stream lifetime.

Thanks @amartinezfayo for the guidance. I agree that answers to those points will help move the issue out of unscoped. Considering my comment above, as far as the flows go, a starting point can be:

Agent -> configured DNS name -> Server
Agent <- nonce <- Server (agent binds random port and serves nonce)
Agent -> port number -> Server (server checks the nonce)
Agent <- success/SVID <- Server
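The middle two steps of this flow (agent binds a random port and serves the nonce, server checks it) can be sketched end to end in Python. This is a loopback-only illustration under stated assumptions: `start_challenge_server` and `check_nonce` are hypothetical names, and reusing the ACME well-known path for the nonce is an assumption carried over from the earlier proposal, not part of the flow above.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Assumption: reuse the ACME http-01 path for the nonce document.
CHALLENGE_PREFIX = "/.well-known/acme-challenge/"

# Opener that ignores any proxy settings from the environment.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def start_challenge_server(nonce: str, bind_host: str = "127.0.0.1") -> ThreadingHTTPServer:
    """Agent side: bind a random free port and serve only the nonce."""
    expected_path = CHALLENGE_PREFIX + nonce
    body = nonce.encode("ascii")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != expected_path:
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    server = ThreadingHTTPServer((bind_host, 0), Handler)  # port 0 = random port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check_nonce(host: str, port: int, nonce: str, timeout: float = 5.0) -> bool:
    """Server side: fetch the document and compare it to the nonce."""
    url = f"http://{host}:{port}{CHALLENGE_PREFIX}{nonce}"
    try:
        with _DIRECT.open(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == nonce.encode("ascii")
    except OSError:  # connection failures and HTTP error statuses
        return False
```

The agent would report `server.server_address[1]` as the port number in the third step, and the SPIRE server would call something like `check_nonce(dns_name, port, nonce)` before issuing the SVID.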

I'm sure it will change as we get answers to e.g. multiple hosts, configuration (dns server config?) etc.

One nice thing about this attestation type is it's repeatable.

kfox1111 commented 5 months ago

I will work on these things, but initial thoughts are inline:

> Configuration of the plugin in the server and in the agent.

server:
  dns_patterns: # Optional list of regexes that the DNS hostname must match. If empty, all DNS names are allowed. If the list is non-empty and none match, the request is rejected.
  - <regex>
  - <regex>
agent:
  hostname: # Optional. If unset, use the hostname as detected on the node. If running in a container, this may need to be set explicitly.
  port: 80 # Optional port to listen on. Default is 80. If not 80, some other webserver on the host needs to forward port 80 to whatever port is chosen here.
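The `dns_patterns` semantics described in the comments above could look like this in Python. The function name `hostname_allowed` is hypothetical, and requiring a full match (rather than a substring match) is an assumption; the thread does not say how the regexes are anchored.

```python
import re
from typing import Sequence


def hostname_allowed(hostname: str, dns_patterns: Sequence[str]) -> bool:
    """Apply the dns_patterns rule from the proposed server config.

    An empty list allows every hostname; otherwise at least one pattern
    must match. Assumption: patterns must match the whole hostname.
    """
    if not dns_patterns:
        return True
    return any(re.fullmatch(pattern, hostname) for pattern in dns_patterns)
```

With a full match, a pattern such as `.*\.example\.org` accepts `foo.example.org` but rejects `foo.example.org.evil.net`, which an unanchored search would let through.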

> Challenge/Response flow diagram.

Will work on this. Some potential details discussed above.

I'm thinking of sticking to port 80 from server -> agent for the reasons described in the ACME http-01 documentation. (Short answer: it is one of the most firewall-friendly protocol/port combinations. Random ports can cause problems for some orgs, and low ports can carry extra security too.)

> How multiple hosts talking behind the same DNS record is handled.

I think this would not be allowed. Each node that wants to attest needs to have its own DNS entry, and the selector returned is that DNS name, so it uniquely identifies the node. ACME http-01 assumes this as well, I believe.

> DNSSEC support.

I think this is transparent for HTTP. The DNS entry the server looks up is just a bit more trustworthy.

If there were a pure DNS attestor like the ACME dns-01 challenge, then DNSSEC would help with that, I think. But for the scope of this plugin, I'm thinking of limiting it to HTTP attestation, utilizing DNS just for hostname lookups. So akin to ACME http-01 only.

> Shape of the SPIFFE ID of agents attested by the attestor.

spiffe://$trustDomain/spire/agent/http/hostname/foo.example.org

> Selectors produced by the server plugin.

http:hostname:foo.example.org
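Both strings can be derived from the attested hostname. A small Python sketch, with hypothetical helper names; lower-casing the hostname and dropping a trailing dot are assumptions added here so that equivalent DNS names map to one identity, and are not part of the proposal.

```python
def normalize_hostname(hostname: str) -> str:
    """Assumption: treat DNS names case-insensitively and drop a trailing dot."""
    return hostname.rstrip(".").lower()


def agent_spiffe_id(trust_domain: str, hostname: str) -> str:
    """SPIFFE ID shape proposed above for agents attested by this plugin."""
    return f"spiffe://{trust_domain}/spire/agent/http/hostname/{normalize_hostname(hostname)}"


def hostname_selector(hostname: str) -> str:
    """Selector proposed above, in type:key:value form."""
    return f"http:hostname:{normalize_hostname(hostname)}"
```

For instance, `agent_spiffe_id("example.org", "foo.example.org")` gives `spiffe://example.org/spire/agent/http/hostname/foo.example.org`.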

kfox1111 commented 5 months ago

I think the main thing left is finalizing the details of the communication flows?

I was thinking about @evan2645's suggestion of random ports again, and could see an advantage to that when there are multiple spire-agents on the same node (used to attest to different SPIRE servers). I can also see some benefits in restricting the port to 80 for easier internet traversal.

So maybe it should be configurable on both sides? The agent allows specifying the port to use and passes it to the server, and the server can force the port to always be 80 when the server is intended to traverse the internet. Maybe even default to 80 unless the user overrides it?

In that case, the config might be:

server:
  dns_patterns: # Optional list of regexes that the DNS hostname must match. If empty, all DNS names are allowed. If the list is non-empty and none match, the request is rejected.
  - <regex>
  - <regex>
  allow_alternate_ports: false # Optional flag. Defaults to false. Allow the agent to specify what port to use. Otherwise, it must be port 80.
agent:
  hostname: # Optional. If unset, use the hostname as detected on the node. If running in a container, this may need to be set explicitly.
  port: 80 # Optional port to listen on. Default is 80. If not 80, some other webserver on the host needs to forward port 80 to whatever port is chosen here.
  advertised_port: 80 # Optional port to tell the spire-server to use for contact. Defaults to port 80. Used along with the spire-server setting allow_alternate_ports=true.
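How the server might combine `allow_alternate_ports` with the agent's `advertised_port` is sketched below in Python. The names `resolve_challenge_port` and `PortNotAllowed` are hypothetical, and rejecting a non-80 port (instead of silently overriding it to 80) is an assumption; the comments above leave room for either behavior.

```python
class PortNotAllowed(Exception):
    """Raised when the agent advertises a port the server will not use."""


def resolve_challenge_port(allow_alternate_ports: bool, advertised_port: int = 80) -> int:
    """Pick the port the server contacts for the challenge.

    Assumption: when alternate ports are disabled, a non-80 port is
    rejected rather than overridden.
    """
    if not 1 <= advertised_port <= 65535:
        raise ValueError(f"invalid port: {advertised_port}")
    if advertised_port != 80 and not allow_alternate_ports:
        raise PortNotAllowed(
            f"agent advertised port {advertised_port}, but only port 80 is allowed"
        )
    return advertised_port
```

Rejecting keeps a misconfigured agent from failing later with a confusing connection error on port 80.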

kfox1111 commented 5 months ago

I started to work up the documentation around this: https://github.com/spiffe/spire/pull/4909

I also scaffolded a bit, based on the x509pop plugins.

kfox1111 commented 5 months ago

Hmm... should the plugin be named 'http' or 'httppop'?

kfox1111 commented 4 months ago

The PR has reached the level of a workable prototype. It seems to attest, and when I set agent_ttl to something very small, it seems to re-attest OK too.

It has very little error checking and no tests at the moment. Once we work through all the details, those things can be added.