spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0

Nested SPIRE Architecture: NestedA workload fails to invoke NestedB workload in one case. #5317

Open penghuazhou opened 1 month ago

penghuazhou commented 1 month ago

Steps to reproduce:

  1. Scale up a new Root Server pod; it will generate a new CA.
  2. Scale up a new NestedB Server pod; it will generate a new intermediate CA.
  3. Scale up the NestedB Agent; workload SVIDs should now be signed by the new CA.
  4. A NestedA workload gets an error when it invokes a NestedB workload.

Background knowledge:

  1. A new intermediate or root certificate is prepared at ttl/2 of the current one, and the new certificate is only activated at ttl/6.
  2. Preparing an intermediate certificate only succeeds once the root certificate has been synchronized to the nested server.
  3. The SPIRE agent synchronizes the trust bundle every 5 seconds.
  4. The SPIRE agent notifies workloads of trust bundle changes at intervals ranging from 5 seconds to 8 minutes.
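To make the timing in the background notes concrete, here is a rough Python sketch. It assumes that ttl/2 and ttl/6 refer to the remaining lifetime of the current CA, which the notes do not state explicitly, and it takes the 5 second and 8 minute figures at face value. The function names are illustrative and are not part of SPIRE.

```python
from datetime import datetime, timedelta, timezone

# Figures taken from the background notes; treated here as assumptions.
AGENT_BUNDLE_SYNC = timedelta(seconds=5)      # agent pulls the bundle every 5 s
WORKLOAD_NOTIFY_MAX = timedelta(minutes=8)    # slowest workload notification


def rotation_times(issued_at, ttl):
    """Return (prepare_at, activate_at) for a CA issued at `issued_at`.

    Assumption: the next CA is prepared when ttl/2 of the current CA's
    lifetime remains and activated when ttl/6 remains.
    """
    not_after = issued_at + ttl
    prepare_at = not_after - ttl / 2
    activate_at = not_after - ttl / 6
    return prepare_at, activate_at


def worst_case_propagation():
    """Longest time before a workload learns about a new bundle entry."""
    return AGENT_BUNDLE_SYNC + WORKLOAD_NOTIFY_MAX


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
prepare_at, activate_at = rotation_times(issued, timedelta(hours=24))
head_start = activate_at - prepare_at

print(prepare_at.isoformat())
print(activate_at.isoformat())
print(head_start > worst_case_propagation())
```

With a 24 hour TTL, a regularly rotated CA gets a head start of ttl/3 (8 hours) between preparation and activation, far longer than the roughly 8 minute worst-case propagation delay. A freshly scaled-up server starts signing with a brand-new CA right away, so it gets no such head start.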

MarcosDY commented 1 month ago

The force rotation feature may help you update the current bundle intermediates inside each nested SPIRE. It is still under development; you can track its status in the force rotation project. Original issue: https://github.com/spiffe/spire/issues/1934

penghuazhou commented 1 month ago

@MarcosDY The force rotation feature updates the current bundle intermediates inside each nested SPIRE, but that also takes several seconds. During that time, the CA key generated by the newly scaled-up root server may already have issued a new intermediate certificate to a nested server, and that intermediate certificate may already have issued a workload's SVID. If this workload communicates with workloads whose bundle has not yet been synchronized, TLS errors will occur.
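The race described in this comment can be illustrated with a toy model. This is only a sketch: real SVID validation walks an X.509 chain, whereas here each CA is reduced to a hypothetical string ID and a trust bundle to a set of those IDs.

```python
from dataclasses import dataclass, field


@dataclass
class Workload:
    """Toy workload: an SVID identified by its signing CA, plus a trust bundle."""
    name: str
    svid_ca: str                          # hypothetical ID of the CA the SVID chains to
    bundle: set = field(default_factory=set)

    def accepts(self, peer: "Workload") -> bool:
        # A peer is trusted only if its signing CA is already in our bundle.
        return peer.svid_ca in self.bundle


def handshake(a, b):
    """Mutual TLS succeeds only when each side trusts the other's CA."""
    return a.accepts(b) and b.accepts(a)


nested_a = Workload("nested-a", svid_ca="root-ca-old", bundle={"root-ca-old"})
# NestedB's SVID chains to the CA created by the newly scaled-up root server.
nested_b = Workload("nested-b", svid_ca="root-ca-new",
                    bundle={"root-ca-old", "root-ca-new"})

before_sync = handshake(nested_a, nested_b)   # NestedA has not seen root-ca-new yet
nested_a.bundle.add("root-ca-new")            # the bundle update reaches NestedA
after_sync = handshake(nested_a, nested_b)
print(before_sync, after_sync)
```

The handshake fails only in the window between the new CA's first use and the moment the slower side receives the updated bundle, which matches the intermittent TLS errors reported here.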

penghuazhou commented 1 month ago

I think there are two possible solutions. Which one does the community plan to adopt? I can submit a PR.

  1. When spire-server is scaled up, the new spire-server pod copies the CA from an old pod. After that, the new spire-server rotates its CA independently.
  2. Let the spire-server instances share a CA key. Only the spire-server that holds the lock can rotate the CA.

sorindumitru commented 1 month ago

I think the force rotation API by itself doesn't help, since it looks like you can only tell an existing server instance to prepare or rotate a CA. It would be good to have something (even within the force rotation APIs) that allows preparing a CA for use by a specific server instance at a later time. So you can:

  1. Prepare a CA for server instances, N+1 and N+2
  2. Wait for some amount of time for them to be propagated to all workloads
  3. Start instances N+1 and N+2 and have them use the prepared CA.
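The three steps above can be sketched as a small orchestration function. Everything here is hypothetical: `prepare_ca` and `start_instance` stand in for operations that SPIRE does not expose today, and the 485 second delay is simply the 5 seconds plus 8 minutes quoted earlier in the thread.

```python
import time


def scale_up_with_prepared_cas(instance_ids, prepare_ca, start_instance,
                               propagation_delay_s, sleep=time.sleep):
    """Run the prepare / wait / start sequence; all callables are hypothetical hooks."""
    # Step 1: prepare one CA per future instance so it lands in the bundle early.
    prepared = {instance_id: prepare_ca(instance_id) for instance_id in instance_ids}
    # Step 2: give the updated bundle time to reach every workload.
    sleep(propagation_delay_s)
    # Step 3: only now start the instances, each using its prepared CA.
    for instance_id, ca in prepared.items():
        start_instance(instance_id, ca)
    return prepared


events = []


def fake_prepare_ca(instance_id):
    events.append(("prepare", instance_id))
    return f"ca-for-{instance_id}"


def fake_start_instance(instance_id, ca):
    events.append(("start", instance_id, ca))


def fake_sleep(seconds):
    # Stub so the example runs instantly instead of waiting 485 seconds.
    events.append(("wait", seconds))


scale_up_with_prepared_cas(["N+1", "N+2"], fake_prepare_ca, fake_start_instance,
                           propagation_delay_s=485, sleep=fake_sleep)
print(events)
```

The point of the ordering is that no instance signs anything until the wait has elapsed, so every SVID it issues chains to a CA that workloads already trust.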

Alternatively, maybe this could be a CLI command that starts the new instance, runs it until its CA is prepared and activated, and then exits. That way you can run that new command as step 1 in the previous sequence.

evan2645 commented 1 month ago

Thanks for reporting this @penghuazhou, and @sorindumitru for jumping in.

We discussed this issue during SPIRE contributor sync today, and the consensus is that liveness and readiness checks in SPIRE should be solving this problem (but don't currently). When a new SPIRE Server boots at the root, readiness check should fail for ~some amount of time to allow the new root to propagate. After it's propagated, the readiness check can succeed and signing can begin.
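A minimal sketch of the readiness idea, assuming the server knows when its new CA was published to the bundle and uses a fixed propagation window. The `ReadinessGate` class and its fields are made up for illustration; this is not SPIRE's health check implementation, and the thread leaves open how long the window should be.

```python
from datetime import datetime, timedelta, timezone


class ReadinessGate:
    """Reports not-ready until a new CA has had time to propagate."""

    def __init__(self, ca_published_at, propagation_window):
        # Signing should not begin before this instant.
        self.ready_at = ca_published_at + propagation_window

    def is_ready(self, now):
        return now >= self.ready_at


boot = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Assumed window: 5 s agent sync plus the 8 min worst-case workload notification.
gate = ReadinessGate(boot, timedelta(minutes=8, seconds=5))

print(gate.is_ready(boot + timedelta(minutes=1)))
print(gate.is_ready(boot + timedelta(minutes=9)))
```

In a Kubernetes deployment, a readiness probe backed by a check like this would keep the new pod out of rotation until the window has passed.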

I think there's a couple gotchas that need to be figured out as part of this work:

I'll move this issue to the backlog as unscoped ... once we have answers to the above two questions, I think we'll better understand the scope and be ready to accept the change. Thank you for volunteering to work on this @penghuazhou! I will go ahead and assign it to you as well.

penghuazhou commented 1 month ago

Thanks, I'm glad to be able to participate.