spiffe / spire-controller-manager

Kubernetes controller manager that reconciles workload registration and federation relationships.

Upgrade failure because the webhook certificate expired #450

Open szvincze opened 2 days ago

szvincze commented 2 days ago

We have observed an issue during upgrade. When we run more than one replica of the spire-server pod (which includes the spire-controller-manager container alongside spire-server) and the deployment has been running for more than 24 hours, the helm upgrade fails when it tries to patch the ClusterSPIFFEID custom resource:

[2024-11-27T17:00:26.181Z] Error: UPGRADE FAILED: cannot patch "spire-controller-manager" with kind ClusterSPIFFEID: Internal error occurred: failed calling webhook "spire-controller-manager-webhook-service.spire.svc": failed to call webhook: Post "https://spire-controller-manager-webhook-service.spire.svc:443/validate-spire-spiffe-io-v1alpha1-clusterspiffeid?timeout=10s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-11-27T17:00:25Z is after 2024-11-26T17:38:17Z

After some investigation I managed to reproduce the issue manually as well. I changed the x509SVIDTTL in the code to 10 minutes to be able to test faster, then simply created a ClusterSPIFFEID after the deployment had been idle for a while.
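
For reference, a minimal ClusterSPIFFEID manifest of the kind used in this reproduction could look like the sketch below. The actual clusterspiffeid-patch.yaml is not included here; the resource name, SPIFFE ID template, and pod selector are purely illustrative.

apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: example                # illustrative name, not the one from the reproduction
spec:
  # documented template placeholders for trust domain, namespace, and service account
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      app: example             # illustrative selector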

I also added a log statement that prints when the webhook certificate was rotated and when it expires.

On the active spire-controller-manager the SVID is rotated properly, but the standby keeps the certificate that was minted during its initialization phase. This would not be a problem in itself, but during operation the standby spire-controller-manager's webhook also receives requests (such as my commands), and by then its certificate has expired. The log statement confirmed that the certificate in question belongs to the standby spire-controller-manager (see the expiration timestamp below).

Rotated at: 2024-12-02 10:14:28.780845864 +0000 UTC m=+0.026239026
Expires at: 2024-12-02 10:24:28 +0000 UTC
$ kubectl apply -f clusterspiffeid-patch.yaml
Error from server (InternalError): error when creating "clusterspiffeid-patch.yaml": Internal error occurred: failed calling webhook "vclusterspiffeid.kb.io": failed to call webhook: Post "https://spire-controller-manager-webhook-service.spire.svc:443/validate-spire-spiffe-io-v1alpha1-clusterspiffeid?timeout=10s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-12-02T10:46:28Z is after 2024-12-02T10:24:28Z

$ kubectl apply -f clusterspiffeid-patch.yaml
Error from server (InternalError): error when creating "clusterspiffeid-patch.yaml": Internal error occurred: failed calling webhook "vclusterspiffeid.kb.io": failed to call webhook: Post "https://spire-controller-manager-webhook-service.spire.svc:443/validate-spire-spiffe-io-v1alpha1-clusterspiffeid?timeout=10s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-12-02T10:46:38Z is after 2024-12-02T10:24:28Z

In this case the third attempt was successful; sometimes the second one succeeds. Below you can see TLS handshake errors in the log of the standby spire-controller-manager at the same time the reproduction succeeded.

...
2024-12-02T10:14:28Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterSPIFFEID", "path": "/validate-spire-spiffe-io-v1alpha1-clusterspiffeid"}
2024-12-02T10:14:28Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/validate-spire-spiffe-io-v1alpha1-clusterspiffeid"}
2024-12-02T10:14:28Z    INFO    setup   starting manager
2024-12-02T10:14:28Z    INFO    controller-runtime.metrics  Starting metrics server
2024-12-02T10:14:28Z    INFO    starting server {"name": "health probe", "addr": "[::]:8083"}
2024-12-02T10:14:28Z    INFO    controller-runtime.metrics  Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-12-02T10:14:28Z    INFO    controller-runtime.webhook  Starting webhook server
2024-12-02T10:14:28Z    INFO    controller-runtime.certwatcher  Updated current TLS certificate
2024-12-02T10:14:28Z    INFO    controller-runtime.webhook  Serving webhook server  {"host": "", "port": 9443}
2024-12-02T10:14:28Z    INFO    controller-runtime.certwatcher  Starting certificate watcher
I1202 10:14:28.884132      67 leaderelection.go:254] attempting to acquire leader lease spire/98c9c988.spiffe.io...
2024/12/02 10:46:28 http: TLS handshake error from 172.18.0.3:64383: remote error: tls: bad certificate
2024/12/02 10:46:38 http: TLS handshake error from 172.18.0.3:61149: remote error: tls: bad certificate
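
To confirm which replica is serving the stale certificate, each pod's webhook endpoint can be checked directly. The sketch below assumes a spire-server StatefulSet with pods spire-server-0 and spire-server-1 in the spire namespace; the 9443 webhook port comes from the "Serving webhook server" log line above.

for pod in spire-server-0 spire-server-1; do
  # forward the pod's webhook port to a local port and read the served certificate
  kubectl -n spire port-forward "pod/$pod" 19443:9443 >/dev/null &
  pf=$!
  sleep 2
  echo "== $pod =="
  openssl s_client -connect localhost:19443 </dev/null 2>/dev/null | openssl x509 -noout -enddate
  kill "$pf"
done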

If I send requests continuously, I have not managed to reproduce the issue, but when I wait for some time and then repeat the same command, it throws the error. I reproduced it with v0.6.0 and with the latest version too.

Right now we use a workaround: shortly before upgrading (within a day), we restart the spire-controller-manager container, which mints a new certificate for the webhook, and then the upgrade succeeds.
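
For anyone applying the same workaround, one way to do the restart (assuming the usual layout of a spire-server StatefulSet in the spire namespace; adjust the names to your deployment) is:

# restarting the pods re-mints the webhook serving certificate on every replica
kubectl -n spire rollout restart statefulset spire-server
kubectl -n spire rollout status statefulset spire-server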

Do you have any idea how it could be fixed permanently?

yongdu commented 15 hours ago

I have experienced the same issue when upgrading the spire deployment through Argo CD. The ClusterSPIFFEID resource shows a sync error:

(screenshot of the Argo CD sync error attached in the original comment)

Background: I have 3 replicas of spire-server running as a StatefulSet. The spire-controller-manager (v0.6.0) runs in the same pod as spire-server.

The workaround in my case is to re-create the ClusterSPIFFEID resource through Argo CD.
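
From the CLI the same re-create would look roughly like this (the resource name and file name are assumptions; note that the apply still goes through the validating webhook, so it may need a retry, as shown in the output above):

kubectl delete clusterspiffeid example
kubectl apply -f clusterspiffeid.yaml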

szvincze commented 15 hours ago

Hi @yongdu, thanks for sharing your experience. The workaround you mentioned is also valid, but in our case it is not a way forward because the custom resource is created and managed by helm.

kfox1111 commented 7 hours ago

Are you using the https://github.com/spiffe/helm-charts-hardened chart or a different one?

szvincze commented 5 hours ago

A different one, an integration chart.

kfox1111 commented 5 hours ago

Have a look at the hardened chart then. It has logic in place to disable webhook enforcement during the upgrade and re-enable it at the end.
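
For reference, the effect described above can also be approximated manually by relaxing the webhook's failure policy around the upgrade. The ValidatingWebhookConfiguration name below is an assumption; check yours with kubectl get validatingwebhookconfigurations.

# before the upgrade: do not block API requests if the webhook cannot be reached
kubectl patch validatingwebhookconfiguration spire-controller-manager-webhook \
  --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'

# after the upgrade: restore strict enforcement
kubectl patch validatingwebhookconfiguration spire-controller-manager-webhook \
  --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Fail"}]'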

faisal-memon commented 4 hours ago

Please assign this to me. I'd like to investigate the webhook issues further.