szvincze opened this issue 2 days ago
I have experienced the same issue when upgrading the SPIRE deployment through Argo CD. The `ClusterSPIFFEID` pops up a sync error:

Background: I have 3 replicas of spire-server running as a StatefulSet. The spire-controller-manager (v0.6.0) runs in the same pod as spire-server.

The workaround in my case is to re-create the `ClusterSPIFFEID` resource through Argo CD.
Hi @yongdu, thanks for sharing your experience. The workaround you mentioned is also valid, but in our case it is not a way forward because the custom resource is created and managed by Helm.
Are you using the https://github.com/spiffe/helm-charts-hardened chart or a different one?
A different one, an integration chart.
Have a look at the hardened chart then. It has logic in place to disable webhook enforcement during the upgrade and then re-enable it at the end.
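For reference, the underlying technique is to flip the webhook's `failurePolicy` to `Ignore` before the upgrade and back to `Fail` afterwards. Here is a minimal client-go sketch of that idea, not the hardened chart's actual mechanism; the `ValidatingWebhookConfiguration` name below is an assumption, check what your install creates:

```go
package main

import (
	"context"
	"flag"
	"log"
	"path/filepath"

	admissionv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := flag.String("kubeconfig", filepath.Join(homedir.HomeDir(), ".kube", "config"), "path to kubeconfig")
	// "spire-controller-manager-webhook" is an assumed default name.
	name := flag.String("webhook", "spire-controller-manager-webhook", "ValidatingWebhookConfiguration name")
	policy := flag.String("policy", "Ignore", "failurePolicy to set: Ignore or Fail")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	whc, err := cs.AdmissionregistrationV1().ValidatingWebhookConfigurations().Get(context.Background(), *name, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	// Set the requested failurePolicy on every webhook in the configuration.
	p := admissionv1.FailurePolicyType(*policy)
	for i := range whc.Webhooks {
		whc.Webhooks[i].FailurePolicy = &p
	}
	if _, err := cs.AdmissionregistrationV1().ValidatingWebhookConfigurations().Update(context.Background(), whc, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("set failurePolicy=%s on %s", p, *name)
}
```

Run it with `-policy=Ignore` before `helm upgrade` and `-policy=Fail` again once the new pods are healthy.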
Please assign this to me. I'd like to investigate the webhook issues further.
We have observed an issue during upgrade. When we run more than one replica of the `spire-server` pod that includes a `spire-controller-manager` container besides `spire-server`, and the deployment has been running for more than 24 hours, the `helm upgrade` fails when it tries to patch the `ClusterSPIFFEID` custom resource:

After some investigation I managed to reproduce the issue manually too. I changed the `x509SVIDTTL` in the code to 10 minutes to be able to test it faster, then simply created a `ClusterSPIFFEID` after the deployment had been idle for a while. I also added a printout to the log that shows when the webhook certificate was rotated, along with its expiration timestamp.
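In case it helps anyone reproduce the observation, here is a minimal sketch of that kind of printout, assuming the webhook serves TLS through a `GetCertificate` callback; the wiring, port, and file names below are illustrative, not the actual spire-controller-manager code:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
)

// wrapGetCertificate decorates an existing GetCertificate callback so that
// every time a certificate is served, its expiration timestamp is logged.
func wrapGetCertificate(inner func(*tls.ClientHelloInfo) (*tls.Certificate, error)) func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
		cert, err := inner(hello)
		if err != nil || cert == nil {
			return cert, err
		}
		// Parse the leaf if it has not been populated yet.
		leaf := cert.Leaf
		if leaf == nil && len(cert.Certificate) > 0 {
			leaf, _ = x509.ParseCertificate(cert.Certificate[0])
		}
		if leaf != nil {
			log.Printf("webhook serving certificate, expires %s", leaf.NotAfter.UTC())
		}
		return cert, err
	}
}

func main() {
	// getSVIDCertificate stands in for however the controller manager
	// fetches its current webhook SVID; purely illustrative.
	getSVIDCertificate := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		cert, err := tls.LoadX509KeyPair("tls.crt", "tls.key")
		return &cert, err
	}
	srv := &http.Server{
		Addr:      ":9443", // assumed webhook port
		TLSConfig: &tls.Config{GetCertificate: wrapGetCertificate(getSVIDCertificate)},
	}
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```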
On the active `spire-controller-manager` the SVID is rotated properly, but on the standby the certificate minted during the initialization phase remains in use. By itself that would not be a problem, but somehow during operation the standby `spire-controller-manager`'s webhook also receives requests (like my commands) even though its certificate has expired. The printout confirmed that the certificate in question is the standby `spire-controller-manager`'s (see the expiration timestamp below). In this case the third attempt was successful; sometimes the second one is. Below you can see TLS handshake errors in the log of the standby `spire-controller-manager` at the same time the reproduction succeeded.

If I sent requests continuously, I could not reproduce the issue, but when I waited some time and then repeated the same command, it threw the error. I reproduced it with v0.6.0 and with the latest version too.
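The expired certificate can also be seen directly, without a code change, with a small probe that performs a TLS handshake against each replica's webhook endpoint and prints the expiry of whatever certificate is presented. A sketch, assuming you `kubectl port-forward` each pod's webhook port to a local address (9443 is an assumption, check your configuration):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
	"time"
)

func main() {
	// Each argument is the host:port of one replica's webhook endpoint,
	// e.g. port-forwarded to localhost:9443, localhost:9444, ...
	for _, addr := range os.Args[1:] {
		// InsecureSkipVerify: we only want to inspect the presented
		// certificate, not authenticate the peer.
		conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
		if err != nil {
			fmt.Printf("%s: handshake failed: %v\n", addr, err)
			continue
		}
		if certs := conn.ConnectionState().PeerCertificates; len(certs) > 0 {
			leaf := certs[0]
			fmt.Printf("%s: certificate expires %s (expired now: %v)\n",
				addr, leaf.NotAfter.UTC(), time.Now().After(leaf.NotAfter))
		}
		conn.Close()
	}
}
```

Once the deployment has been idle past the SVID TTL, the standby replica should report an already-expired `NotAfter` while the active one reports a fresh certificate.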
Right now we use a workaround: before upgrading (within a day) we restart the `spire-controller-manager` container, which mints a new certificate for the webhook, and then the upgrade succeeds.

Do you have any idea how it could be fixed permanently?