knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Webhook Flake on Upgrade #15145

Open dprotaso opened 2 months ago

dprotaso commented 2 months ago

I wonder if we are clearing certificates?

upgrade.go:98: Failed to create Service: Internal error occurred: failed calling webhook "webhook.serving.knative.dev": failed to call webhook: Post "https://webhook.db15bd17-dfe9-41c9-9dfb-dd8115ecfe22.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "webhook.db15bd17-dfe9-41c9-9dfb-dd8115ecfe22.svc")

Originally posted by @dprotaso in https://github.com/knative/serving/issues/15141#issuecomment-2066443436

skonto commented 1 month ago

@dprotaso isn't it true that the certificate reconciler fills in the secret with a certificate based on the webhook's service name, and that during the upgrade we overwrite the secret with empty content? I suspect the new webhook controller loads the cert from the secret before the reconciler has filled it in, hence the error. I think we need to keep the secret around and not update it, or wait for the webhook, or something along those lines. I am also wondering whether, instead of just presenting the certificate with GetCertificate, we should link readiness to proper certificate content (this happens elsewhere too, tbh: https://github.com/cert-manager/cert-manager/issues/3045).
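
To make the readiness idea concrete, here is a rough sketch of what gating readiness on certificate content could look like. It is not the existing Knative webhook code: `certReady`, `readinessHandler`, and the `load` accessor are hypothetical names, and the check only confirms that the secret holds a parseable, matching key pair. It does not verify that the `caBundle` in the webhook configuration matches, which is what the x509 error above ultimately complains about.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net/http"
)

// certReady reports whether the PEM data read from the webhook secret can
// actually be served. An empty or half-written secret fails this check.
func certReady(certPEM, keyPEM []byte) error {
	if len(certPEM) == 0 || len(keyPEM) == 0 {
		return errors.New("webhook secret has no certificate or key yet")
	}
	// X509KeyPair parses both blocks and checks that the key matches the cert.
	if _, err := tls.X509KeyPair(certPEM, keyPEM); err != nil {
		return fmt.Errorf("webhook secret does not hold a usable key pair: %w", err)
	}
	return nil
}

// readinessHandler answers 503 until load returns a usable key pair.
// load is a hypothetical accessor for the secret's current contents
// (assumption: something like an informer-backed lookup would sit behind it).
func readinessHandler(load func() (certPEM, keyPEM []byte)) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if err := certReady(load()); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

With something like this wired to the readiness probe, a webhook pod that started against an emptied secret would stay out of the Service endpoints until the reconciler has written the certificate back, instead of accepting TLS connections it cannot serve correctly.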