Open · njtran opened this issue 2 years ago
Typically, Knative webhooks go through the following steps on startup:
- the replicas compete for leader election
- the leader reconciles the certificate secret (creating a self-signed CA and serving cert if needed)
- the CA bundle is injected into the webhook configurations
- each replica picks up the secret and starts serving TLS with it

For this flow, the webhook needs to be constructed with the certificate controller in addition to the defaulting/validating admission controllers.
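For illustration, here is a minimal sketch of that wiring, assuming knative.dev/pkg's sharedmain and placeholder service/secret names (exact APIs vary across knative.dev/pkg versions, so treat this as a sketch rather than the actual Karpenter code):

package main

import (
    "knative.dev/pkg/injection/sharedmain"
    "knative.dev/pkg/signals"
    "knative.dev/pkg/webhook"
    "knative.dev/pkg/webhook/certificates"
)

func main() {
    // ServiceName, SecretName, and Port are placeholders for whatever your chart installs.
    ctx := webhook.WithOptions(signals.NewContext(), webhook.Options{
        ServiceName: "my-webhook",
        SecretName:  "my-webhook-certs",
        Port:        8443,
    })

    // certificates.NewController is the leader-elected controller that reconciles the
    // self-signed CA and serving certificate into SecretName; the defaulting/validating
    // admission controller constructors (e.g. built with webhook/resourcesemantics/defaulting
    // and /validation) would be passed alongside it.
    sharedmain.MainWithContext(ctx, "webhook",
        certificates.NewController,
    )
}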
If you're getting "bad certificate" errors, I'm curious what cert the webhook is presenting and what's defined in the CA bundle of the configured webhooks (i.e. the ValidatingWebhookConfiguration and MutatingWebhookConfiguration).
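One way to compare the two, using placeholder names for the webhook configuration, service, and namespace:

$ kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -subject -issuer -dates
$ openssl s_client -connect <webhook-service>.<namespace>.svc:443 -showcerts </dev/null | openssl x509 -noout -subject -issuer -dates   # run from inside the cluster

The cert the webhook serves should be signed by the CA in the caBundle; if they disagree, the API server rejects the connection.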
The typical misconfiguration we see is a liveness probe timeout on the webhook deployment that is too low: the container never gets a chance to become the leader and create the certificate because K8s kills it first.
ie. https://github.com/vmware-tanzu/sources-for-knative/issues/356
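A quick way to check, with placeholder names for the namespace and webhook deployment:

$ kubectl -n <namespace> get deploy <webhook-deployment> -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'

If the container is restarted before leader election completes, raising initialDelaySeconds or failureThreshold on that probe is a reasonable first experiment.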
It's interesting to see "bad certificate" - that's a new one.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
This issue or pull request is stale because it has been open for 90 days with no activity.
This bot triages issues and PRs according to the following rules:
- After 90 days of inactivity, lifecycle/stale is applied
- After 30 days of inactivity since lifecycle/stale was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
/lifecycle stale
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen /lifecycle frozen
@dprotaso: Reopened this issue.
@njtran just following up to see if this is still an issue. I haven't had time to dig into this, but it might be a good-first-issue.
Hey @dprotaso, thanks for following up! I hadn't added more to this just because in our later releases we haven't seen this problem occur. I don't have a reproduction I can give you, but I know that I saw it not uncommonly in our earlier releases. Happy to dive into this with you if you want?
Did anything change in your later releases?
Yep. Here are the webhook definitions for what I believe had the issue:
And here are the webhook definitions now, where I haven't seen the issue in a while.
Sometimes the issue has been caused by an old, unreachable webhook configuration left around due to Argo CR syncing.
Maybe you see something different though?
We think it might be something to do with Argo syncing old versions of the webhooks. Have you heard of anything like this?
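If that's the suspicion, one way to check is to list the webhook configurations and look for stale entries (names below are illustrative):

$ kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
$ kubectl get validatingwebhookconfiguration <name> -o yaml

Any webhook whose clientConfig.service points at a service or namespace that no longer exists, or whose caBundle doesn't match the cert the current webhook serves, is a candidate for the "bad certificate" / unreachable errors.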
@dprotaso Is the below error a related issue?
Karpenter: v0.27.1 EKS: v1.25.6-eks-232056e ArgoCD: v2.6.7+5bcd846
ERROR webhook.WebhookCertificates Reconcile error {"commit": "7131be2-dirty", "knative.dev/traceid": "ce9f2f51-3ee5-4fdc-a1ce-19caa7807db5", "knative.dev/key": "karpenter/karpenter-cert", "duration": "45.405825ms", "error": "Operation cannot be fulfilled on secrets \"karpenter-cert\": the object has been modified; please apply your changes to the latest version and try again"}
I'm pretty sure I'm seeing this too:
Karpenter: v0.27.6 EKS: v1.25.12-eks-2d98532 ArgoCD: v2.8.3+77556d9
I'm seeing this while building out this cluster -- everything is "new" (no old versions of eks, karpenter, etc):
$ kubectl logs karpenter-provisioners-5cd99796cf-lrnbs -f
{"level":"info","ts":1695796080.7865567,"logger":"fallback","caller":"injection/injection.go:63","msg":"Starting informers..."}
2023/09/27 06:28:03 Registering 2 clients
2023/09/27 06:28:03 Registering 2 informer factories
2023/09/27 06:28:03 Registering 3 informers
2023/09/27 06:28:03 Registering 5 controllers
{"level":"INFO","time":"2023-09-27T06:28:03.864Z","logger":"controller","message":"Starting server","commit":"5a2fe84-dirty","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"INFO","time":"2023-09-27T06:28:03.867Z","logger":"controller","message":"Starting server","commit":"5a2fe84-dirty","kind":"health probe","addr":"[::]:8081"}
I0927 06:28:03.970155 1 leaderelection.go:248] attempting to acquire leader lease infra/karpenter-leader-election...
{"level":"INFO","time":"2023-09-27T06:28:04.009Z","logger":"controller","message":"Starting informers...","commit":"5a2fe84-dirty"}
2023/09/27 06:28:05 http: TLS handshake error from 10.42.172.163:50520: remote error: tls: bad certificate
...
Hoping to get some insight on the following issue. Happy to hop on a call or Slack huddle in the Knative Slack to give more info.
Expected Behavior
The webhook should work without requiring a non-deterministic number of container restarts.
Actual Behavior
We use defaulting and validating webhooks for Karpenter CRDs. When first installing Karpenter, we get the following error in the webhook container logs. Even after the failure, the webhook container stays ready. The issue is sometimes resolved by restarting the container a non-deterministic number of times.
The webhook is further shown to be broken when it blocks creation of the CRD because the certificate is signed by an unknown authority.
Here is where this webhook was created in code; we started seeing this issue after this change.
Steps to Reproduce the Problem
Additional Info