mdwhitley opened 2 days ago
We had another cluster start a BOM update tonight. I had already put the 1.15 upgrade + timeout changes in place, and as soon as the update began on the cluster, all triggers began flipping from `Ready` to `ConsumerBinding`. With the 10s timeout, the error returned from `kafka-controller`:
```json
{"level":"error","ts":"2024-11-21T00:12:02.752Z","logger":"kafka-broker-controller","caller":"controller/controller.go:564","msg":"Reconcile error","commit":"7092bb9-dirty","knative.dev/pod":"kafka-controller-7b9d5f8f95-qhdxn","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumer.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.Consumer","knative.dev/traceid":"a32f1de7-3e78-478b-8ef3-d8b5800b1980","knative.dev/key":"conversation/839dd499-424e-40f3-9dff-bff69e9c6a2b-n4cq4","duration":5.197144629,"error":"failed to bind resource to pod: Internal error occurred: failed calling webhook \"pods.defaulting.webhook.kafka.eventing.knative.dev\": failed to call webhook: Post \"https://kafka-webhook-eventing.knative-eventing.svc:443/pods-defaulting?timeout=10s\": http: server gave HTTP response to HTTPS client","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20240716082220-4355f0c73608/controller/controller.go:564\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20240716082220-4355f0c73608/controller/controller.go:541\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20240716082220-4355f0c73608/controller/controller.go:489"}
```
After around 20 minutes, all of our triggers were in the unavailable state. No quota issues in this case: just a working 1.15 installation one minute, then failing with the above error once the BOM update started.
@mdwhitley maybe you get `http: server gave HTTP response to HTTPS client` because the webhook server secret `kafka-webhook-eventing-certs` is not present.
That secret is populated by Knative to serve HTTPS requests for the defaulting webhook [1] and is part of the released artifacts [2], so it is expected to exist; when it is not present, the server defaults to serving no certs [3].
[1] https://github.com/knative/pkg/blob/a7fd9b10bb9febf537db69284723a5337adc0a50/webhook/certificates/certificates.go#L65-L110
[2] https://github.com/knative-extensions/eventing-kafka-broker/blob/main/control-plane/config/eventing-kafka-broker/200-webhook/400-webhook-secret.yaml
[3] https://github.com/knative/pkg/blob/a7fd9b10bb9febf537db69284723a5337adc0a50/webhook/webhook.go#L194-L198
Can you confirm the state of the `kafka-webhook-eventing-certs` secret when you get `http: server gave HTTP response to HTTPS client`?
Another part that we could improve is to define a PodDisruptionBudget that forces at least 1 (or more) webhook instance to stay up at any given point, so that webhook unavailability never causes a pod to come up without the injected volume definition.
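As a rough sketch, such a PodDisruptionBudget could look like the following. Note this is an assumption-laden illustration, not a proposed manifest: the label selector and metadata below are guesses and would need to match the actual `kafka-webhook-eventing` Deployment's labels.

```yaml
# Hypothetical sketch: keep at least one kafka-webhook-eventing pod
# available during voluntary disruptions (e.g. node drains during a BOM update).
# The matchLabels below are assumptions; check the real Deployment's labels.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-webhook-eventing
  namespace: knative-eventing
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kafka-webhook-eventing
```

A PDB only guards against voluntary evictions (drains), so it would help with rolling node maintenance but not with involuntary node failures.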
@pierDipi The secret is present and does not appear to have been recreated since our original deployment:

```
$ k get secrets -n knative-eventing
NAME                                          TYPE                                  DATA   AGE
eventing-controller-token-fstts               kubernetes.io/service-account-token   3      596d
eventing-webhook-certs                        Opaque                                3      596d
eventing-webhook-token-bdn52                  kubernetes.io/service-account-token   3      596d
kafka-broker-secret                           Opaque                                4      596d
kafka-controller-token-cch7j                  kubernetes.io/service-account-token   3      596d
kafka-webhook-eventing-certs                  Opaque                                3      596d
kafka-webhook-eventing-token-rnfgg            kubernetes.io/service-account-token   3      596d
knative-eventing-alert-secret                 Opaque                                2      596d
knative-kafka-broker-data-plane-token-dl588   kubernetes.io/service-account-token   3      596d
pingsource-mt-adapter-token-92f4d             kubernetes.io/service-account-token   3      596d
```
If you try to do a TLS handshake with the kafka-webhook-eventing server, does it succeed? Do you see any relevant logs?
At present, yes. Checking from one of our dead letter pods that has `openssl`:

```
$ k exec {pod} -- bash -c "openssl s_client -connect kafka-webhook-eventing.knative-eventing.svc:443 -showcerts"
Connecting to 172.21.71.107
CONNECTED(00000003)
depth=0 O=knative.dev, CN=kafka-webhook-eventing.knative-eventing.svc
verify error:num=18:self-signed certificate
verify return:1
```
I did not have debug logging enabled on the cluster from last night though, and no errors presented in the webhook pods.
The BOM updates in question are upgrades from v1.28.14 to v1.28.15. In cases where we had a fully working 1.15 install, as soon as master nodes began maintenance, that is when triggers went down and webhook HTTP errors began. In our clusters that have had a BOM update without disruption to Knative (dev/stg), those were upgraded from v1.28.15 to v1.29.10.
I've pulled the kafka-webhook logs and can observe that most traffic to the pods stops around the time the BOM update begins. The incident begins around 18:56. No more `remote admission controller` entries occur when everything goes down.
What is a BOM update? In terms of operations, what is it actually doing?

> As soon as master nodes began maintenance, that is when triggers went down

When you say triggers went down, what does that actually mean? Stopped sending events, or the status went to not ready? Or both?
> what is BOM update? In terms of operations, etc, what is that doing?

BOM updates for our cluster upgrade the Kubernetes version (v1.28.15) as well as apply OS updates and other vulnerability fixes across the nodes. It starts with master nodes, then edge nodes, then worker nodes.

> when you say, triggers went down, what does that actually mean? Stopped sending events or the status went to not ready? or both?

Triggers report `Ready=False` and `Reason=BindInProgress`. While in this state, events continue to flow through existing dispatcher pods. Once one of those pods goes down (moved due to node maintenance), it gets stuck in `Terminating` state because its finalizers hang on `kafka-controller`, which is hitting the webhook errors. Inside the terminating dispatcher pod all threads have exited, so no more work is being done, and all triggers handled by that pod are completely offline and will not recover.
Describe the bug
I have observed this issue in our production clusters under both 1.14 and 1.15 releases.

The initial issue was discovered during our upgrade from 1.14 => 1.15. Rollouts hung, and we discovered that the `knative-eventing` namespace resource quota had been incorrectly modified during a region-wide update; both the ConfigMap and Secret counts were beyond the quota limits as a result. In 1 of the 2 clusters that hit this issue, a cluster-wide BOM update was also in progress. Once the quota issue was corrected, the cluster that was not undergoing a BOM update came back up successfully, while the cluster with the BOM update running did not. The update was immediately paused, but services did not come back online afterward. The primary impact was 1/2 dispatcher pods stuck in Terminating state, with failed triggers reporting errors (screenshots omitted).
We have experienced webhook timeouts before related to cluster BOM updates, including some that have caused unrelated operators in non-Knative namespaces to fail. A number of our configurations are modified to run with `failurePolicy: Ignore` as a result. `pods.defaulting.webhook.kafka.eventing.knative.dev` was still left with `Fail`, and changing it to `Ignore` resulted in the dispatcher statefulsets not coming up: the `kafka-broker-dispatcher-1` pod would not start due to the invalid statefulset template and the lack of kafka-controller doing its usual defaulting. Within kafka-controller there were error logs reporting failures due to not being able to communicate with the `kafka-broker-dispatcher-1` pod, which didn't exist anymore. We tried various deployment/statefulset restarts, which only resulted in the other working dispatcher pod going down and not coming back up, putting all triggers in failed states.
We tried deleting all ConfigMap/Deployment/StatefulSet resources and doing another 1.15 deployment which resulted in the same stuck behavior. We also tried a downgrade to 1.14 with the same results.
Mitigations were put in place to manually define the `contract-resources` volume and route all triggers into the single dispatcher pod to get eventing limping along. This has allowed us to finish running BOM updates on the cluster and keep normal operations during this time.

Not long after we mitigated the previous issue, BOM updates were started in additional clusters with stable/working 1.14 knative installations. Unfortunately, those clusters suffered the same quota issue, so the `knative-eventing` namespace was over quota for ConfigMaps/Secrets, and both experienced partial degradation with dispatcher pods stuck in Terminating and kafka-controller not properly starting up new ones. The same mitigations were put in place, though not ideal, as processing through 300+ triggers took around 2 hours to fully come back online.

To try and work around the webhook timeout, I increased the `pods.defaulting.webhook.kafka.eventing.knative.dev` timeout to 10s. When attempting an upgrade from 1.14 => 1.15 (with config changes) on the original cluster, everything rolled out as expected and came back up. I had the same result when upgrading one of the other impacted 1.14 clusters as well. Both of these clusters were post-BOM update. Our dev/pstg clusters also received a BOM update during this time (both had the latest mentioned 1.15 changes) and both maintained expected availability the entire time.

My working hypothesis is that the quota issue combined with any pod movements/restarts results in this type of "stuck" behavior, which makes sense. This, combined with a cluster BOM update, which on its own can cause API timeouts, got us into a state which could not be automatically recovered once the quota issue was resolved.
Expected behavior
To Reproduce
Potentially:
1. Initiate a full cluster upgrade/cycle process.
2. Watch for when dispatcher pods are moved off nodes that go under maintenance.
Knative release version 1.14 + 1.15
Additional context