knative-extensions / eventing-kafka-broker

Alternate Kafka Broker implementation.
Apache License 2.0

kafka-controller unable to bring up dispatcher pods after hitting quota issues during a cluster BOM update #4168

Open mdwhitley opened 2 days ago

mdwhitley commented 2 days ago

Describe the bug I have observed this issue in our production clusters under both 1.14 and 1.15 releases.

The initial issue was discovered during our upgrade from 1.14 => 1.15. Rollouts hung, and we then discovered that the knative-eventing namespace resource quota had been incorrectly modified during a region-wide update; both the ConfigMap and Secret resources were beyond the quota limits as a result. In 1 of the 2 clusters that hit this issue, a cluster-wide BOM update was also in progress. Once the quota issue was corrected, the cluster that was not undergoing a BOM update came back up successfully while the other did not.

The cluster that did not come back up was the one with the BOM update running. The update was immediately paused, but services did not come back online afterward. The primary impact was 1 of 2 dispatcher pods stuck in Terminating state:

kafka-broker-dispatcher-0                1/1     Running       0          15d
kafka-broker-dispatcher-1                1/1     Terminating   0          15d

with failed triggers reporting

failed to bind resource to pod: Internal error occurred: failed calling webhook "pods.defaulting.webhook.kafka.eventing.knative.dev": failed to call webhook: Post "https://kafka-webhook-eventing.knative-eventing.svc:443/pods-defaulting?timeout=2s": context deadline exceeded

We have experienced webhook timeouts before related to cluster BOM updates, including some that caused unrelated operators in non-Knative namespaces to fail. As a result, a number of our configurations are modified to run with failurePolicy: Ignore. pods.defaulting.webhook.kafka.eventing.knative.dev was still set to Fail, and changing it to Ignore resulted in the dispatcher StatefulSets not coming up because of

Warning  FailedCreate  66s (x15 over 2m47s)  statefulset-controller  create Pod kafka-broker-dispatcher-1 in StatefulSet kafka-broker-dispatcher failed error: Pod "kafka-broker-dispatcher-1" is invalid: spec.containers[0].volumeMounts[1].name: Not found: "contract-resources"

The kafka-broker-dispatcher-1 pod would not start due to the invalid StatefulSet template, and kafka-controller was not reconciling it back to a valid state. Within kafka-controller there were error logs reporting failures to communicate with the kafka-broker-dispatcher-1 pod, which no longer existed.
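For reference, the failurePolicy change described above can be applied with a JSON patch along these lines (a sketch only; it assumes the MutatingWebhookConfiguration resource shares the webhook's name and that the webhook entry is at index 0):

```shell
# Flip the Kafka pods-defaulting webhook from Fail to Ignore so that
# webhook timeouts no longer block pod creation outright.
kubectl patch mutatingwebhookconfiguration \
  pods.defaulting.webhook.kafka.eventing.knative.dev \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
```

Note that with Ignore, pods are admitted without the webhook's mutation, which is exactly how the pod spec ends up missing the contract-resources volume.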

We tried various deployment/statefulset restarts, which only caused the other, working dispatcher pod to go down and not come back up, putting all triggers into failed states.

We tried deleting all ConfigMap/Deployment/StatefulSet resources and doing another 1.15 deployment which resulted in the same stuck behavior. We also tried a downgrade to 1.14 with the same results.

Mitigations were put in place to manually define contract-resources volume and route all triggers into the single dispatcher pod to get eventing limping along. This has allowed us to finish running BOM updates on the cluster and keep normal operations during this time.
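The manual mitigation amounted to re-adding the volume that the defaulting webhook would normally inject into the pod spec. Roughly (a sketch only; the backing ConfigMap name is an assumption and should be copied from a healthy dispatcher pod spec):

```yaml
# Patched into the kafka-broker-dispatcher pod template so the spec is
# valid even when the pods-defaulting webhook never runs.
volumes:
  - name: contract-resources
    configMap:
      name: kafka-broker-brokers-triggers  # assumed; verify against a working pod
```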

Not long after we mitigated the previous issue, BOM updates were started in additional clusters with stable/working 1.14 Knative installations. Unfortunately, those clusters suffered the same quota issue, leaving the knative-eventing namespace over quota for ConfigMaps/Secrets, and both experienced partial degradation with dispatcher pods stuck in Terminating and kafka-controller not properly starting up new ones. The same mitigations were put in place, though they are not ideal: processing through 300+ triggers took around 2 hours to fully come back online.

To try to work around the webhook timeout, I increased the pods.defaulting.webhook.kafka.eventing.knative.dev timeout to 10s. When attempting an upgrade from 1.14 => 1.15 (with the config changes) on the original cluster, everything rolled out as expected and came back up. I had the same result when upgrading one of the other impacted 1.14 clusters. Both of these clusters were post-BOM update. Our dev/pstg clusters also received a BOM update during this time (both had the latest 1.15 changes mentioned above) and maintained expected availability the entire time.
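The timeout bump amounts to a patch like this (a sketch; it assumes the MutatingWebhookConfiguration resource shares the webhook's name and that the webhook entry is at index 0):

```shell
# Give the pods-defaulting webhook more headroom during API-server
# disruption (the error messages earlier in this issue show timeout=2s).
kubectl patch mutatingwebhookconfiguration \
  pods.defaulting.webhook.kafka.eventing.knative.dev \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 10}]'
```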

My working hypothesis is that the quota issue combined with any pod movements/restarts results in this type of "stuck" behavior, which makes sense. Combined with a cluster BOM update, which on its own can cause API timeouts, this got us into a state that could not be automatically recovered once the quota issue was resolved.

Expected behavior

To Reproduce Potentially:

mdwhitley commented 2 days ago

We had another cluster start a BOM update tonight. I had already put in place the 1.15 upgrade + timeout changes and as soon as the update began on the cluster, all triggers began flipping from Ready to ConsumerBinding. With the 10s timeout the error returned from kafka-controller:

{"level":"error","ts":"2024-11-21T00:12:02.752Z","logger":"kafka-broker-controller","caller":"controller/controller.go:564","msg":"Reconcile error","commit":"7092bb9-dirty","knative.dev/pod":"kafka-controller-7b9d5f8f95-qhdxn","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumer.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.Consumer","knative.dev/traceid":"a32f1de7-3e78-478b-8ef3-d8b5800b1980","knative.dev/key":"conversation/839dd499-424e-40f3-9dff-bff69e9c6a2b-n4cq4","duration":5.197144629,"error":"failed to bind resource to pod: Internal error occurred: failed calling webhook \"pods.defaulting.webhook.kafka.eventing.knative.dev\": failed to call webhook: Post \"https://kafka-webhook-eventing.knative-eventing.svc:443/pods-defaulting?timeout=10s\": http: server gave HTTP response to HTTPS client","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20240716082220-4355f0c73608/controller/controller.go:564\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20240716082220-4355f0c73608/controller/controller.go:541\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20240716082220-4355f0c73608/controller/controller.go:489"}

After around 20 minutes, all our triggers are now in the unavailable state. No quota issues in this case. Just a working 1.15 installation one minute and then failing with the above error on BOM update.

pierDipi commented 1 day ago

@mdwhitley maybe you get http: server gave HTTP response to HTTPS client because the webhook server secret kafka-webhook-eventing-certs is not present.

That secret is populated by Knative to serve HTTPS requests for the defaulting webhook [1] and is part of the released artifacts [2], so it's expected to exist. When it is not present, the webhook defaults to serving no certs [3].

[1] https://github.com/knative/pkg/blob/a7fd9b10bb9febf537db69284723a5337adc0a50/webhook/certificates/certificates.go#L65-L110 [2] https://github.com/knative-extensions/eventing-kafka-broker/blob/main/control-plane/config/eventing-kafka-broker/200-webhook/400-webhook-secret.yaml [3] https://github.com/knative/pkg/blob/a7fd9b10bb9febf537db69284723a5337adc0a50/webhook/webhook.go#L194-L198

Can you confirm the state of the kafka-webhook-eventing-certs secret when you get http: server gave HTTP response to HTTPS client?

Another part that we could improve is to define a PodDisruptionBudget that forces at least 1 (or more) webhook instance to be available at any given point, so that webhook unavailability cannot cause pods to be created without the volume definition.
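A minimal sketch of such a PodDisruptionBudget (the label selector here is an assumption and would need to match the actual kafka-webhook-eventing deployment's pod labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-webhook-eventing
  namespace: knative-eventing
spec:
  minAvailable: 1  # keep at least one webhook replica up during node drains
  selector:
    matchLabels:
      app: kafka-webhook-eventing  # assumed label; verify against the deployment
```

Worth noting: a PDB only protects against voluntary disruptions (drains/evictions), so it would help with rolling node maintenance like the BOM updates described here, but not against involuntary node failures.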

mdwhitley commented 1 day ago

@pierDipi The secret is present and does not appear to have been recreated since our original deployment:

$ k get secrets -n knative-eventing
NAME                                          TYPE                                  DATA   AGE
eventing-controller-token-fstts               kubernetes.io/service-account-token   3      596d
eventing-webhook-certs                        Opaque                                3      596d
eventing-webhook-token-bdn52                  kubernetes.io/service-account-token   3      596d
kafka-broker-secret                           Opaque                                4      596d
kafka-controller-token-cch7j                  kubernetes.io/service-account-token   3      596d
kafka-webhook-eventing-certs                  Opaque                                3      596d
kafka-webhook-eventing-token-rnfgg            kubernetes.io/service-account-token   3      596d
knative-eventing-alert-secret                 Opaque                                2      596d
knative-kafka-broker-data-plane-token-dl588   kubernetes.io/service-account-token   3      596d
pingsource-mt-adapter-token-92f4d             kubernetes.io/service-account-token   3      596d
pierDipi commented 1 day ago

If you try to do a TLS handshake with the kafka-webhook-eventing server, does it succeed? Do you see any relevant logs?

mdwhitley commented 1 day ago

At present, yes. Checking from one of our dead letter pods that has openssl:

$ k exec {pod} -- bash -c "openssl s_client -connect kafka-webhook-eventing.knative-eventing.svc:443 -showcerts"
Connecting to 172.21.71.107
CONNECTED(00000003)
depth=0 O=knative.dev, CN=kafka-webhook-eventing.knative-eventing.svc
verify error:num=18:self-signed certificate
verify return:1
depth=0 O=knative.dev, CN=kafka-webhook-eventing.knative-eventing.svc

I did not have debug logging enabled on the cluster last night, though, and no errors presented in the webhook pods.

mdwhitley commented 1 day ago

The BOM updates in question are upgrades from v1.28.14 to v1.28.15. In cases where we had a fully working 1.15 install, as soon as master nodes began maintenance, triggers went down and the webhook HTTP errors began.

In our clusters that have had a BOM update without disruption to Knative (dev/stg), those were upgraded v1.28.15 to v1.29.10.

mdwhitley commented 1 day ago

I've pulled the kafka-webhook logs and can observe that most traffic to the pods stops around the time the BOM update begins:

[screenshot: kafka-webhook request log volume dropping off as the BOM update begins]

The incident begins around 18:56. No more remote admission controller calls occur once everything goes down.

pierDipi commented 16 hours ago

What is a BOM update? In terms of operations, what is it actually doing?

pierDipi commented 16 hours ago

As soon as master nodes began maintenance, that is when triggers went down

When you say triggers went down, what does that actually mean? Did they stop sending events, did their status go to not-ready, or both?

mdwhitley commented 12 hours ago

what is BOM update? In terms of operations, etc, what is that doing?

BOM updates for our clusters upgrade the Kubernetes version (to v1.28.15) and apply OS updates and other vulnerability fixes across the nodes. They start with the master nodes, then edge nodes, then worker nodes.

when you say, triggers went down, what does that actually mean? Stopped sending events or the status went to not ready? or both?

The Trigger is reporting Ready=False with Reason=BindInProgress. While in this state, events continue to flow through the existing dispatcher pods. Once one of the pods goes down, moved due to node maintenance, it gets stuck in the Terminating state because its finalizers hang in kafka-controller due to the webhook errors. Inside the terminating dispatcher pod all threads have exited, so no more work is being done, and all triggers handled by that pod are completely offline and will not recover.