Some initial thoughts after looking at logs and Grafana a bit...
There are some Azure EventHub connectivity issues which seem to be ancillary - usually the pod in question just restarts as we'd hope/expect.
It seems the `default-kne-trigger-dgl-p1-dispatcher` dispatcher pod (and the others) went down around 3:15am GMT and was not re-created until around 11:45am GMT when you restarted the controller. We don't know whether the associated Deployment was gone during that time, but are assuming so from the info provided above.
There are no Knative-Kafka Controller logs during this ~8.5hr gap, but from Grafana it appears the pod was running.
The Controller logs don't have any record of having deleted deployments on 10-23 (one instance on 10-22), so it is unclear how they came to be removed.
It seems that the controller was not in a good state and didn't re-reconcile the missing deployments, which is something we've never encountered before.
Grafana showed a burst of network/disk/CPU activity around the time of the problem, but it's inconclusive whether that is related at all. Also, Pod Utilization in Grafana is at 97% (we're not sure about that metric).
After some further investigation we have a possible explanation for the scenario described by this Issue. We are not able to conclusively state that this hypothesis is accurate, but it is the best explanation we've found that fits the description.
The deprecated knative-kafka implementation relies upon cross-namespace owner references, which are not supported by the Kubernetes specification. Despite this, Kubernetes usually handles cascading garbage collection for these resources successfully. It seems, however, that there are scenarios in which the Kubernetes garbage collector can wrongly delete such resources. See the following issue for details (specifically the comments from Feb 7 and May 4, which describe a scenario similar to this one):
https://github.com/kubernetes/kubernetes/issues/65200
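For illustration, here is a minimal Go sketch of the object layout in question (the names, namespace placement, API version, and UID are placeholders for illustration, not taken from the cluster): a Dispatcher Deployment whose owner reference points at a KafkaChannel that actually lives in a different namespace. Because an `OwnerReference` has no namespace field, the garbage collector resolves the owner in the dependent's own namespace; if it can't find it there, it may treat the Deployment as orphaned and delete it.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical Dispatcher Deployment in a user namespace ("dgl-p1").
	dispatcher := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "default-kne-trigger-dgl-p1-dispatcher",
			Namespace: "dgl-p1",
			// The owner reference names a KafkaChannel that actually lives in a
			// DIFFERENT namespace. OwnerReference carries no namespace field, so
			// the garbage collector looks the owner up in the dependent's own
			// namespace, does not find it, and may collect the Deployment as an
			// orphan.
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "messaging.knative.dev/v1alpha1", // placeholder API version
				Kind:       "KafkaChannel",
				Name:       "kne-trigger",                            // placeholder owner name
				UID:        "00000000-0000-0000-0000-000000000000", // placeholder UID
			}},
		},
	}

	fmt.Printf("owner of %s/%s: %+v\n",
		dispatcher.Namespace, dispatcher.Name, dispatcher.OwnerReferences[0])
}
```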
We are theorizing that something similar occurred here, and that the Kubernetes garbage collector deleted the Dispatcher's Deployment. It is still unclear why the Controller only re-reconciled the Dispatcher Deployments in the `kyma-integration` namespace, but not those in the user namespaces (`dgl-p1`, etc.).
Also, it should be noted that the broker-filter logs are still full of errors/retries, which put unnecessary load on the eventing infrastructure. The offending Subscriptions should be fixed or stopped to alleviate this burden.
And finally (fwiw)... I was able to verify that the KafkaChannel resources appear to NOT have been deleted/recreated during this downtime.
Workaround delivered (updated channel controller with configurable timeout for forced reconcile).
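For reference, a rough sketch of how such a forced-reconcile interval is typically made configurable in a client-go based controller (the flag name and wiring below are illustrative, not the actual knative-kafka code): the informer factory's resync period re-delivers every cached object (e.g. each KafkaChannel) to the controller's handlers at that interval even without change events, so a Deployment deleted behind the controller's back is noticed and re-created on the next pass.

```go
package main

import (
	"flag"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical flag; the real controller exposes its own config mechanism.
	resync := flag.Duration("resync-period", 10*time.Minute,
		"interval at which every cached object is re-queued for reconciliation, "+
			"even if no change event was observed (forced reconcile)")
	kubeconfig := flag.String("kubeconfig", "", "path to a kubeconfig file (empty = in-cluster)")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The resync period on the shared informer factory causes all cached objects
	// to be periodically re-delivered to the controller's event handlers, so a
	// Deployment deleted behind the controller's back is eventually noticed and
	// re-created during reconcile.
	factory := informers.NewSharedInformerFactory(client, *resync)

	// Informers and event handlers for the controller's resources would be
	// registered on the factory here, before starting it.

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}
```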
Env:
Kyma version: 1.13.0 in Azure K8s
Eventing config: Kafka based with Azure Event Hub
Eventing source: CCv2

Description:
Complete eventing breakdown. The last event was at 5:30 in the morning (see screenshot), and I see that there is no deployment for the `kne-trigger-dispatcher` (only for dispatchers). Also note that the dispatcher deployment resources were re-created at roughly that time.
Note that the `integration-dispatcher` deployments were re-created 8 hours ago and the linked `kne-trigger-dispatcher`s are missing.

We restarted those two as they had issues connecting to Azure EventHub, and then the `kne-trigger-dispatcher`s were back again. Then events were flowing again.