Some initial thoughts after looking at logs and Grafana a bit...
There are some Azure EventHub connectivity issues which seem to be ancillary - usually the pod in question just restarts as we'd hope/expect.
It seems the `default-kne-trigger-dgl-p1-dispatcher` dispatcher pod (and the others) went down around 3:15am GMT and was not re-created until around 11:45am GMT when you restarted the controller. We don't know whether the associated Deployment was gone during that time, but are assuming so from the info provided above.
There are no Knative-Kafka Controller logs during this ~8.5hr gap, but from Grafana it appears the pod was running.
The Controller logs don't have any record of having deleted deployments on 10-23 (one instance on 10-22), so it is unclear how they came to be removed.
It seems that the controller was not in a good state and didn't re-reconcile the missing deployments, which is something we've never encountered before.
Grafana showed a burst of network/disk/CPU activity around the time of the problem, but it's inconclusive whether that is related at all. Also, Pod Utilization in Grafana is at 97% (we're not sure about that metric).
After some further investigation we have a possible explanation for the scenario described by this Issue. We are not able to conclusively state that this hypothesis is accurate, but it is the best explanation we've found that fits the description.
The deprecated knative-kafka implementation relies upon cross-namespace owner references, which are not supported by the Kubernetes specification. Despite this, Kubernetes usually handles cascading garbage collection for these resources successfully. It seems, however, that there are scenarios in which the Kubernetes garbage collector can wrongly delete such resources. See the following issue for details (specifically the comments from Feb 7 and May 4, which describe a scenario similar to this one):
https://github.com/kubernetes/kubernetes/issues/65200
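For illustration, here is a minimal Go sketch of the object layout in question (the names, namespace placement, API version, and UID are placeholders for illustration, not taken from the cluster): a Dispatcher Deployment whose owner reference points at a KafkaChannel that actually lives in a different namespace. Because an `OwnerReference` has no namespace field, the garbage collector resolves the owner in the dependent's own namespace; if it can't find it there, it may treat the Deployment as orphaned and delete it.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical Dispatcher Deployment in a user namespace ("dgl-p1").
	dispatcher := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "default-kne-trigger-dgl-p1-dispatcher",
			Namespace: "dgl-p1",
			// The owner reference names a KafkaChannel that actually lives in a
			// DIFFERENT namespace. OwnerReference carries no namespace field, so
			// the garbage collector looks the owner up in the dependent's own
			// namespace, does not find it, and may collect the Deployment as an
			// orphan.
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "messaging.knative.dev/v1alpha1", // placeholder API version
				Kind:       "KafkaChannel",
				Name:       "kne-trigger",                            // placeholder owner name
				UID:        "00000000-0000-0000-0000-000000000000", // placeholder UID
			}},
		},
	}

	fmt.Printf("owner of %s/%s: %+v\n",
		dispatcher.Namespace, dispatcher.Name, dispatcher.OwnerReferences[0])
}
```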
We are theorizing that something similar occurred here, and that the Kubernetes garbage collector deleted the Dispatcher's Deployment. It is still unclear why the Controller only re-reconciled the Dispatcher Deployments in the `kyma-integration` namespace, but not those in the user namespaces (`dgl-p1`, etc.).
Also, it should be noted that the broker-filter logs are still full of errors/retries, which put unnecessary load on the eventing infrastructure. The offending Subscriptions should be fixed or stopped to alleviate this burden.
And finally (fwiw)... I was able to verify that the KafkaChannel resources appear to NOT have been deleted/recreated during this downtime.
Workaround delivered (updated channel controller with configurable timeout for forced reconcile).
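For reference, a rough sketch of how such a forced-reconcile interval is typically made configurable in a client-go based controller (the flag name and wiring below are illustrative, not the actual knative-kafka code): the informer factory's resync period re-delivers every cached object (e.g. each KafkaChannel) to the controller's handlers at that interval even without change events, so a Deployment deleted behind the controller's back is noticed and re-created on the next pass.

```go
package main

import (
	"flag"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical flag; the real controller exposes its own config mechanism.
	resync := flag.Duration("resync-period", 10*time.Minute,
		"interval at which every cached object is re-queued for reconciliation, "+
			"even if no change event was observed (forced reconcile)")
	kubeconfig := flag.String("kubeconfig", "", "path to a kubeconfig file (empty = in-cluster)")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The resync period on the shared informer factory causes all cached objects
	// to be periodically re-delivered to the controller's event handlers, so a
	// Deployment deleted behind the controller's back is eventually noticed and
	// re-created during reconcile.
	factory := informers.NewSharedInformerFactory(client, *resync)

	// Informers and event handlers for the controller's resources would be
	// registered on the factory here, before starting it.

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}
```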
Env:
Kyma version: 1.13.0 in Azure K8s
Eventing config: Kafka based with Azure Event Hub
Eventing source: CCv2

Description:
Complete eventing breakdown. The last event was at 5:30 in the morning (see screenshot), and I see that there is no deployment for the `kne-trigger-dispatcher` (only for dispatchers). Also note that the dispatcher deployment resources were re-created at roughly that time.
Note that the `integration-dispatcher` deployments were re-created 8 hours ago and the linked `kne-trigger-dispatcher`s are missing.

We restarted those two as they had issues connecting to Azure EventHub, and then the `kne-trigger-dispatcher`s were back again. Then events were flowing again.