Open · ppatierno opened this issue 1 year ago
Triaged on 15/06/2023: This seems like a reasonable request. In the `KafkaRoller` we also make use of the Kubernetes Events API to provide additional context to people running Kafka about why these pods are getting restarted. Perhaps we should consider doing something similar here.
We do not really know why they are restarted, so I don't think we can use an event. It also cannot be issued at the pod level as it is for Kafka, because the pod is deleted.
It might be a timely coincidence, but shortly after I got a *Kafka broker TLS certificates updated* event, I couldn't connect to the cluster via an external client (as described either here or here). After extracting the ca-cert again, of course. Since then, I just get these errors:

```
terminated during authentication. This may happen due to any of the following reasons: (1) Authentication failed due to invalid credentials with brokers older than 1.0.0, (2) Firewall blocking Kafka TLS traffic (eg it may only allow HTTPS traffic), (3) Transient network issue. (org.apache.kafka.clients.NetworkClient)
```
Could there be any relation to these rolling deployments?
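For context, the client side of this is just the standard Kafka SSL setup against the re-extracted cluster CA. A minimal sketch (the bootstrap address, truststore path, and password here are placeholders, not values from my setup):

```java
import java.util.Properties;

// Hypothetical sketch of an external client's SSL configuration after
// re-extracting the cluster CA certificate into a PKCS12 truststore.
// All concrete values below are placeholders.
public class ClientSslConfig {

    static Properties sslClientProps(String bootstrap, String truststorePath, String truststorePassword) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);             // external listener address
        props.put("security.protocol", "SSL");                 // TLS listener
        props.put("ssl.truststore.type", "PKCS12");
        props.put("ssl.truststore.location", truststorePath);  // re-extracted ca.p12
        props.put("ssl.truststore.password", truststorePassword);
        return props;
    }

    public static void main(String[] args) {
        Properties props = sslClientProps("my-cluster-kafka-bootstrap:9094", "/tmp/ca.p12", "changeit");
        System.out.println(props.getProperty("security.protocol")); // prints SSL
    }
}
```

If the truststore still contains only the old CA certificate after a renewal, the TLS handshake fails and the client typically logs warnings like the `NetworkClient` one above.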
@karstengresch That does not seem related to me:
When there is the need to re-generate the ZK, K, EO and other components' certificates (because of CA renewal, using a new CA key, and so on), the log from the rolling of ZK and K is pretty clear. For example, for Kafka, it has a section like the following showing the creation of a new server certificate ...
... followed by something like this highlighting the need for rolling the pod and using the new server cert ...
When it comes to doing the same for the EO or CC (Cruise Control), for example, it's really not clear that a pod restart is happening (just watching the log, maybe in a post-mortem analysis). For example, in the following log it's clear that new server certs are generated for EO (TO and UO) and CC, but it's not visible when pods are rolled and ready (and they are rolled, I can confirm that :-)).
I can understand that the rolling of ZK and K pods is handled by `ZooKeeperRoller` and `KafkaRoller`, which have more control, while for deployments like EO and CC (and even KE, Kafka Exporter) we are just relying on a "reconcile" of a Deployment, by patching it and leaving Kubernetes to restart the pod (relying on the `AbstractNamespacedResourceOperator.reconcile` method).

NOTE: The EO and CC rolling is logged when it happens for trusting a new CA cert, because it goes through a different path in the `CaReconciler`.
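To illustrate why the roll is invisible in the operator log, here is a minimal sketch of the mechanism (the annotation name and generation counter are assumptions for illustration, not Strimzi's actual code): the operator only patches an annotation on the Deployment's pod template, and the actual pod restart is then performed by Kubernetes' Deployment controller, so the operator never emits a "rolling pod" line itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: deciding whether patching a Deployment will trigger
// a rolling restart. Kubernetes rolls the pods whenever the pod template
// (including its annotations) changes; the operator never restarts the pod
// directly, which is why there is no obvious log line for the roll.
public class DeploymentRollSketch {

    // Annotation name is an assumption, modeled after Strimzi-style annotations.
    static final String CERT_GENERATION_ANNO = "example.io/server-cert-generation";

    // Returns the desired pod-template annotations; if they differ from the
    // current ones, the Deployment controller performs a rolling update.
    static Map<String, String> patchTemplate(Map<String, String> current, int newCertGeneration) {
        Map<String, String> desired = new HashMap<>(current);
        desired.put(CERT_GENERATION_ANNO, Integer.toString(newCertGeneration));
        return desired;
    }

    static boolean triggersRollingUpdate(Map<String, String> current, Map<String, String> desired) {
        return !current.equals(desired);
    }

    public static void main(String[] args) {
        Map<String, String> current = new HashMap<>();
        current.put(CERT_GENERATION_ANNO, "3");

        // CA renewal produced generation 4: patching changes the template, so K8s rolls the pods.
        Map<String, String> afterRenewal = patchTemplate(current, 4);
        System.out.println("rolls after renewal: " + triggersRollingUpdate(current, afterRenewal));

        // A no-op reconcile leaves the template identical: no roll.
        Map<String, String> noOp = patchTemplate(current, 3);
        System.out.println("rolls on no-op reconcile: " + triggersRollingUpdate(current, noOp));
    }
}
```

The `ZooKeeperRoller`/`KafkaRoller` path, by contrast, deletes pods one by one itself, which is exactly where it can log (or emit an event about) each restart.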