strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

Rolling Deployment(s) like EO, CC and KE should be clearer in the log #8589

Open ppatierno opened 1 year ago

ppatierno commented 1 year ago

When there is a need to regenerate the certificates for ZK, K, EO and other components (because of a CA renewal, the use of a new CA key and so on), the log from the rolling of ZK and K is pretty clear. For example, for Kafka, it has a section like the following showing the creation of a new server certificate ...

2023-05-31 13:20:21 INFO  Ca:382 - Reconciliation #78(timer) Kafka(default/my-cluster): Generating certificate Subject(organizationName='io.strimzi', commonName='my-cluster-kafka', dnsNames=[my-cluster-kafka-0.my-cluster-kafka-brokers.default.svc, my-cluster-kafka-brokers, my-cluster-kafka-bootstrap.default.svc.cluster.local, my-cluster-kafka-bootstrap.default.svc, my-cluster-kafka-brokers.default.svc, my-cluster-kafka-bootstrap.default, my-cluster-kafka-0.my-cluster-kafka-brokers.default.svc.cluster.local, my-cluster-kafka-brokers.default.svc.cluster.local, my-cluster-kafka-bootstrap, my-cluster-kafka-brokers.default], ipAddresses=[]), signed by CA cluster-ca
...
...
...

... followed by something like this highlighting the need for rolling the pod and using the new server cert ...

2023-05-31 13:20:21 INFO  KafkaRoller:551 - Reconciliation #78(timer) Kafka(default/my-cluster): Rolling Pod my-cluster-kafka-0/0 due to [Pod has old cluster-ca certificate generation, cluster-ca certificate renewal, Pod has old revision, Kafka broker TLS certificates updated]
2023-05-31 13:20:21 INFO  KafkaRoller:329 - Reconciliation #78(timer) Kafka(default/my-cluster): Will temporarily skip verifying pod my-cluster-kafka-0/0 is up-to-date due to ForceableProblem: Pod my-cluster-kafka-0 is controller and there are other pods to verify. Non-controller pods will be verified first, retrying after at least 250ms
2023-05-31 13:20:21 INFO  KafkaRoller:551 - Reconciliation #78(timer) Kafka(default/my-cluster): Rolling Pod my-cluster-kafka-1/1 due to [Pod has old cluster-ca certificate generation, cluster-ca certificate renewal, Pod has old revision, Kafka broker TLS certificates updated]
2023-05-31 13:20:22 INFO  PodOperator:54 - Reconciliation #78(timer) Kafka(default/my-cluster): Rolling pod my-cluster-kafka-1

When it comes to doing the same for the EO or CC (Cruise Control), for example, it's really not clear that a pod restart is happening (just by looking at the log, maybe in a post-mortem analysis). For example, in the following log it's clear that new server certs are generated for EO (TO and UO) and CC, but it's not visible when the pods are rolled and ready (and they are rolled, I can confirm that :-)).

2023-05-31 13:21:35 INFO  Ca:382 - Reconciliation #78(timer) Kafka(default/my-cluster): Generating certificate Subject(organizationName='io.strimzi', commonName='my-cluster-entity-topic-operator', dnsNames=[], ipAddresses=[]), signed by CA cluster-ca
2023-05-31 13:21:35 INFO  Ca:382 - Reconciliation #78(timer) Kafka(default/my-cluster): Generating certificate Subject(organizationName='io.strimzi', commonName='my-cluster-entity-user-operator', dnsNames=[], ipAddresses=[]), signed by CA cluster-ca
2023-05-31 13:21:36 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace default
2023-05-31 13:21:36 INFO  AbstractOperator:380 - Reconciliation #78(timer) Kafka(default/my-cluster): Reconciliation is in progress
2023-05-31 13:22:28 INFO  Ca:382 - Reconciliation #78(timer) Kafka(default/my-cluster): Generating certificate Subject(organizationName='io.strimzi', commonName='my-cluster-cruise-control', dnsNames=[my-cluster-cruise-control.default.svc, localhost, my-cluster-cruise-control.default, my-cluster-cruise-control, my-cluster-cruise-control.default.svc.cluster.local], ipAddresses=[]), signed by CA cluster-ca
2023-05-31 13:22:36 INFO  AbstractOperator:380 - Reconciliation #78(timer) Kafka(default/my-cluster): Reconciliation is in progress
2023-05-31 13:22:49 INFO  CrdOperator:133 - Reconciliation #78(timer) Kafka(default/my-cluster): Status of Kafka my-cluster in namespace default has been updated
2023-05-31 13:22:49 INFO  OperatorWatcher:38 - Reconciliation #203(watch) Kafka(default/my-cluster): Kafka my-cluster in namespace default was MODIFIED
2023-05-31 13:22:49 INFO  AbstractOperator:510 - Reconciliation #78(timer) Kafka(default/my-cluster): reconciled
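To make the gap concrete for a post-mortem, here is a minimal sketch (not operator code) that scans operator log lines and reports which certificate common names were regenerated without any matching "Rolling pod" message. The two regular expressions are derived only from the log lines quoted above; the function name `certs_without_roll` and the prefix-based matching between a pod name and a common name are assumptions made for illustration.

```python
import re

# Patterns based on the log lines quoted in this issue.
CERT_RE = re.compile(r"Generating certificate Subject\(.*?commonName='([^']+)'")
ROLL_RE = re.compile(r"Rolling [Pp]od (\S+?)(?:/\d+)?(?:\s|$)")


def certs_without_roll(log_lines):
    """Return common names that got a new certificate but no roll message.

    Hypothetical helper: a pod is assumed to belong to a component when its
    name equals the common name or starts with "<commonName>-".
    """
    generated = []
    rolled = []
    for line in log_lines:
        cert = CERT_RE.search(line)
        if cert and cert.group(1) not in generated:
            generated.append(cert.group(1))
        roll = ROLL_RE.search(line)
        if roll:
            rolled.append(roll.group(1))
    return [
        cn for cn in generated
        if not any(pod == cn or pod.startswith(cn + "-") for pod in rolled)
    ]
```

Fed with the two log excerpts above, this would be expected to flag the entity-topic-operator, entity-user-operator and cruise-control common names, while my-cluster-kafka is covered by the KafkaRoller and PodOperator messages. The prefix heuristic is naive: the EO certificates are named per operator (TO and UO), so they would not necessarily line up with a pod name even if a roll were logged.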

I can understand that the rolling of ZK and K pods is handled by ZooKeeperRoller and KafkaRoller, which have more control, while for deployments like EO and CC (even KE, Kafka Exporter) we just rely on a "reconcile" of a Deployment, patching it and leaving Kubernetes to restart the pod (relying on the AbstractNamespacedResourceOperator.reconcile method).
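One possible direction, sketched here in Python rather than in the operator's own Java and without touching any Strimzi API: Kubernetes only starts a Deployment rollout when the pod template (`.spec.template`) changes, so the patch step could compare the current and desired templates and log the outcome. The function names, the log wording and the `example.io/server-cert-hash` annotation below are all hypothetical and exist only for illustration.

```python
import copy


def pod_template_changed(current, desired):
    """A Deployment rollout is triggered only when .spec.template differs."""
    return current["spec"]["template"] != desired["spec"]["template"]


def describe_patch(name, current, desired):
    """Build a log message for a Deployment patch (wording is hypothetical)."""
    if pod_template_changed(current, desired):
        return f"Deployment {name}: pod template changed, pods will be rolled by Kubernetes"
    return f"Deployment {name}: pod template unchanged, no rolling expected"


# Hypothetical example: a certificate hash stored as a pod template annotation.
current = {
    "spec": {
        "replicas": 1,
        "template": {
            "metadata": {"annotations": {"example.io/server-cert-hash": "aaa"}}
        },
    }
}
desired = copy.deepcopy(current)
desired["spec"]["template"]["metadata"]["annotations"]["example.io/server-cert-hash"] = "bbb"

print(describe_patch("my-cluster-cruise-control", current, desired))
```

A plain dictionary comparison like this only works when both objects have the same shape. A Deployment read back from the API server carries defaulted fields, so a real implementation would need a more careful diff.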

NOTE: The EO and CC rolling is logged when it happens for trusting a new CA cert, because that goes through a different code path in the CaReconciler.

tombentley commented 1 year ago

Triaged on 15/06/2023: This seems like a reasonable request. In the KafkaRoller we also make use of the Kubernetes Event API to give people running Kafka additional context about why these pods are getting restarted. Perhaps we should consider doing something similar here.

scholzj commented 1 year ago

We do not really know why they are restarted, so I don't think we can use an event. It also cannot be issued at the pod level, as it is for Kafka, because the pod is deleted.

karstengresch commented 6 months ago

It might just be a coincidence in timing, but shortly after I got a "Kafka broker TLS certificates updated" event, I could no longer connect to the cluster from an external client (as described either here or here).

After extracting the ca-cert again, of course. Since then, I just get these

terminated during authentication. This may happen due to any of the following reasons: (1) Authentication failed due to invalid credentials with brokers older than 1.0.0, (2) Firewall blocking Kafka TLS traffic (eg it may only allow HTTPS traffic), (3) Transient network issue. (org.apache.kafka.clients.NetworkClient)

errors.

Could there be any relation to these rolling deployments?

scholzj commented 6 months ago

@karstengresch That does not seem related to me: