strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Enhancement]: easy way to identify restart of PODs delayed by the cluster operator #9272

Open dadufour opened 1 year ago

dadufour commented 1 year ago

Related problem

When we apply the annotation strimzi.io/manual-rolling-update=true to have a Pod restarted by the cluster operator, it would be useful to easily see that the operator has actually picked up the request and, more importantly, to see when something is currently preventing the operator from restarting the Pod (for example, under-replicated partitions in the cluster).

This is particularly needed when using the drain cleaner. Let's assume someone needs to perform maintenance on some k8s nodes and, for this, they need to evict all Pods on these nodes. With the drain cleaner, the eviction request is 'delegated' to the cluster operator. But if the operator then decides that it can't bring the Pod down right now, the eviction is delayed and the maintenance operation is stuck until the operator decides to restart the Pod.

Currently, one way to verify what the cluster operator is doing in this situation is to look at its logs, which is not convenient at all. So in the end, the person doing the k8s maintenance is blind and doesn't understand what is preventing their node from being drained.
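To illustrate the kind of check people fall back on today, here is a minimal sketch that lists the Pods still carrying the strimzi.io/manual-rolling-update annotation. It works on JSON shaped like the output of `kubectl get pods -o json`. It assumes a Pod that still carries the annotation has not been rolled yet; the function name and the sample Pod names are made up for the example. It only shows that a restart is pending, not why it is delayed.

```python
import json

# Annotation named in this issue; set on a Pod to request a restart by the operator.
ANNOTATION = "strimzi.io/manual-rolling-update"


def pods_awaiting_restart(pod_list):
    """Return (namespace, name) for every Pod still annotated for a manual rolling update.

    `pod_list` is a dict shaped like the output of `kubectl get pods -o json`.
    Assumption: a Pod that still has the annotation has not been restarted yet.
    """
    pending = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        annotations = meta.get("annotations") or {}  # the field may be absent or null
        if annotations.get(ANNOTATION) == "true":
            pending.append((meta.get("namespace", ""), meta.get("name", "")))
    return pending


# Hypothetical sample data, not taken from a real cluster.
sample = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "kafka", "name": "my-cluster-kafka-0",
                  "annotations": {"strimzi.io/manual-rolling-update": "true"}}},
    {"metadata": {"namespace": "kafka", "name": "my-cluster-kafka-1",
                  "annotations": {}}},
    {"metadata": {"namespace": "kafka", "name": "my-cluster-kafka-2"}}
  ]
}
""")

print(pods_awaiting_restart(sample))
```

Running this across more than 100 k8s clusters is exactly the kind of manual work this request would like to avoid.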

Suggested solution

The cluster operator should provide easily accessible information about Pods that have been requested to restart but can't be restarted currently (whatever the reason). Ideally, dedicated metrics would be a good solution, as they make monitoring and alerting easy.
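As a rough idea of what such a metric could look like, the sketch below renders a gauge in the Prometheus text exposition format, with one sample per Pod whose restart is pending. The metric name strimzi_pending_pod_restarts, its labels, and the reason string are purely hypothetical; nothing like this exists in the operator today, and the real design would need the proposal mentioned in the discussion.

```python
def escape_label(value):
    """Escape a label value per the Prometheus text format (backslash, quote, newline)."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_pending_restart_metric(pending):
    """Render a hypothetical gauge listing Pods whose requested restart is delayed.

    `pending` is an iterable of (namespace, pod, reason) tuples.
    The metric name and label names are assumptions made for this example.
    """
    lines = [
        "# HELP strimzi_pending_pod_restarts Pods requested to restart but not yet restarted.",
        "# TYPE strimzi_pending_pod_restarts gauge",
    ]
    for namespace, pod, reason in pending:
        labels = (
            f'namespace="{escape_label(namespace)}",'
            f'pod="{escape_label(pod)}",'
            f'reason="{escape_label(reason)}"'
        )
        lines.append(f"strimzi_pending_pod_restarts{{{labels}}} 1")
    return "\n".join(lines) + "\n"


# Hypothetical example: one Pod held back because of under-replicated partitions.
rows = [("kafka", "my-cluster-kafka-0", "under-replicated partitions")]
text = render_pending_restart_metric(rows)
print(text, end="")
```

With something like this, the team doing the node maintenance could alert on any sample that stays present for too long, without needing to read the operator logs.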

Alternatives

Possibly, k8s events could be a solution, although a less convenient one because the user needs to know which resource the events are recorded on.

Additional context

In our company, we are using the Strimzi cluster operator to deploy several hundred Kafka clusters into more than 100 k8s clusters. Hence it is crucial that the team performing maintenance on k8s can identify that a Kafka Pod is not being evicted because of a Kafka health issue. It is also important for the team operating Kafka to be notified that a maintenance operation may be stuck. Because of the number of instances, the information should be easily accessible even to people without any knowledge of Strimzi or Kafka.

ppatierno commented 11 months ago

Discussed on the community call on 02.11.2023: this seems to be a useful feature to have, but it needs further discussion and a proposal.