Kafka health checks and topics replication

Patanouk commented 2 years ago

Issue description

Hello,

We are using the Micronaut Kafka library and have enabled the Kafka health check, so that our apps are marked as unhealthy if our Kafka cluster goes down.

We have a Kafka cluster with 3 brokers. Our Kafka producers are configured with Acknowledge.ALL. We have the following configuration for our Kafka brokers:

offsets.topic.replication.factor: 3
min.insync.replicas: 2

According to the Kafka documentation for min.insync.replicas, a Kafka producer with Acknowledge.ALL should be able to write to a Kafka cluster as long as at least 2 in-sync replicas acknowledge the write.

However, the health check in KafkaHealthIndicator compares offsets.topic.replication.factor to the number of available nodes.

So in our case, with 3 brokers, a rollout restart of the Kafka brokers will make one of the brokers unavailable. Hence, the Kafka health check in the Micronaut application fails when we rollout restart our Kafka cluster (as only 2 nodes are healthy), even though our producers should still be able to write to Kafka.
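To make the difference concrete, here is a minimal sketch of the two criteria side by side. The class and method names are hypothetical and are not part of Micronaut Kafka or the Kafka client. It also assumes, as a simplification, that the number of available nodes is a fair stand-in for the number of in-sync replicas; in a real cluster that is tracked per partition.

```java
// Hypothetical illustration only: HealthCriteria is not a Micronaut Kafka or Kafka client class.
final class HealthCriteria {

    private HealthCriteria() {
    }

    /**
     * Criterion described in this issue: healthy only when the number of
     * available nodes is at least offsets.topic.replication.factor.
     */
    static boolean healthyByReplicationFactor(int availableNodes, int offsetsTopicReplicationFactor) {
        return availableNodes >= offsetsTopicReplicationFactor;
    }

    /**
     * Alternative criterion: healthy when enough nodes are available to
     * satisfy min.insync.replicas for a producer using acks=all.
     * Simplification: treats available nodes as in-sync replicas.
     */
    static boolean healthyByMinInSyncReplicas(int availableNodes, int minInSyncReplicas) {
        return availableNodes >= minInSyncReplicas;
    }
}
```

With the values from this issue (replication factor 3, min.insync.replicas 2) and one broker restarting, `healthyByReplicationFactor(2, 3)` is false while `healthyByMinInSyncReplicas(2, 2)` is true, which is the mismatch described above.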

Is there a reason to prefer offsets.topic.replication.factor over min.insync.replicas in the health check? I'm relatively new to Kafka, so I may well be missing something here.

GeitV commented 2 years ago

Having the same issue. Pretty critical for us, as the health endpoint is used by K8S to check whether pods are healthy. All of our pods went unhealthy because one Kafka node (out of 3) was restarting.

Patanouk commented 2 years ago

If that helps, we ended up writing our own health indicator by creating a class with a @Replaces(KafkaHealthIndicator.class) annotation.
The health check class tries to write a message to the Kafka cluster and returns unhealthy if the write fails a couple of times in a row.
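A minimal sketch of the "fails a couple of times in a row" logic, kept free of Micronaut and Kafka dependencies so it can be read on its own. WriteProbeHealth, the writeProbe supplier and failureThreshold are hypothetical names for illustration, not the actual class we wrote. In a real indicator the probe would send a record through a Kafka producer and wait for the acknowledgement, and the result would be exposed through the class annotated with @Replaces(KafkaHealthIndicator.class).

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: WriteProbeHealth is not a Micronaut or Kafka class.
final class WriteProbeHealth {

    // Assumed to return true when a test write to Kafka was acknowledged.
    private final BooleanSupplier writeProbe;
    // Number of consecutive failed probes after which we report unhealthy.
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    WriteProbeHealth(BooleanSupplier writeProbe, int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be at least 1");
        }
        this.writeProbe = writeProbe;
        this.failureThreshold = failureThreshold;
    }

    /** Runs one probe and returns true while the app should still be reported healthy. */
    synchronized boolean check() {
        boolean succeeded;
        try {
            succeeded = writeProbe.getAsBoolean();
        } catch (RuntimeException e) {
            // A probe that throws counts as a failed write.
            succeeded = false;
        }
        // Any successful write resets the streak; a failure extends it.
        consecutiveFailures = succeeded ? 0 : consecutiveFailures + 1;
        return consecutiveFailures < failureThreshold;
    }
}
```

Requiring several failures in a row means a single slow or failed write during a broker restart does not flip the pod to unhealthy straight away.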

I would be happy to open a PR if the maintainers of the project agree with the approach proposed above.

graemerocher commented 1 year ago

@Patanouk PRs welcome

micronaut-projects / micronaut-kafka

Kafka health checks and topics replication #471
