Open pantaoran opened 1 year ago
I'm experiencing the same thing. I get the following message on the Kafka controller every 10 seconds:
[2024-03-07 00:32:32,957] INFO [Controller id=2] Successfully updated assignment of partition __strimzi_canary-1 to
ReplicaAssignment(replicas=2,3,1, addingReplicas=, removingReplicas=, observers=, targetObservers=None) (kafka.controller.KafkaController)
We observed that when EXPECTED_CLUSTER_SIZE is not set (or is explicitly set to -1), the measured produce latencies degrade severely. It seems that before (or during?) every produce request, Canary tries to micro-manage the replicas and their leaders for the canary topic on the Kafka cluster. This takes a lot of time and processing, and results in extremely slow responses to the produce requests.
Average latencies as reported when EXPECTED_CLUSTER_SIZE is set correctly: 3-5 ms
Average latencies as reported when EXPECTED_CLUSTER_SIZE is NOT set: 1000-2000 ms
Somehow, whatever Canary does on the cluster slows everything down dramatically. It also leads to an explosion in log volume. With the correct setting, my otherwise empty brokers (2-broker cluster, no clients running except Canary) logged around 8 lines per minute. When the cluster size setting is missing, they logged around 500 lines per minute (the Canary reconcile interval was the default of 10 seconds).
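For anyone who wants to compare log volume the same way, here is a minimal sketch of how lines per minute could be counted from broker log output. It assumes every log entry starts with a timestamp in the format of the controller message above; the function name `lines_per_minute` is my own, and the second and third sample lines are shortened placeholders, not real broker output.

```python
import re
from collections import Counter

# Matches the leading timestamp of a Kafka log line, e.g. "[2024-03-07 00:32:32,957]",
# and captures it truncated to the minute ("2024-03-07 00:32").
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}\]")


def lines_per_minute(log_lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM' to the number of log entries in that minute."""
    counts = Counter()
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if match:  # lines without a leading timestamp (wrapped text, stack traces) are skipped
            counts[match.group(1)] += 1
    return counts


# The first entry is the start of the controller message quoted above; the other
# two are illustrative placeholders with made-up timestamps.
sample = [
    "[2024-03-07 00:32:32,957] INFO [Controller id=2] Successfully updated assignment of partition __strimzi_canary-1 to",
    "[2024-03-07 00:32:42,000] INFO placeholder entry",
    "[2024-03-07 00:33:02,000] INFO placeholder entry",
]
counts = lines_per_minute(sample)
for minute, n in sorted(counts.items()):
    print(minute, n)
```

Feeding it the broker log once with EXPECTED_CLUSTER_SIZE set and once without should make the difference in volume easy to see.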
I don't know what Canary does in detail or why, but it feels like a bug to me.
The description in the README says that I should expect "more partitions reassignment of the topic while the Kafka cluster is starting up and the brokers are coming one by one", but what I actually observe is that partitions are reassigned on every reconciliation (every 10 seconds), leading to redundant work on the brokers, which causes high produce latencies and increased log volume.
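To illustrate the behaviour I would have expected, here is a rough sketch of a reconcile step that only issues a reassignment when the current assignment actually differs from the desired one. This is not Canary's code, and I don't know how Canary computes its assignment; `desired_assignment`, `reconcile`, and the `alter` callback are hypothetical names, and the rotation-based placement is just an assumption for the example.

```python
def desired_assignment(broker_ids, partitions):
    """Hypothetical placement: rotate the sorted broker list so that each
    partition gets a different first replica (preferred leader)."""
    brokers = sorted(broker_ids)
    result = {}
    for p in range(partitions):
        shift = p % len(brokers)
        result[p] = brokers[shift:] + brokers[:shift]
    return result


def reconcile(current, broker_ids, partitions, alter):
    """Call alter(desired) only when the assignment has to change.

    Returns True if a reassignment was issued, False if it was skipped.
    """
    desired = desired_assignment(broker_ids, partitions)
    if current == desired:
        return False  # nothing to do, so no request reaches the brokers
    alter(desired)
    return True


# Usage: the first pass issues one reassignment, later passes are no-ops.
issued = []                      # stands in for requests sent to the cluster
current = {}                     # topic has no known assignment yet
first = reconcile(current, [1, 2, 3], 3, issued.append)
current = issued[-1]             # pretend the cluster applied the change
second = reconcile(current, [3, 1, 2], 3, issued.append)
print(first, second, len(issued))
```

With a guard like this, a steady-state cluster would see one reassignment after the brokers have come up, rather than one on every 10-second reconciliation.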