strimzi / strimzi-canary

Strimzi canary
Apache License 2.0

Bug: Missing EXPECTED_CLUSTER_SIZE leads to massive load on brokers #221

Open pantaoran opened 1 year ago

pantaoran commented 1 year ago

We observed that when EXPECTED_CLUSTER_SIZE is not set (or is explicitly set to -1), measured produce latencies degrade severely. It seems that before (or during?) every request, Canary tries to micro-manage the replicas and their leaders for the canary topic on the Kafka cluster. This takes a lot of time and processing, and results in extremely slow responses to the produce requests.

Average latencies as reported when EXPECTED_CLUSTER_SIZE is set correctly: 3-5 ms
Average latencies as reported when EXPECTED_CLUSTER_SIZE is NOT set: 1000-2000 ms

Somehow the things that Canary does on the cluster slow everything down dramatically. It also leads to an explosion in logs. With the correct setting, my empty brokers (2-broker cluster, no other clients running except Canary) logged around 8 lines per minute. When the cluster size setting is missing, they logged around 500 lines per minute (the Canary reconcile interval was 10 seconds, the default).

I don't know what Canary does in detail or why, but it feels like a bug to me.

The description in the README says that I should expect more partition reassignments of the topic while the Kafka cluster is starting up and the brokers are coming up one by one. What I actually observe is that partitions are reassigned on every reconciliation (every 10 seconds), leading to redundant work on the brokers, which causes high produce latencies and increased log volume.
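
To illustrate what "redundant work" means here, the following is a minimal sketch of a guard that only requests a reassignment when the desired replica layout actually differs from the current one. It is not Canary's code: the `assignment` type and the `sameReplicas` and `needsReassignment` functions are hypothetical names made up for this example, and I have not checked how Canary decides this internally.

```go
package main

import "fmt"

// assignment maps a partition ID to its ordered list of replica broker IDs.
// Hypothetical type for illustration only; not taken from the Canary source.
type assignment map[int32][]int32

// sameReplicas reports whether two replica lists are identical, order included.
// Order matters because the first replica is the preferred leader.
func sameReplicas(a, b []int32) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// needsReassignment returns true only when the desired layout differs from
// the layout the cluster currently reports.
func needsReassignment(current, desired assignment) bool {
	if len(current) != len(desired) {
		return true
	}
	for partition, want := range desired {
		got, ok := current[partition]
		if !ok || !sameReplicas(got, want) {
			return true
		}
	}
	return false
}

func main() {
	// Example layout for a 3-broker cluster (assumed values).
	current := assignment{0: {1, 2, 3}, 1: {2, 3, 1}, 2: {3, 1, 2}}
	desired := assignment{0: {1, 2, 3}, 1: {2, 3, 1}, 2: {3, 1, 2}}

	if needsReassignment(current, desired) {
		fmt.Println("assignment differs: submit reassignment")
	} else {
		fmt.Println("assignment unchanged: skip reassignment")
	}
}
```

With a check like this, a reconcile loop on a stable cluster would send nothing to the controller, which is the behaviour I expected from the README description.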

mschurenko commented 8 months ago

I'm experiencing the same thing. I get the following message on the Kafka controller every 10 seconds:

[2024-03-07 00:32:32,957] INFO [Controller id=2] Successfully updated assignment of partition __strimzi_canary-1 to ReplicaAssignment(replicas=2,3,1, addingReplicas=, removingReplicas=, observers=, targetObservers=None) (kafka.controller.KafkaController)
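
For anyone who wants to confirm the same pattern in their own controller logs, here is a rough sketch that pulls the timestamp, partition, and replica list out of a message like the one above. The regular expression is based only on this one log line, so other Kafka versions may format it differently. The `update` type and `parseUpdate` function are names invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// assignmentRe matches the controller message shown above. It accepts either
// a space or a line break between "to" and "ReplicaAssignment".
var assignmentRe = regexp.MustCompile(
	`^\[([^\]]+)\] .*Successfully updated assignment of partition (\S+) to\s+` +
		`ReplicaAssignment\(replicas=([0-9]+(?:,[0-9]+)*)`)

// update holds the fields extracted from one controller log message.
type update struct {
	Timestamp string
	Partition string
	Replicas  []string
}

// parseUpdate returns the parsed fields, or false if the line is not an
// assignment-update message.
func parseUpdate(line string) (update, bool) {
	m := assignmentRe.FindStringSubmatch(line)
	if m == nil {
		return update{}, false
	}
	return update{
		Timestamp: m[1],
		Partition: m[2],
		Replicas:  strings.Split(m[3], ","),
	}, true
}

func main() {
	line := "[2024-03-07 00:32:32,957] INFO [Controller id=2] " +
		"Successfully updated assignment of partition __strimzi_canary-1 to " +
		"ReplicaAssignment(replicas=2,3,1, addingReplicas=, removingReplicas=, " +
		"observers=, targetObservers=None) (kafka.controller.KafkaController)"

	if u, ok := parseUpdate(line); ok {
		fmt.Printf("%s partition=%s replicas=%v\n", u.Timestamp, u.Partition, u.Replicas)
	}
}
```

Feeding the controller log through `parseUpdate` and comparing consecutive timestamps per partition should show whether the same replica list is being rewritten on every reconcile interval.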