Bug: Missing EXPECTED_CLUSTER_SIZE leads to massive load on brokers

We observed that when EXPECTED_CLUSTER_SIZE is not set (or explicitly set to -1), this destroyed measured produce latencies. It seems that before (or during?) every request, Canary was trying to micro-manage the replicas and their leaders for the canary topic on the Kafka cluster, which was taking a lot of time and processing, resulting in extremely slow responses to the produce requests.

Average latencies as reported when EXPECTED_CLUSTER_SIZE is set correctly: 3-5ms Average latencies as reported when EXPECTED_CLUSTER_SIZE is NOT set: 1000-2000ms

Somehow the things that canary does on the cluster slow everything down dramatically. It also leads to an explosion in logs. With the correct setting, my empty brokers (2-broker cluster, no other clients running except Canary) logged around 8 lines per minute. When the cluster size setting is missing, they logged around 500 lines per minute (the canary reconcile interval was 10sec=default).

I don't know what Canary does in detail or why, but it feels like a bug to me.

The description in the README says that I should expect more partitions reassignment of the topic while the Kafka cluster is starting up and the brokers are coming one by one, but what I actually observe is that partitions are getting reassigned on every reconciliation (every 10sec), leading to redundant work on the brokers, which cause high produce latencies and increased log volume.

strimzi / strimzi-canary

Bug: Missing EXPECTED_CLUSTER_SIZE leads to massive load on brokers #221