Open rmb938 opened 4 months ago
@rmb938 thanks for the report, do you mind sharing more details/evidence?
Yup, I can provide some more details and evidence. Give me a day or so to re-collect the data; unfortunately I did not save my initial findings.
I believe I am seeing evidence of this in our Kafka cluster. We use MSK, so I'm not sure if that matters at all. What we see is that the disk usage reported by Cruise Control doesn't match reality in the cluster.
All our brokers have a 16 TB disk. According to Cruise Control, broker 5 is using the most disk at 8.55 TB, which is ~53%.
But looking at the AWS CloudWatch metrics, we can see that this is not accurate: broker 5 actually shows 61% disk used, and the top broker by disk usage according to CloudWatch is broker 9 at 67%.
Looking at the Prometheus metric kafka_log_Log_Value, summed over all topic partitions and grouped by broker, we see that it matches what CloudWatch shows: broker 9 has the highest disk usage:
For some reason Cruise Control is not reporting the right sizes. We do have compaction enabled on some fairly large topics, which lines up with what was previously reported. The effect of this inaccuracy is that I now have a pretty large disparity in disk usage between brokers.
When comparing the output of kafka-log-dirs to Cruise Control's partition load REST API, it seems like Cruise Control is showing a smaller disk amount.
This leads to the broker load showing less disk than it should, and to the cluster not balancing disk correctly when disk is set as a goal.
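For reference, the per-broker totals I'm comparing against come from summing the JSON that `kafka-log-dirs.sh --describe` emits. A minimal sketch of that summation (the sample JSON below uses made-up sizes and hypothetical topic names; the field layout matches what the tool emits, but it's worth checking against your Kafka version):

```python
import json

# Sample of the JSON emitted by:
#   kafka-log-dirs.sh --bootstrap-server <broker> --describe
# Sizes here are illustrative values in bytes, not real data.
log_dirs_output = json.loads("""
{
  "version": 1,
  "brokers": [
    {"broker": 5, "logDirs": [
      {"logDir": "/var/lib/kafka/data", "error": null, "partitions": [
        {"partition": "orders-0", "size": 9000000000000, "offsetLag": 0, "isFuture": false},
        {"partition": "events-3", "size": 800000000000, "offsetLag": 0, "isFuture": false}
      ]}
    ]},
    {"broker": 9, "logDirs": [
      {"logDir": "/var/lib/kafka/data", "error": null, "partitions": [
        {"partition": "orders-0", "size": 9600000000000, "offsetLag": 0, "isFuture": false},
        {"partition": "events-3", "size": 1100000000000, "offsetLag": 0, "isFuture": false}
      ]}
    ]}
  ]
}
""")

def broker_disk_totals(data):
    """Sum every replica's on-disk size per broker."""
    totals = {}
    for broker in data["brokers"]:
        size = sum(p["size"]
                   for d in broker["logDirs"]
                   for p in d["partitions"])
        totals[broker["broker"]] = size
    return totals

totals = broker_disk_totals(log_dirs_output)
for broker_id, size in sorted(totals.items()):
    print(f"broker {broker_id}: {size / 1e12:.2f} TB")
```

These totals count each replica's actual bytes on that broker's disk, which is what the comparison with Cruise Control's partition load API is based on.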
Looking into this further, it seems like CC only reports the partition disk size from the leader; it does not also use the partition disk sizes from the followers.
Most of the time the leader and followers have very similar partition sizes, so this doesn't matter much. However, since each Kafka broker runs its log cleaner independently, the sizes of a partition's replicas can all differ.
In extremely large Kafka clusters that have hundreds of terabytes of data and billions of messages per topic this difference does add up and having cruise control be unaware of this when determining broker load does leave the cluster unbalanced.
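To illustrate the failure mode (topic names and sizes below are hypothetical, and "leader-based" is my reading of what CC appears to do, not its actual code): if every replica of a partition is credited with the leader's size, a broker whose follower copies are less compacted gets credited with less disk than it really uses.

```python
# Hypothetical replica layout: (topic-partition, broker, is_leader, size_bytes).
# Compaction runs independently on each broker, so the same partition can
# occupy different amounts of disk on the leader and on each follower.
replicas = [
    ("orders-0", 5, True,  9.0e12),   # leader copy, well compacted
    ("orders-0", 9, False, 9.6e12),   # follower copy, lagging compaction
    ("events-3", 9, True,  1.1e12),
    ("events-3", 5, False, 0.8e12),
]

def leader_based_load(replicas):
    """Credit every replica of a partition with the *leader's* size,
    which is roughly the behavior described in this report."""
    leader_size = {tp: size for tp, _, is_leader, size in replicas if is_leader}
    load = {}
    for tp, broker, _, _ in replicas:
        load[broker] = load.get(broker, 0) + leader_size[tp]
    return load

def actual_load(replicas):
    """Credit each broker with its own replicas' on-disk sizes."""
    load = {}
    for tp, broker, _, size in replicas:
        load[broker] = load.get(broker, 0) + size
    return load

print("leader-based:", leader_based_load(replicas))
print("actual:      ", actual_load(replicas))
```

With these toy numbers, broker 9's real usage is 600 GB higher than the leader-based estimate, and the gap grows with every partition whose follower copy is larger than the leader's.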
In the worst case I have seen, there was a 1-2 TB difference between what kafka-log-dirs reports and what CC reports as broker disk usage; more typically I've seen differences ranging from a few megabytes up to 100-200 GB. That is relatively small compared to the overall cluster size, but without Cruise Control accounting for it, the brokers do end up unbalanced over time.