linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.

Disk Usage with Prometheus Metric Reporter is not showing accurate disk usage #1964

Open davidfarlow43 opened 1 year ago

davidfarlow43 commented 1 year ago

We are using Cruise Control with MSK. Our disk usage in Cruise Control doesn't match the CloudWatch graph for disk used percentage. Since this is MSK, we are using the Prometheus metric sampler. For example, our brokers have 15.5 TB of disk and CloudWatch shows 69% used for broker 1. This would mean that broker 1 has about 10.7 TB used. However, Cruise Control says that broker 1 has 8.46 TB used. It's not entirely clear how disk usage is calculated when using the Prometheus metric sampler.
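To put numbers on the mismatch, here is a quick back-of-the-envelope check using only the figures quoted above. The variable names are just for illustration, and it assumes both tools mean the same thing by "TB".

```python
# Figures quoted in the report above; variable names are illustrative only.
disk_capacity_tb = 15.5         # disk size per broker
cloudwatch_used_pct = 69.0      # CloudWatch "disk used" percentage for broker 1
cruise_control_used_tb = 8.46   # disk usage Cruise Control reports for broker 1

# Usage implied by the CloudWatch percentage.
cloudwatch_used_tb = disk_capacity_tb * cloudwatch_used_pct / 100

# Difference between the two views, and Cruise Control's figure as a percentage.
gap_tb = cloudwatch_used_tb - cruise_control_used_tb
cruise_control_used_pct = cruise_control_used_tb / disk_capacity_tb * 100

print(f"CloudWatch implies {cloudwatch_used_tb:.3f} TB used")
print(f"Cruise Control reports {cruise_control_used_tb:.2f} TB ({cruise_control_used_pct:.1f}% of disk)")
print(f"Gap: {gap_tb:.3f} TB")
```

So the two views differ by a bit over 2 TB on a single broker, which is more than rounding would explain.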

mohitpali commented 1 year ago

Do you have any screenshots or more information on this? Please share your configuration as well. You may have some partition information missing from Open Monitoring.

  1. Cruise Control uses the kafka_log_Log_Value metric for each partition and then sums all partition sizes to get the broker-level disk information.

https://github.com/linkedin/cruise-control/blob/b4e44ec004e6f5e22bd1c4e203d92341ed9e1659/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/sampling/prometheus/DefaultPrometheusQuerySupplier.java#L193

  2. Please check the capacity defined in capacityCores.json as well, to see if the bytes are accurately added into the capacity.

  3. Disk information in the /load API is updated on every run of the sampler. However, sampling (saving samples to the Kafka topic) is paused while an execution is in progress, so you may see a mismatch during an execution because the disk information is not updated.
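To illustrate the first point, here is a rough Python sketch of summing per-partition sizes into a per-broker total. This is not Cruise Control's actual code, only an approximation of the idea. The response shape follows the standard Prometheus HTTP API instant-query format, but the `instance`, `topic`, and `partition` label names, the broker addresses, and the sample values are assumptions and may differ in your Open Monitoring setup.

```python
from collections import defaultdict


def broker_disk_usage_bytes(prometheus_response):
    """Sum per-partition log sizes into one total per broker."""
    totals = defaultdict(float)
    for sample in prometheus_response["data"]["result"]:
        # Assumed label identifying the broker; check your own scrape config.
        broker = sample["metric"]["instance"]
        # Prometheus instant vectors carry [timestamp, "value-as-string"].
        _timestamp, value = sample["value"]
        totals[broker] += float(value)
    return dict(totals)


# Hypothetical response for a per-partition size query (values in bytes).
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "b-1:11001", "topic": "orders", "partition": "0"},
             "value": [1700000000, "1000"]},
            {"metric": {"instance": "b-1:11001", "topic": "orders", "partition": "1"},
             "value": [1700000000, "2500"]},
            {"metric": {"instance": "b-2:11001", "topic": "orders", "partition": "2"},
             "value": [1700000000, "500"]},
        ],
    },
}

usage = broker_disk_usage_bytes(sample_response)
for broker, used in sorted(usage.items()):
    print(f"{broker}: {used:.0f} bytes")
```

With a sum like this, any partition whose size metric is missing from the scraped data is simply not counted, so the broker total comes out lower than what the disk itself reports.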

@efeg @CCisGG Please keep me honest here.