linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Feature Request: Automatic CPU capacity detection via metric reporter #1802

Open kyguy opened 2 years ago

kyguy commented 2 years ago

Currently, broker capacities for resources such as CPU must either be provided by users through a capacity.json file when using the BrokerCapacityConfigFileResolver [1], or be detected by a custom BrokerCapacityConfigResolver plugin [2]. Having the metric reporter detect and report the CPU capacity of brokers would relieve users and third-party applications of this burden! The metric reporter could use the following line to get the number of cores available to the broker:

Runtime.getRuntime().availableProcessors()

and then report that metric in the same manner it does for CPU utilization. Cruise Control could then use this reported capacity value when the allow_capacity_estimation flag is set to true to set the CPU capacities of brokers in the cluster model.
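A minimal sketch of the reporter-side probe, assuming a hypothetical `CpuCapacitySample` class and a hypothetical metric name `BROKER_CPU_CAPACITY` (neither exists in Cruise Control today). It only shows where `availableProcessors()` would be read and how the value could be packaged with a broker id and timestamp, the way a utilization sample is:

```java
/**
 * Hypothetical sketch: packages the number of cores visible to the broker JVM
 * as a capacity sample, analogous to a CPU utilization sample.
 * Neither this class nor METRIC_NAME is part of Cruise Control's API.
 */
final class CpuCapacitySample {
    // Hypothetical metric name, chosen for illustration only.
    static final String METRIC_NAME = "BROKER_CPU_CAPACITY";

    final int brokerId;
    final long timeMs;
    final int cores;

    CpuCapacitySample(int brokerId, long timeMs, int cores) {
        if (cores < 1) {
            throw new IllegalArgumentException("cores must be >= 1, got " + cores);
        }
        this.brokerId = brokerId;
        this.timeMs = timeMs;
        this.cores = cores;
    }

    /** Takes a fresh sample; the JVM may report a different value over time. */
    static CpuCapacitySample sample(int brokerId) {
        // On container-aware JDKs this may reflect cgroup CPU limits
        // rather than the physical core count of the host.
        int cores = Runtime.getRuntime().availableProcessors();
        return new CpuCapacitySample(brokerId, System.currentTimeMillis(), cores);
    }

    @Override
    public String toString() {
        return METRIC_NAME + "{broker=" + brokerId + ", cores=" + cores + ", timeMs=" + timeMs + "}";
    }
}
```

Because the `Runtime.availableProcessors()` Javadoc notes that the value may change during a single run of the JVM, sampling it on every reporting interval, as is already done for CPU utilization, seems safer than reading it once at startup.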

Let me know what you think! If it sounds like a reasonable request, I would be happy to contribute this feature!

[1] https://github.com/linkedin/cruise-control/blob/migrate_to_kafka_2_4/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/config/BrokerCapacityConfigFileResolver.java
[2] https://github.com/linkedin/cruise-control/wiki/Pluggable-Components#broker-capacity-config-resolver

kyguy commented 2 years ago

Any thoughts/concerns on this @efeg?

efeg commented 2 years ago

@kyguy Thanks for the proposal and the offer to contribute! I see the intention is to make it easier for users to resolve the capacity of brokers. To achieve that, I'd recommend using a custom BrokerCapacityConfigResolver, which would help us keep the capacity information self-contained rather than splitting it into the metrics reporter.
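For comparison, the resolver-based approach suggested here keeps capacity lookup in one pluggable place. The sketch below uses a hypothetical, simplified `CpuCapacityResolver` interface rather than the real `BrokerCapacityConfigResolver` signature, and assumes the operator supplies a broker-id-to-cores map (for example, exported from a hardware inventory):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

/** Hypothetical, simplified stand-in for a capacity resolver plugin. */
interface CpuCapacityResolver {
    /** Returns the CPU capacity in cores, or empty if it is unknown. */
    OptionalDouble cpuCoresFor(int brokerId);
}

/** Resolves CPU capacity from an operator-supplied inventory (illustrative). */
final class InventoryCpuCapacityResolver implements CpuCapacityResolver {
    private final Map<Integer, Double> coresByBroker;

    InventoryCpuCapacityResolver(Map<Integer, Double> coresByBroker) {
        // Defensive copy so later changes to the caller's map are not observed.
        this.coresByBroker = new HashMap<>(coresByBroker);
    }

    @Override
    public OptionalDouble cpuCoresFor(int brokerId) {
        Double cores = coresByBroker.get(brokerId);
        return cores == null ? OptionalDouble.empty() : OptionalDouble.of(cores);
    }
}
```

The trade-off raised in the reply that follows is visible here: the resolver is only as good as the external inventory that feeds it.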

> Cruise Control could then use this reported capacity value when the allow_capacity_estimation flag is set to true to set the CPU capacities of brokers in the cluster model.

allow_capacity_estimation has a different use today -- it checks whether a broker capacity can be estimated from other brokers in the cluster in case its capacity information is missing. To ensure backwards compatibility, I'd suggest maintaining the existing behavior.
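A rough sketch of the existing `allow_capacity_estimation` semantics as described above: use the provided capacity when present, estimate it only when the flag allows, and fail otherwise. The `CapacityLookup` class and its averaging strategy are illustrative assumptions, not Cruise Control's actual estimation logic:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of allow_capacity_estimation semantics.
 * Estimating by averaging other brokers' capacities is an assumption
 * made for illustration only.
 */
final class CapacityLookup {
    private final Map<Integer, Double> knownCpuCores;

    CapacityLookup(Map<Integer, Double> knownCpuCores) {
        this.knownCpuCores = new HashMap<>(knownCpuCores);
    }

    double cpuCores(int brokerId, boolean allowCapacityEstimation) {
        Double explicit = knownCpuCores.get(brokerId);
        if (explicit != null) {
            return explicit; // capacity was provided, so no estimation is needed
        }
        if (!allowCapacityEstimation || knownCpuCores.isEmpty()) {
            throw new IllegalStateException("No CPU capacity available for broker " + brokerId);
        }
        // Missing capacity: estimate it from the other brokers (here, their mean).
        return knownCpuCores.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .getAsDouble();
    }
}
```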

kyguy commented 2 years ago

Thanks for the reply @efeg!

> I see the intention is to make it easier for users to resolve the capacity of brokers. To achieve that, I'd recommend using a custom BrokerCapacityConfigResolver, which would help us keep the capacity information self-contained rather than splitting it into the metrics reporter.

Without the help of the CC metrics reporter, a custom BrokerCapacityConfigResolver would depend on a hardware resource management system for capacity information. Having the CC metrics reporter gather capacity information helps Cruise Control be more self-sufficient! Since the metrics reporter already gathers CPU utilization information, wouldn't it make sense to gather CPU capacity information as well, especially since capacity gives the CPU utilization values more context when compared across hosts?
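A worked example of the "context" argument, assuming CPU utilization is expressed as a fraction in [0, 1] of all cores available to the broker: two brokers both at 0.5 utilization are doing very different amounts of work if one has 8 cores and the other 16 (4 versus 8 cores busy). `CpuContext` is a hypothetical helper:

```java
/**
 * Hypothetical helper: scales a utilization fraction by core count so that
 * utilization can be compared across hosts of different sizes.
 * Assumes utilization is a fraction of all cores available to the broker.
 */
final class CpuContext {
    private CpuContext() {}

    /** Cores' worth of work implied by a utilization fraction. */
    static double coresInUse(double utilizationFraction, int cores) {
        if (utilizationFraction < 0.0 || utilizationFraction > 1.0) {
            throw new IllegalArgumentException("utilization must be in [0, 1]");
        }
        if (cores < 1) {
            throw new IllegalArgumentException("cores must be >= 1");
        }
        return utilizationFraction * cores;
    }

    /** Idle capacity remaining, in cores. */
    static double headroomCores(double utilizationFraction, int cores) {
        return cores - coresInUse(utilizationFraction, cores);
    }
}
```

Without the core count, the two brokers in this example would look identical; with it, the 16-core broker clearly has twice the absolute load and twice the headroom.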

> allow_capacity_estimation has a different use today -- it checks whether a broker capacity can be estimated from other brokers in the cluster in case its capacity information is missing. To ensure backwards compatibility, I'd suggest maintaining the existing behavior.

Understood! Maybe a new flag could be created, or the reported capacity could be used as the default value!
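One possible precedence order for either option, sketched under the assumption of a hypothetical `useReportedCpuCapacity` flag (no such flag exists today): an explicitly configured capacity always wins, the reported core count is used only when the flag is on, and otherwise the existing behavior, including `allow_capacity_estimation`, is left untouched:

```java
import java.util.OptionalDouble;

/**
 * Hypothetical precedence rules for a reporter-detected CPU capacity.
 * The useReportedCpuCapacity flag is illustrative, not a real config.
 */
final class CpuCapacityPrecedence {
    private CpuCapacityPrecedence() {}

    static OptionalDouble resolve(OptionalDouble configured,
                                  OptionalDouble reported,
                                  boolean useReportedCpuCapacity) {
        if (configured.isPresent()) {
            return configured; // user-supplied capacity (e.g. capacity.json) takes priority
        }
        if (useReportedCpuCapacity && reported.isPresent()) {
            return reported; // fall back to the core count detected by the reporter
        }
        // Defer to the existing handling (estimation or error) unchanged.
        return OptionalDouble.empty();
    }
}
```

Keeping the reported value behind its own flag, rather than overloading `allow_capacity_estimation`, would preserve backwards compatibility for clusters that already rely on the current estimation behavior.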

[1] https://github.com/kyguy/cruise-control/commit/a5385cbbefcd5602c6c8f142a02f6a40c89d603d