linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Cruise Control with MSK #2146

Open UdayaPriyaKannan opened 5 months ago

UdayaPriyaKannan commented 5 months ago

I'm looking for help getting Cruise Control working against an AWS MSK cluster. I set up the configuration as per these instructions. All the metrics from MSK are pushed to Prometheus, and we are not explicitly filtering any of them. From the Cruise Control host we can also reach the JMX and Node metrics directly on ports 11001 and 11002 of the brokers. I was able to configure the Cruise Control server and UI successfully, but I see the following in the Cruise Control UI:

Kafka cluster state metrics such as partition count and replicas are visible, but the Kafka cluster load, Kafka partition load, and Resource distribution tabs are not available and report a GET request failure.

ERROR: Error processing GET request '/load' due to: 'com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1712057449014] (index [1, -1]). Window index (current: 0, oldest: 0).

I'm not able to dry-run any Kafka cluster administration tasks either. They fail with the same exception as above.

Both Cruise Control and the UI are the latest versions from GitHub. The Kafka version in Amazon MSK is 3.2.0 and the Cruise Control version is 2.5.137. The monitored windows show 0% training.

Initially, we created the CruiseControlMetrics topic manually, since it was not present and auto.create.topics.enable is set to false in the default configuration of MSK nodes. The topics KafkaCruiseControlPartitionMetricSamples and KafkaCruiseControlModelTrainingSamples were created automatically and have data in them, whereas the "CruiseControlMetrics" topic is empty. I also see the following line in the Cruise Control server logs:
App info kafka.consumer for KafkaCruiseControlSampleStore-consumer-unregistered

marcelloromani commented 5 months ago

NotEnoughValidWindowsException means that CC hasn't been able to collect enough data yet about the MSK cluster.

In my experience, metrics from MSK must be fetched from the OpenTelemetry ports using Prometheus. The default instructions do not work because with MSK you can't just "drop a jar in the Kafka server classpath".

I started my journey here: https://docs.aws.amazon.com/msk/latest/developerguide/cruise-control.html
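To see whether Cruise Control has accumulated any windows yet, one option is to poll its `state` REST endpoint and look at the monitor substate. The sketch below is a minimal example, not an official client: the helper names are hypothetical, and the JSON field names (`MonitorState`, `numMonitoredWindows`, `trainingPct`, `state`) are assumptions based on typical responses, so check them against the output of your own Cruise Control version.

```python
import json
import urllib.request


def summarize_monitor_state(state: dict) -> str:
    """Summarize the monitor section of a Cruise Control state response.

    The field names used here are assumptions; missing fields fall back
    to neutral defaults instead of raising.
    """
    monitor = state.get("MonitorState", {})
    windows = monitor.get("numMonitoredWindows", 0)
    training = monitor.get("trainingPct", 0.0)
    status = monitor.get("state", "UNKNOWN")
    if windows == 0:
        verdict = "no valid windows yet; /load will keep failing"
    else:
        verdict = f"{windows} monitored window(s) available"
    return f"monitor={status} training={training}% -> {verdict}"


def fetch_state(base_url: str) -> dict:
    """Fetch the monitor substate as JSON (hypothetical helper).

    base_url is the Cruise Control server address, e.g. http://localhost:9090.
    """
    url = f"{base_url}/kafkacruisecontrol/state?substates=MONITOR&json=true"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Calling `summarize_monitor_state(fetch_state("http://localhost:9090"))` every few minutes should show the window count rising once broker samples are accepted; the host and port here are placeholders for your own deployment.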

UdayaPriyaKannan commented 4 months ago

WARN Skip generating metric sample for broker 2 because the following required metrics are missing [ALL_TOPIC_REPLICATION_BYTES_OUT, ALL_TOPIC_BYTES_OUT, BROKER_PRODUCE_TOTAL_TIME_MS_MEAN, BROKER_FOLLOWER_FETCH_LOCAL_TIME_MS_MAX, ALL_TOPIC_BYTES_IN, BROKER_PRODUCE_REQUEST_QUEUE_TIME_MS_MEAN, BROKER_CONSUMER_FETCH_TOTAL_TIME_MS_MEAN, BROKER_REQUEST_QUEUE_SIZE, ALL_TOPIC_FETCH_REQUEST_RATE, BROKER_CONSUMER_FETCH_REQUEST_QUEUE_TIME_MS_MAX, ALL_TOPIC_MESSAGES_IN_PER_SEC, BROKER_FOLLOWER_FETCH_TOTAL_TIME_MS_MAX, BROKER_CONSUMER_FETCH_LOCAL_TIME_MS_MEAN, BROKER_FOLLOWER_FETCH_REQUEST_QUEUE_TIME_MS_MEAN, ALL_TOPIC_PRODUCE_REQUEST_RATE, BROKER_FOLLOWER_FETCH_REQUEST_RATE, BROKER_PRODUCE_TOTAL_TIME_MS_MAX, BROKER_FOLLOWER_FETCH_LOCAL_TIME_MS_MEAN, BROKER_PRODUCE_LOCAL_TIME_MS_MEAN, BROKER_FOLLOWER_FETCH_TOTAL_TIME_MS_MEAN, BROKER_REQUEST_HANDLER_AVG_IDLE_PERCENT, BROKER_PRODUCE_REQUEST_QUEUE_TIME_MS_MAX, BROKER_CONSUMER_FETCH_LOCAL_TIME_MS_MAX, ALL_TOPIC_REPLICATION_BYTES_IN, BROKER_CONSUMER_FETCH_REQUEST_QUEUE_TIME_MS_MEAN, BROKER_PRODUCE_LOCAL_TIME_MS_MAX, BROKER_FOLLOWER_FETCH_REQUEST_QUEUE_TIME_MS_MAX, BROKER_RESPONSE_QUEUE_SIZE, BROKER_CONSUMER_FETCH_TOTAL_TIME_MS_MAX]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)

I followed the instructions in the developer guide, but a lot of broker metrics are missing. Please help me figure out what's wrong.
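When the warning lists dozens of metrics, grouping them by prefix makes it easier to tell whether a whole family is absent (which points at the scrape or query setup) or only a few individual metrics. This is a small sketch that assumes the log line has the same shape as the WARN message above; the function name and the family labels are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the broker id and the bracketed metric list in the WARN line.
WARN_PATTERN = re.compile(
    r"for broker (\d+) because the following required metrics are missing \[([^\]]*)\]"
)


def missing_metrics_by_family(log_line: str) -> dict:
    """Group the missing metric names in one WARN line by their prefix.

    Returns an empty dict when the line does not match the expected shape.
    """
    match = WARN_PATTERN.search(log_line)
    if match is None:
        return {}
    families = defaultdict(list)
    for name in match.group(2).split(","):
        name = name.strip()
        if not name:
            continue
        if name.startswith("ALL_TOPIC_"):
            family = "ALL_TOPIC"
        elif name.startswith("BROKER_"):
            family = "BROKER"
        else:
            family = "OTHER"
        families[family].append(name)
    return {
        "broker": int(match.group(1)),
        "missing": {key: sorted(names) for key, names in families.items()},
    }
```

Running it over every such line in the server log and comparing the result per broker shows whether all brokers are missing the same families or only some of them.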

micr01996 commented 3 months ago

Hello @UdayaPriyaKannan, have you solved this issue? I'm getting the same error, and I have replicated the configuration from AWS Labs. It seems that, because we are already scraping metrics from MSK, some conflict is occurring. I have left a large window to make sure that CC has enough time to collect the metrics.

UdayaPriyaKannan commented 3 months ago

@micr01996 No, the issue is not solved yet. My training stopped at 20%. I'm able to do a PLE dry run, but the other operations in the Kafka cluster administration tab throw the Not Enough Valid Windows exception. The Kafka partition load and Resource distribution tabs also throw the same exception.