linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.

Cruise control with MSK #1568

Closed akacodemonkey closed 3 years ago

akacodemonkey commented 3 years ago

Looking for some help in getting Cruise Control working against an AWS MSK cluster. I've set up the configuration as per these instructions; the only difference is that I'm hosting everything within K8s. The Prometheus pod appears to be connecting to the exposed JMX and node-exporter ports, with a number of metrics reported. The Cruise Control UI loads up nicely and provides cluster-level information such as replicas, partitions etc. But if I attempt any active operation, for example examining the cluster load, I get the following error on screen:

Both Cruise Control and the UI are the latest from GitHub as of yesterday, and the Prometheus image is prometheus/prometheus:v2.26.0

Error processing GET request '/load' due to: 'com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1621439294521] (index [1, -1]). Window index (current: 0, oldest: 0).'. 

The Cruise Control logs contain the same information:

[2021-05-19 15:44:18,744] INFO Kicking off metric sampling for time range [1621438938744, 1621439058744], duration 120000 ms with timeout 120000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2021-05-19 15:44:19,059] INFO Added 5025 metric values. Skipped 0 invalid query results. (com.linkedin.kafka.cruisecontrol.monitor.sampling.prometheus.PrometheusMetricSampler)
[2021-05-19 15:44:19,059] WARN Broker 1 is missing 9/63 topics metrics and 9/140 leader partition metrics. Missing leader topics: [...]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)
[2021-05-19 15:44:19,059] WARN Broker 2 is missing 14/66 topics metrics and 14/143 leader partition metrics. Missing leader topics: [...]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)
[2021-05-19 15:44:19,059] WARN Broker 3 is missing 8/64 topics metrics and 8/137 leader partition metrics. Missing leader topics: [...]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)
[2021-05-19 15:44:19,060] WARN Skip generating metric sample for broker 2 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
[2021-05-19 15:44:19,060] WARN Skip generating metric sample for broker 1 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
[2021-05-19 15:44:19,060] WARN Skip generating metric sample for broker 3 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
[2021-05-19 15:44:19,060] INFO Generated 0(420 skipped by broker {1=140, 2=143, 3=137}) partition metric samples and 0(3 skipped) broker metric samples for timestamp 1621439058000. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2021-05-19 15:44:19,060] INFO Collected 0 partition metric samples for 0 partitions. Total partition assigned: 420. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2021-05-19 15:44:19,060] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2021-05-19 15:44:19,060] INFO Finished sampling in 315 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2021-05-19 15:44:19,465] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2021-05-19 15:44:49,466] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2021-05-19 15:45:19,466] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2021-05-19 15:45:49,467] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2021-05-19 15:46:18,744] INFO Kicking off metric sampling for time range [1621439058744, 1621439178744], duration 120000 ms with timeout 120000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2021-05-19 15:46:18,983] INFO Added 5025 metric values. Skipped 0 invalid query results. (com.linkedin.kafka.cruisecontrol.monitor.sampling.prometheus.PrometheusMetricSampler)
[2021-05-19 15:46:18,983] WARN Broker 1 is missing 9/63 topics metrics and 9/140 leader partition metrics. Missing leader topics: [...]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)
[2021-05-19 15:46:18,984] WARN Broker 2 is missing 14/66 topics metrics and 14/143 leader partition metrics. Missing leader topics: [JANTEST... BrokerLoad)
[2021-05-19 15:46:18,984] WARN Broker 3 is missing 8/64 topics metrics and 8/137 leader partition metrics. Missing leader topics: [... ]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)
[2021-05-19 15:46:18,984] WARN Skip generating metric sample for broker 2 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
[2021-05-19 15:46:18,984] WARN Skip generating metric sample for broker 3 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
[2021-05-19 15:46:18,984] WARN Skip generating metric sample for broker 1 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
[2021-05-19 15:46:18,985] INFO Generated 0(420 skipped by broker {1=140, 2=143, 3=137}) partition metric samples and 0(3 skipped) broker metric samples for timestamp 1621439178000. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2021-05-19 15:46:18,985] INFO Collected 0 partition metric samples for 0 partitions. Total partition assigned: 420. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2021-05-19 15:46:18,985] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2021-05-19 15:46:18,985] INFO Finished sampling in 241 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2021-05-19 15:46:19,467] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2021-05-19 15:46:49,468] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2021-05-19 15:47:19,468] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2021-05-19 15:47:49,469] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2021-05-19 15:47:53,066] WARN Skipping goal violation detection for RackAwareGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
[2021-05-19 15:47:53,067] WARN Skipping goal violation detection for ReplicaCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
[2021-05-19 15:47:53,067] WARN Skipping goal violation detection for DiskCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
[2021-05-19 15:47:53,067] WARN Skipping goal violation detection for NetworkInboundCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
[2021-05-19 15:47:53,067] WARN Skipping goal violation detection for NetworkOutboundCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
[2021-05-19 15:47:53,068] WARN Skipping goal violation detection for CpuCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
[2021-05-19 15:48:18,744] INFO Kicking off metric sampling for time range [1621439178744, 1621439298744], duration 120000 ms with timeout 120000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)

As it's MSK I'm unable to add the metrics reporter, so do I need traffic flowing through all (or some) of the topics to generate the metrics? It's a lightly used development cluster, so very little data flows. Is my assumption correct that these "not enough snapshots" errors come down to the lack of metrics, since no traffic is flowing through the cluster?

akacodemonkey commented 3 years ago

Looking in further detail, this query always returns null: 1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])). Further investigation showed that our Prometheus pod was only scraping every minute, so no rate could be calculated over the 1m window. So case closed.
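For anyone hitting the same thing: irate() needs at least two samples inside its range window, so with a 60s scrape interval a [1m] range almost never contains two samples and the expression returns no data, which is what starves Cruise Control of BROKER_CPU_UTIL. A minimal sketch of the relevant Prometheus setting (job name and target are placeholders, not from this thread):

```yaml
# Sketch: a global scrape interval well under the 1m range used by
# irate(node_cpu_seconds_total{mode="idle"}[1m]), so each window
# contains several samples. Target hostname is a placeholder.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: kafka-node-exporter
    static_configs:
      - targets:
          - broker-1.example.com:11002   # node exporter (host CPU metrics)
```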

hieu29791 commented 2 years ago

Hi @akacodemonkey. I'm facing a similar issue. I built and deployed Cruise Control (including the UI) to an EKS cluster. Prometheus can scrape metrics from MSK, but Cruise Control has the same issues you posted. Any idea how to fix this?
Cruise control version:

Openjdk11
CRUISE_CONTROL_VERSION=2.5.42 
CRUISE_UI_VERSION=0.3.4

Error: (screenshot attached)

joseppla commented 2 years ago

Put a lower scrape interval, like 15s, and you'll get the rates.

reidmeyer commented 1 year ago

Heyo, I'm also struggling to get Cruise Control working with AWS MSK. I have the same two errors as above.

@akacodemonkey, did you get it to work?

@joseppla, can you confirm: are you suggesting that the scrape interval of Prometheus itself be decreased, or the sampling interval in cruisecontrol.properties, or something else?

I also have little data on my dev cluster.

After setting my Prometheus scrape interval to 15 seconds, still no luck. I am using a custom-deployed Prometheus rather than the AWS one, but that shouldn't make a difference.

But yeah, I'm effectively getting no windows; same error as above.

Perhaps I'm missing certain ACLs; I will try making a new user with more privileges.

Update: still no luck.

[2023-08-29 14:30:57,619] INFO Generated 0(144 skipped by broker {1=48, 2=49, 3=47}) partition metric samples and 0(3 skipped) broker metric samples for timestamp 1693312254000. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)

Why is it skipping?

Perhaps because of: WARN Skip generating metric sample for broker 2 because the following required metrics are missing [BROKER_CPU_UTIL]

I'm on a kafka.t3.small instance; I saw somewhere that might be an issue.

reidmeyer commented 1 year ago

These are the ACLs for my cruise control principal, by the way, in case it's related.

I don't have a metrics reporter running on the MSK cluster, as in the original poster's question.

Lowering the scrape interval didn't help.

I am running kafka.t3.small.

There is little data on the topic. I will wait 10 minutes to be sure I've given the startup enough time.

Current ACLs for resource ResourcePattern(resourceType=TOPIC, name=cruisecontrol., patternType=PREFIXED):
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=WRITE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DELETE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=READ, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=CREATE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)

Current ACLs for resource ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL):
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE, permissionType=ALLOW)

Current ACLs for resource ResourcePattern(resourceType=TOPIC, name=__KafkaCruiseControlPartitionMetricSamples, patternType=LITERAL):
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=READ, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=WRITE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE, permissionType=ALLOW)

Current ACLs for resource ResourcePattern(resourceType=CLUSTER, name=kafka-cluster, patternType=LITERAL):
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=CREATE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=ALTER, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=IDEMPOTENT_WRITE, permissionType=ALLOW)

Current ACLs for resource ResourcePattern(resourceType=GROUP, name=cruisecontrol., patternType=PREFIXED):
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=READ, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DELETE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE, permissionType=ALLOW)

Current ACLs for resource ResourcePattern(resourceType=TOPIC, name=__CruiseControlMetrics, patternType=LITERAL):
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=WRITE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=READ, permissionType=ALLOW)

Current ACLs for resource ResourcePattern(resourceType=TOPIC, name=__KafkaCruiseControlModelTrainingSamples, patternType=LITERAL):
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=READ, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=WRITE, permissionType=ALLOW)

Current ACLs for resource ResourcePattern(resourceType=TRANSACTIONAL_ID, name=cruisecontrol, patternType=PREFIXED):
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=DESCRIBE, permissionType=ALLOW)
(principal=User:CN=cruise-control.client.de.kpn.org, host=*, operation=WRITE, permissionType=ALLOW)

Errors include:

[2023-08-30 13:22:12,949] WARN Broker 1 is missing 5/13 topics metrics and 40/48 leader partition metrics. Missing leader topics: [__KafkaCruiseControlPartitionMetricSamples, __amazon_msk_canary_state, __amazon_msk_canary, __KafkaCruiseControlModelTrainingSamples, __consumer_offsets]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)
[2023-08-30 13:22:12,953] WARN Broker 2 is missing 5/13 topics metrics and 41/49 leader partition metrics. Missing leader topics: [__KafkaCruiseControlPartitionMetricSamples, __amazon_msk_canary_state, __amazon_msk_canary, __KafkaCruiseControlModelTrainingSamples, __consumer_offsets]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)
[2023-08-30 13:22:12,954] WARN Broker 3 is missing 5/13 topics metrics and 39/47 leader partition metrics. Missing leader topics: [__amazon_msk_canary_state, __KafkaCruiseControlPartitionMetricSamples, __amazon_msk_canary, __KafkaCruiseControlModelTrainingSamples, __consumer_offsets]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)

[2023-08-30 13:22:12,956] WARN Skip generating metric sample for broker 2 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
[2023-08-30 13:22:12,957] WARN Skip generating metric sample for broker 3 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
[2023-08-30 13:22:12,957] WARN Skip generating metric sample for broker 1 because the following required metrics are missing [BROKER_CPU_UTIL]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)

Perhaps I need the most granular metrics reporting on MSK, rather than the middle option?

Realised my Prometheus isn't scraping from the node exporter, so I will try that.

Adding the 11002 node exporter port to my Prometheus, as well as no longer dropping the metrics that start with underscores (antique code I had), got rid of the BROKER_CPU_UTIL error, as well as the missing 5/13 topics metrics error.

Everything works now!

UdayaPriyaKannan commented 6 months ago

@reidmeyer I'm also struggling to get Cruise Control working with AWS MSK. I'm getting the same NotEnoughValidWindowsException. Can you please share your config file so that I can see if I missed something?

reidmeyer commented 6 months ago

Hi @UdayaPriyaKannan,

I'm using the https://github.com/linkedin/cruise-control/blob/main/config/cruisecontrol.properties as a guide.

I'm using Prometheus as my metrics endpoint, and I'm filling it with data from both the 11001 and 11002 Kafka ports.
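In case it helps others, a sketch of what scraping both ports can look like: with MSK open monitoring enabled, the JMX exporter is on port 11001 and the node exporter on 11002 of each broker. The broker hostnames below are placeholders, not from this thread:

```yaml
# Sketch: scrape jobs for both MSK open-monitoring endpoints.
# Hostnames are placeholders for the actual MSK broker DNS names.
scrape_configs:
  - job_name: msk-jmx            # broker/topic metrics (port 11001)
    scrape_interval: 15s
    static_configs:
      - targets:
          - b-1.mycluster.example.amazonaws.com:11001
          - b-2.mycluster.example.amazonaws.com:11001
          - b-3.mycluster.example.amazonaws.com:11001
  - job_name: msk-node           # host CPU metrics (port 11002),
    scrape_interval: 15s         # needed for BROKER_CPU_UTIL
    static_configs:
      - targets:
          - b-1.mycluster.example.amazonaws.com:11002
          - b-2.mycluster.example.amazonaws.com:11002
          - b-3.mycluster.example.amazonaws.com:11002
```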

this is my config:

#
# Copyright 2017 LinkedIn Corp. Licensed under the BSD 2-Clause License (the "License"). See License in the project root for license information.
#

# This is an example property file for Kafka Cruise Control. See com.linkedin.kafka.cruisecontrol.config.constants for more details.

# Configuration for the metadata client.
# =======================================

# The Kafka cluster to control.
bootstrap.servers={{ .Values.cruisecontrolProperties.bootstrapServers }}

# # SSL properties, needed if cluster is using TLS encryption
security.protocol=SSL
ssl.keystore.location=/opt/cruise-control/certs/keystore.jks
ssl.keystore.password=confluent
# ssl.truststore.location=truststore.jks

kafka.broker.failure.detection.enable=true
# topic.config.provider.class=com.linkedin.kafka.cruisecontrol.config.KafkaAdminTopicConfigProvider

# The maximum interval in milliseconds between two metadata refreshes.
#metadata.max.age.ms=300000

# Client id for the Cruise Control. It is used for the metadata client.
#client.id=kafka-cruise-control

# The size of TCP send buffer bytes for the metadata client.
#send.buffer.bytes=131072

# The size of TCP receive buffer size for the metadata client.
#receive.buffer.bytes=131072

# The time to wait before disconnect an idle TCP connection.
#connections.max.idle.ms=540000

# The time to wait before reconnect to a given host.
#reconnect.backoff.ms=50

# The time to wait for a response from a host after sending a request.
#request.timeout.ms=30000

# The time to wait for broker logdir to respond after sending a request.
#logdir.response.timeout.ms=10000

# Configurations for the load monitor
# =======================================

# The metric sampler class
# metric.sampler.class=com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler
metric.sampler.class=com.linkedin.kafka.cruisecontrol.monitor.sampling.prometheus.PrometheusMetricSampler

# Prometheus Metric Sampler specific configuration
prometheus.server.endpoint={{ .Values.cruisecontrolProperties.prometheusEndpoint }}

# True if the sampling process allows CPU capacity estimation of brokers used for CPU utilization estimation.
sampling.allow.cpu.capacity.estimation=true

# Configurations for CruiseControlMetricsReporterSampler
metric.reporter.topic=__CruiseControlMetrics

# The sample store class name
sample.store.class=com.linkedin.kafka.cruisecontrol.monitor.sampling.KafkaSampleStore

# The config for the Kafka sample store to save the partition metric samples
partition.metric.sample.store.topic=__KafkaCruiseControlPartitionMetricSamples

# The config for the Kafka sample store to save the model training samples
broker.metric.sample.store.topic=__KafkaCruiseControlModelTrainingSamples

# The replication factor of Kafka metric sample store topic
sample.store.topic.replication.factor=2

# The config for the number of Kafka sample store consumer threads
num.sample.loading.threads=8

# The partition assignor class for the metric samplers
metric.sampler.partition.assignor.class=com.linkedin.kafka.cruisecontrol.monitor.sampling.DefaultMetricSamplerPartitionAssignor

# The metric sampling interval in milliseconds
metric.sampling.interval.ms=120000

# The partition metrics window size in milliseconds
partition.metrics.window.ms=300000

# The number of partition metric windows to keep in memory. Partition-load-history = num.partition.metrics.windows * partition.metrics.window.ms
num.partition.metrics.windows=5

# The minimum partition metric samples required for a partition in each window
min.samples.per.partition.metrics.window=1

# The broker metrics window size in milliseconds
broker.metrics.window.ms=300000

# The number of broker metric windows to keep in memory. Broker-load-history = num.broker.metrics.windows * broker.metrics.window.ms
num.broker.metrics.windows=20

# The minimum broker metric samples required for a partition in each window
min.samples.per.broker.metrics.window=1

# The configuration for the BrokerCapacityConfigFileResolver (supports JBOD, non-JBOD, and heterogeneous CPU core capacities)
# capacity.config.file=config/capacityJBOD.json

# Change the capacity config file and specify its path; details below
capacity.config.file=/opt/cruise-control/config/capacityCores.json

# Configurations for the analyzer
# =======================================

# The list of goals to optimize the Kafka cluster for with pre-computed proposals -- consider using RackAwareDistributionGoal instead of RackAwareGoal in clusters with partitions whose replication factor > number of racks. The value must be a subset of the "goals" and a superset of the "hard.goals" and "self.healing.goals".
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal

# The list of supported goals
goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.BrokerSetAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.kafkaassigner.KafkaAssignerDiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.kafkaassigner.KafkaAssignerEvenRackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal

# The list of supported intra-broker goals
intra.broker.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.IntraBrokerDiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.IntraBrokerDiskUsageDistributionGoal

# The list of supported hard goals -- consider using RackAwareDistributionGoal instead of RackAwareGoal in clusters with partitions whose replication factor > number of racks
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

# The minimum percentage of well monitored partitions out of all the partitions
min.valid.partition.ratio=0.95

# The balance threshold for CPU
cpu.balance.threshold=1.1

# The balance threshold for disk
disk.balance.threshold=1.1

# The balance threshold for network inbound utilization
network.inbound.balance.threshold=1.1

# The balance threshold for network outbound utilization
network.outbound.balance.threshold=1.1

# The balance threshold for the replica count
replica.count.balance.threshold=1.1

# The capacity threshold for CPU in percentage
cpu.capacity.threshold=0.7

# The capacity threshold for disk in percentage
disk.capacity.threshold=0.8

# The capacity threshold for network inbound utilization in percentage
network.inbound.capacity.threshold=0.8

# The capacity threshold for network outbound utilization in percentage
network.outbound.capacity.threshold=0.8

# The threshold to define the cluster to be in a low CPU utilization state
cpu.low.utilization.threshold=0.0

# The threshold to define the cluster to be in a low disk utilization state
disk.low.utilization.threshold=0.0

# The threshold to define the cluster to be in a low network inbound utilization state
network.inbound.low.utilization.threshold=0.0

# The threshold to define the cluster to be in a low network outbound utilization state
network.outbound.low.utilization.threshold=0.0

# The metric anomaly percentile upper threshold
metric.anomaly.percentile.upper.threshold=90.0

# The metric anomaly percentile lower threshold
metric.anomaly.percentile.lower.threshold=10.0

# How often should the cached proposal be expired and recalculated if necessary
proposal.expiration.ms=60000

# The maximum number of replicas that can reside on a broker at any given time.
max.replicas.per.broker=10000

# The number of threads to use for proposal candidate precomputing.
num.proposal.precompute.threads=1

# the topics that should be excluded from the partition movement.
#topics.excluded.from.partition.movement

# The impact of having one level higher goal priority on the relative balancedness score.
#goal.balancedness.priority.weight

# The impact of strictness on the relative balancedness score.
#goal.balancedness.strictness.weight

# Configurations for the executor
# =======================================

# If true, appropriate zookeeper Client { .. } entry required in jaas file located at $base_dir/config/cruise_control_jaas.conf
# zookeeper.security.enabled=false

# The max number of partitions to move in/out on a given broker at a given time.
num.concurrent.partition.movements.per.broker=10

# The upper bound of partitions to move in cluster at a given time
max.num.cluster.partition.movements=1250

# The max number of partitions to move between disks within a given broker at a given time.
num.concurrent.intra.broker.partition.movements=2

# The max number of leadership movement within the whole cluster at a given time.
num.concurrent.leader.movements=1000

# Default replica movement throttle. If not specified, movements unthrottled by default.
# default.replication.throttle=

# The interval between two execution progress checks.
execution.progress.check.interval.ms=10000

# Configurations for anomaly detector
# =======================================

# The goal violation notifier class
anomaly.notifier.class=com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier

# The metric anomaly finder class
metric.anomaly.finder.class=com.linkedin.kafka.cruisecontrol.detector.KafkaMetricAnomalyFinder

# The anomaly detection interval
#anomaly.detection.interval.ms=10000

# The goal violation to detect -- consider using RackAwareDistributionGoal instead of RackAwareGoal in clusters with partitions whose replication factor > number of racks
anomaly.detection.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

# The interested metrics for metric anomaly analyzer.
metric.anomaly.analyzer.metrics=BROKER_PRODUCE_LOCAL_TIME_MS_50TH,BROKER_PRODUCE_LOCAL_TIME_MS_999TH,BROKER_CONSUMER_FETCH_LOCAL_TIME_MS_50TH,BROKER_CONSUMER_FETCH_LOCAL_TIME_MS_999TH,BROKER_FOLLOWER_FETCH_LOCAL_TIME_MS_50TH,BROKER_FOLLOWER_FETCH_LOCAL_TIME_MS_999TH,BROKER_LOG_FLUSH_TIME_MS_50TH,BROKER_LOG_FLUSH_TIME_MS_999TH

# True if recently demoted brokers are excluded from optimizations during self healing, false otherwise
self.healing.exclude.recently.demoted.brokers=true

# True if recently removed brokers are excluded from optimizations during self healing, false otherwise
self.healing.exclude.recently.removed.brokers=true

# The zk path to store failed broker information.
failed.brokers.zk.path=/CruiseControlBrokerList

# Topic config provider class
topic.config.provider.class=com.linkedin.kafka.cruisecontrol.config.KafkaAdminTopicConfigProvider

# The cluster configurations for the TopicConfigProvider
cluster.configs.file=config/clusterConfigs.json

# The maximum time in milliseconds to store the response and access details of a completed kafka monitoring user task.
completed.kafka.monitor.user.task.retention.time.ms=86400000

# The maximum time in milliseconds to store the response and access details of a completed cruise control monitoring user task.
completed.cruise.control.monitor.user.task.retention.time.ms=86400000

# The maximum time in milliseconds to store the response and access details of a completed kafka admin user task.
completed.kafka.admin.user.task.retention.time.ms=604800000

# The maximum time in milliseconds to store the response and access details of a completed cruise control admin user task.
completed.cruise.control.admin.user.task.retention.time.ms=604800000

# The fallback maximum time in milliseconds to store the response and access details of a completed user task.
completed.user.task.retention.time.ms=86400000

# The maximum time in milliseconds to retain the demotion history of brokers.
demotion.history.retention.time.ms=1209600000

# The maximum time in milliseconds to retain the removal history of brokers.
removal.history.retention.time.ms=1209600000

# The maximum number of completed kafka monitoring user tasks for which the response and access details will be cached.
max.cached.completed.kafka.monitor.user.tasks=20

# The maximum number of completed cruise control monitoring user tasks for which the response and access details will be cached.
max.cached.completed.cruise.control.monitor.user.tasks=20

# The maximum number of completed kafka admin user tasks for which the response and access details will be cached.
max.cached.completed.kafka.admin.user.tasks=30

# The maximum number of completed cruise control admin user tasks for which the response and access details will be cached.
max.cached.completed.cruise.control.admin.user.tasks=30

# The fallback maximum number of completed user tasks of certain type for which the response and access details will be cached.
max.cached.completed.user.tasks=25

# The maximum number of user tasks concurrently running in async endpoints across all users.
max.active.user.tasks=5

# Enable self healing for all anomaly detectors, unless the particular anomaly detector is explicitly disabled
self.healing.enabled=false

# Enable self healing for broker failure detector
#self.healing.broker.failure.enabled=true

# Enable self healing for goal violation detector
#self.healing.goal.violation.enabled=true

# Enable self healing for metric anomaly detector
#self.healing.metric.anomaly.enabled=true

# Enable self healing for disk failure detector
#self.healing.disk.failure.enabled=true

# Enable self healing for topic anomaly detector
#self.healing.topic.anomaly.enabled=true
#topic.anomaly.finder.class=com.linkedin.kafka.cruisecontrol.detector.TopicReplicationFactorAnomalyFinder

# Enable self healing for maintenance event detector
#self.healing.maintenance.event.enabled=true

# The multiplier applied to the threshold of distribution goals used by goal.violation.detector.
#goal.violation.distribution.threshold.multiplier=2.50

# configurations for the webserver
# ================================

# HTTP listen port
# webserver.http.port=8090

# HTTP listen address
webserver.http.address=0.0.0.0

# Whether CORS support is enabled for API or not
webserver.http.cors.enabled=false

# Value for Access-Control-Allow-Origin
webserver.http.cors.origin=http://localhost:8080/

# Value for Access-Control-Request-Method
webserver.http.cors.allowmethods=OPTIONS,GET,POST

# Headers that should be exposed to the Browser (Webapp)
# This is a special header that is used by the
# User Tasks subsystem and should be explicitly
# Enabled when CORS mode is used as part of the
# Admin Interface
webserver.http.cors.exposeheaders=User-Task-ID

# REST API default prefix (don't forget the ending /*)
webserver.api.urlprefix=/kafkacruisecontrol/*

# Location where the Cruise Control frontend is deployed
webserver.ui.diskpath=./cruise-control-ui/dist/

# URL path prefix for UI (don't forget the ending /*)
webserver.ui.urlprefix=/*

# Time After which request is converted to Async
webserver.request.maxBlockTimeMs=10000

# Default Session Expiry Period
webserver.session.maxExpiryTimeMs=60000

# Session cookie path
webserver.session.path=/

# Server Access Logs
webserver.accesslog.enabled=true

# Security
webserver.security.enable=true
webserver.auth.credentials.file=/opt/cruise-control/creds/roles.credentials
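
# With basic security enabled, the credentials file above follows Jetty's
# HashLoginService format: <username>: <password>,<role>, where the role is
# one of VIEWER, USER, or ADMIN. A sketch with placeholder usernames and
# passwords (replace with your own):
#
# ccViewer: viewerPassword,VIEWER
# ccUser: userPassword,USER
# ccAdmin: adminPassword,ADMIN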

# Configurations for servlet
# ==========================

# Enable two-step verification for processing POST requests.
two.step.verification.enabled=false

# The maximum time in milliseconds to retain the requests in two-step (verification) purgatory.
two.step.purgatory.retention.time.ms=1209600000

# The maximum number of requests in two-step (verification) purgatory.
two.step.purgatory.max.requests=25

# Enable Vertx API with Swagger
vertx.enabled=false

# Copyright 2017 LinkedIn Corp. Licensed under the BSD 2-Clause License (the "License"). See License in the project root for license information.
#

rootLogger.level=INFO
appenders=console, kafkaCruiseControlAppender, operationAppender, requestAppender

property.filename=./logs

appender.console.type=Console
appender.console.name=STDOUT
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=[%d] %p %m (%c)%n

appender.kafkaCruiseControlAppender.type=RollingFile
appender.kafkaCruiseControlAppender.name=kafkaCruiseControlFile
appender.kafkaCruiseControlAppender.fileName=${filename}/kafkacruisecontrol.log
appender.kafkaCruiseControlAppender.filePattern=${filename}/kafkacruisecontrol.log.%d{yyyy-MM-dd-HH}
appender.kafkaCruiseControlAppender.layout.type=PatternLayout
appender.kafkaCruiseControlAppender.layout.pattern=[%d] %p %m (%c)%n
appender.kafkaCruiseControlAppender.policies.type=Policies
appender.kafkaCruiseControlAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.kafkaCruiseControlAppender.policies.time.interval=1

appender.operationAppender.type=RollingFile
appender.operationAppender.name=operationFile
appender.operationAppender.fileName=${filename}/kafkacruisecontrol-operation.log
appender.operationAppender.filePattern=${filename}/kafkacruisecontrol-operation.log.%d{yyyy-MM-dd}
appender.operationAppender.layout.type=PatternLayout
appender.operationAppender.layout.pattern=[%d] %p [%c] %m %n
appender.operationAppender.policies.type=Policies
appender.operationAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.operationAppender.policies.time.interval=1

appender.requestAppender.type=RollingFile
appender.requestAppender.name=requestFile
appender.requestAppender.fileName=${filename}/kafkacruisecontrol-request.log
appender.requestAppender.filePattern=${filename}/kafkacruisecontrol-request.log.%d{yyyy-MM-dd-HH}
appender.requestAppender.layout.type=PatternLayout
appender.requestAppender.layout.pattern=[%d] %p %m (%c)%n
appender.requestAppender.policies.type=Policies
appender.requestAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.requestAppender.policies.time.interval=1

# Loggers
logger.cruisecontrol.name=com.linkedin.kafka.cruisecontrol
logger.cruisecontrol.level=warn
logger.cruisecontrol.appenderRef.kafkaCruiseControlAppender.ref=kafkaCruiseControlFile

logger.detector.name=com.linkedin.kafka.cruisecontrol.detector
logger.detector.level=info
logger.detector.appenderRef.kafkaCruiseControlAppender.ref=kafkaCruiseControlFile

logger.operationLogger.name=operationLogger
logger.operationLogger.level=info
logger.operationLogger.appenderRef.operationAppender.ref=operationFile

logger.CruiseControlPublicAccessLogger.name=CruiseControlPublicAccessLogger
logger.CruiseControlPublicAccessLogger.level=info
logger.CruiseControlPublicAccessLogger.appenderRef.requestAppender.ref=requestFile

rootLogger.appenderRefs=console, kafkaCruiseControlAppender
rootLogger.appenderRef.console.ref=STDOUT
rootLogger.appenderRef.kafkaCruiseControlAppender.ref=kafkaCruiseControlFile

In the end, not everything works, but the broker rebalancing does, which is what I got Cruise Control for in the first place.

Make sure you're exporting both your node and application metrics to Prometheus, and make sure the ACLs on your Kafka cluster allow Cruise Control to do what it needs to do. Good luck.
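
For the Prometheus side, that advice can be sketched as a minimal scrape config. The broker hostnames below are placeholders, and the ports assume MSK open monitoring defaults (11001 for the JMX exporter, 11002 for the node exporter); verify both against your own cluster:

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: kafka-jmx
    static_configs:
      - targets:
          # MSK open monitoring exposes broker (JMX) metrics on port 11001
          - b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:11001
          - b-2.mycluster.abc123.kafka.us-east-1.amazonaws.com:11001
  - job_name: kafka-node
    static_configs:
      - targets:
          # Node (host-level) metrics on port 11002
          - b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:11002
          - b-2.mycluster.abc123.kafka.us-east-1.amazonaws.com:11002
```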