linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.74k stars, 587 forks

Unable to configure and run cruisecontrol perfectly. #1607

Open rahu7624 opened 3 years ago

rahu7624 commented 3 years ago

Hi Team,

We are getting the errors below when checking the cruisecontrol status. Can you please check and suggest a fix?

[root@kafka-0 ~]# systemctl status cruisecontrol -l
● cruisecontrol.service - Zookeeper
   Loaded: loaded (/etc/systemd/system/cruisecontrol.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-07-01 14:08:30 UTC; 3min 59s ago
 Main PID: 29352 (cc.sh)
   CGroup: /system.slice/cruisecontrol.service
           ├─29352 /bin/bash /usr/local/bin/cc.sh
           └─29354 java -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=./logs -Dlog4j.configurationFile=file:./config/log4j.properties -cp ./cruise-control/build/dependant-libs/:./cruise-control/build/libs/:./cruise-control-metrics-reporter/build/libs/* com.linkedin.kafka.cruisecontrol.KafkaCruiseControlMain config/cruisecontrol.properties

Jul 01 14:11:41 kafka-0 cc.sh[29352]: [2021-07-01 14:11:41,861] WARN Skipping goal violation detection for ReplicaCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
Jul 01 14:11:41 kafka-0 cc.sh[29352]: [2021-07-01 14:11:41,861] WARN Skipping goal violation detection for DiskCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
Jul 01 14:11:41 kafka-0 cc.sh[29352]: [2021-07-01 14:11:41,861] WARN Skipping goal violation detection for NetworkInboundCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
Jul 01 14:11:41 kafka-0 cc.sh[29352]: [2021-07-01 14:11:41,862] WARN Skipping goal violation detection for NetworkOutboundCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
Jul 01 14:11:41 kafka-0 cc.sh[29352]: [2021-07-01 14:11:41,862] WARN Skipping goal violation detection for CpuCapacityGoal because load completeness requirement is not met. (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
Jul 01 14:11:45 kafka-0 cc.sh[29352]: [2021-07-01 14:11:45,398] INFO Start to detect topic replication factor anomaly. (com.linkedin.kafka.cruisecontrol.detector.TopicAnomalyFinder)
Jul 01 14:11:45 kafka-0 cc.sh[29352]: [2021-07-01 14:11:45,399] WARN TOPIC_ANOMALY detected {Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}. Self healing start time 2021-07-01T14:11:45Z. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
Jul 01 14:11:45 kafka-0 cc.sh[29352]: [2021-07-01 14:11:45,400] WARN Self-healing has been triggered. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
Jul 01 14:11:45 kafka-0 cc.sh[29352]: [2021-07-01 14:11:45,472] WARN Skipping TOPIC_ANOMALY fix because load completeness requirement is not met for goals. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManager)
Jul 01 14:12:11 kafka-0 cc.sh[29352]: [2021-07-01 14:12:11,598] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[root@kafka-0 ~]#

[root@kafka-0 kafka]# curl 'http://localhost:9090/kafkacruisecontrol/state'
MonitorState: {state: RUNNING(0.000% trained), NumValidWindows: (0/0) (NaN%), NumValidPartitions: 0/0 (0.000%), flawedPartitions: 0}
ExecutorState: {state: NO_TASK_IN_PROGRESS}
AnalyzerState: {isProposalReady: false, readyGoals: []}
AnomalyDetectorState: {selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, METRIC_ANOMALY, GOAL_VIOLATION, TOPIC_ANOMALY, MAINTENANCE_EVENT], selfHealingDisabled:[], selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, METRIC_ANOMALY=1.0, GOAL_VIOLATION=1.0, TOPIC_ANOMALY=1.0, MAINTENANCE_EVENT=1.0}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], recentTopicAnomalies:[
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=c3044efe-1176-461e-bd21-9b16418bc815, detectionDate=2021-07-01T14:11:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-01T14:11:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=20958eec-b7fa-4fc4-8c6a-38f000a20b09, detectionDate=2021-07-01T14:09:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-01T14:09:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=a61584a9-0d44-472c-b2b1-b8740a3c6ced, detectionDate=2021-07-01T14:13:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-01T14:13:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=2e5612d6-0c3c-4e38-a478-ca06b7eeb265, detectionDate=2021-07-01T14:15:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-01T14:15:45Z}
], recentMaintenanceEvents:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:0.00 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds, DISK_FAILURE:0.00 milliseconds, TOPIC_ANOMALY:8.88 milliseconds}, meanTimeToStartFix:0.00 milliseconds, numSelfHealingStarted:0, numSelfHealingFailedToStart:0, ongoingAnomalyDuration=6.31 minutes}, ongoingSelfHealingAnomaly:None, balancednessScore:100.000}

[root@kafka-0 kafka]#

efeg commented 3 years ago

COMPLETENESS_NOT_READY means that Cruise Control (CC) was unable to collect sufficient samples from Kafka to generate the cluster model on which it operates to perform goal-based cluster maintenance. This could be due to one of two causes:

(1) You have just started CC, so it has not had time to collect samples yet. Give it some time and check whether the CC logs show that it was able to collect samples and that a new window is rolled.

(2) There is a problem collecting samples from Kafka. Can you verify that you configured the metrics reporter correctly on the Kafka side? Did you follow the quick-start tutorial on the CC GitHub page to set up the metrics reporter? Does your metrics reporter topic get any data from Kafka?
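To tell the two cases apart from the `/state` output alone, a rough heuristic is to look at the MonitorState counters. The sketch below is only an illustration: the function name and the three labels it returns are made up for this example and are not part of Cruise Control, and it assumes the text format shown in this thread.

```python
import re

# Matches the counters as they appear in the /state output in this thread, e.g.
# "NumValidWindows: (0/1) (0.000%), NumValidPartitions: 79/118 (66.949%)".
MONITOR_RE = re.compile(
    r"NumValidWindows:\s*\((\d+)/(\d+)\).*?NumValidPartitions:\s*(\d+)/(\d+)"
)


def summarize_monitor_state(state_text: str) -> str:
    """Return a hypothetical label describing how far sample collection has got."""
    match = MONITOR_RE.search(state_text)
    if match is None:
        return "unrecognized"
    valid_windows, _total_windows, _valid_parts, total_parts = map(int, match.groups())
    if total_parts == 0:
        # Nothing has been sampled at all: check the metrics reporter setup.
        return "no-samples"
    if valid_windows == 0:
        # Samples are arriving, but no window is valid yet.
        return "collecting"
    return "ready"


if __name__ == "__main__":
    first = ("MonitorState: {state: RUNNING(0.000% trained), NumValidWindows: (0/0) (NaN%), "
             "NumValidPartitions: 0/0 (0.000%), flawedPartitions: 0}")
    print(summarize_monitor_state(first))
```

A result of "no-samples" that persists for hours points at case (2) rather than case (1).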

rahu7624 commented 3 years ago

Hi Adem,

Thanks for looking into it. It's a test setup with 3 nodes, just one test topic, and currently no data flowing in or out. I simply followed the quick-start tutorial and configured all 3 nodes the same way. Please see the Kafka-side configs below and let us know if any changes are required.

[rahul@kafka-0 ~]$ cat /usr/local/share/kafka/config/server.properties | grep -i cruise
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
cruise.control.metrics.topic.auto.create=true
cruise.control.metrics.topic.num.partitions=1
cruise.control.metrics.topic.replication.factor=1

Thanks in advance.

rahu7624 commented 3 years ago

However, the situation is still the same even after 18 hours.

[rahul@kafka-0 ~]$ curl -X GET "http://localhost:9090/kafkacruisecontrol/state"
MonitorState: {state: RUNNING(0.000% trained), NumValidWindows: (0/0) (NaN%), NumValidPartitions: 0/0 (0.000%), flawedPartitions: 0}
ExecutorState: {state: NO_TASK_IN_PROGRESS}
AnalyzerState: {isProposalReady: false, readyGoals: []}
AnomalyDetectorState: {selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, METRIC_ANOMALY, GOAL_VIOLATION, TOPIC_ANOMALY, MAINTENANCE_EVENT], selfHealingDisabled:[], selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, METRIC_ANOMALY=1.0, GOAL_VIOLATION=1.0, TOPIC_ANOMALY=1.0, MAINTENANCE_EVENT=1.0}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], recentTopicAnomalies:[
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=b5852ac0-9ce9-4721-81bd-a6d89df6e7f5, detectionDate=2021-07-02T08:19:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:19:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=297d6fdd-ee77-4375-b787-f3e8fa39996b, detectionDate=2021-07-02T08:23:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:23:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=b6d70146-0c31-45a5-9d7e-f8f4f9c1c4a1, detectionDate=2021-07-02T08:27:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:27:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=f7344f75-1ccb-4215-8ae7-e0ca9347f2da, detectionDate=2021-07-02T08:35:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:35:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=7d61a3b9-e056-47bc-89a5-7a69fb4e414a, detectionDate=2021-07-02T08:21:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:21:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=d53e99a2-102c-452d-b3fd-c13741a4241c, detectionDate=2021-07-02T08:31:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:31:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=517c91d0-9c10-4c8b-9666-ea95c2ebb490, detectionDate=2021-07-02T08:25:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:25:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=33e18a5e-d05f-46ee-b63e-51dfa3ba44e3, detectionDate=2021-07-02T08:29:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:29:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=0fb9a33e-95c1-442f-8aa3-b8eff0814315, detectionDate=2021-07-02T08:37:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:37:45Z},
  {description={Topics with replication factor violations: [{With desired RF 2: [{test(100.00)}]}]}, anomalyId=980cd90d-0f36-4463-988a-f6da8c68df09, detectionDate=2021-07-02T08:33:45Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-02T08:33:45Z}
], recentMaintenanceEvents:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:0.00 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds, DISK_FAILURE:0.00 milliseconds, TOPIC_ANOMALY:8.33 milliseconds}, meanTimeToStartFix:0.00 milliseconds, numSelfHealingStarted:0, numSelfHealingFailedToStart:0, ongoingAnomalyDuration=18.49 hours}, ongoingSelfHealingAnomaly:None, balancednessScore:100.000}

[rahul@kafka-0 ~]$

efeg commented 3 years ago

@rahu7624 Do you see any data going into the __CruiseControlMetrics topic -- i.e. does it grow in size? If not, this is an issue with the Kafka-side configs. Here is a checklist that might help:

rahu7624 commented 3 years ago

I tried reconfiguring it the way you advised. It seems to have started collecting some metrics, but it is still giving some errors.

[root@kafka-2 kafka]# systemctl status cruisecontrol -l
● cruisecontrol.service - Zookeeper
   Loaded: loaded (/etc/systemd/system/cruisecontrol.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2021-07-05 10:46:46 UTC; 44min ago
 Main PID: 13241 (cc.sh)
   CGroup: /system.slice/cruisecontrol.service
           ├─13241 /bin/bash /usr/local/bin/cc.sh
           └─13243 java -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=./logs -Dlog4j.configurationFile=file:./config/log4j.properties -cp ./cruise-control/build/dependant-libs/:./cruise-control/build/libs/:./cruise-control-metrics-reporter/build/libs/* com.linkedin.kafka.cruisecontrol.KafkaCruiseControlMain config/cruisecontrol.properties

Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,248] INFO Finished sampling from topic __CruiseControlMetrics for partitions [0] in time range [1625484537241,1625484657241]. Collected 526 metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler)
Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,248] WARN Broker 2 is missing 4/4 topics metrics and 39/39 leader partition metrics. Missing leader topics: [__KafkaCruiseControlPartitionMetricSamples, test, __KafkaCruiseControlModelTrainingSamples, __consumer_offsets]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)
Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,248] WARN Skip generating metric sample for broker 2 because the following required metrics are missing [BROKER_PRODUCE_LOCAL_TIME_MS_MAX, BROKER_PRODUCE_REQUEST_QUEUE_TIME_MS_MEAN, BROKER_FOLLOWER_FETCH_LOCAL_TIME_MS_MEAN, ALL_TOPIC_PRODUCE_REQUEST_RATE, ALL_TOPIC_MESSAGES_IN_PER_SEC, BROKER_PRODUCE_TOTAL_TIME_MS_MEAN, ALL_TOPIC_FETCH_REQUEST_RATE, BROKER_FOLLOWER_FETCH_REQUEST_RATE, ALL_TOPIC_REPLICATION_BYTES_OUT, BROKER_PRODUCE_TOTAL_TIME_MS_MAX, ALL_TOPIC_REPLICATION_BYTES_IN, BROKER_CONSUMER_FETCH_REQUEST_QUEUE_TIME_MS_MAX, BROKER_FOLLOWER_FETCH_REQUEST_QUEUE_TIME_MS_MAX, BROKER_CONSUMER_FETCH_LOCAL_TIME_MS_MAX, ALL_TOPIC_BYTES_IN, BROKER_FOLLOWER_FETCH_TOTAL_TIME_MS_MAX, BROKER_CONSUMER_FETCH_REQUEST_QUEUE_TIME_MS_MEAN, BROKER_PRODUCE_REQUEST_QUEUE_TIME_MS_MAX, BROKER_FOLLOWER_FETCH_TOTAL_TIME_MS_MEAN, BROKER_CONSUMER_FETCH_LOCAL_TIME_MS_MEAN, ALL_TOPIC_BYTES_OUT, BROKER_CONSUMER_FETCH_TOTAL_TIME_MS_MEAN, BROKER_REQUEST_QUEUE_SIZE, BROKER_CONSUMER_FETCH_TOTAL_TIME_MS_MAX, BROKER_RESPONSE_QUEUE_SIZE, BROKER_PRODUCE_LOCAL_TIME_MS_MEAN, BROKER_REQUEST_HANDLER_AVG_IDLE_PERCENT, BROKER_FOLLOWER_FETCH_LOCAL_TIME_MS_MAX, BROKER_FOLLOWER_FETCH_REQUEST_QUEUE_TIME_MS_MEAN]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingUtils)
Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,249] INFO Generated 79(39 skipped by broker {2=39}) partition metric samples and 2(1 skipped) broker metric samples for timestamp 1625484656792. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,249] INFO PARTITION Aggregator rolled out 1 new windows, reset 1 windows, current window range [1625484600000, 1625484900000], abandon 237 samples. (com.linkedin.cruisecontrol.monitor.sampling.aggregator.MetricSampleAggregator)
Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,249] INFO Collected 79 partition metric samples for 79 partitions. Total partition assigned: 118. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,249] INFO BROKER Aggregator rolled out 1 new windows, reset 1 windows, current window range [1625478900000, 1625484900000], abandon 0 samples. (com.linkedin.cruisecontrol.monitor.sampling.aggregator.MetricSampleAggregator)
Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,255] INFO Collected 2 broker metric samples for 2 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
Jul 05 11:30:57 kafka-2 cc.sh[13241]: [2021-07-05 11:30:57,267] INFO Finished sampling in 26 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
Jul 05 11:30:58 kafka-2 cc.sh[13241]: [2021-07-05 11:30:58,408] INFO Skipping proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[root@kafka-2 kafka]#

rahu7624 commented 3 years ago

It also shows an RF anomaly for the Cruise Control topics.

[root@kafka-2 kafka]# curl 'http://localhost:9090/kafkacruisecontrol/state'
MonitorState: {state: RUNNING(11.600% trained), NumValidWindows: (0/1) (0.000%), NumValidPartitions: 79/118 (66.949%), flawedPartitions: 0}
ExecutorState: {state: NO_TASK_IN_PROGRESS}
AnalyzerState: {isProposalReady: false, readyGoals: [ReplicaDistributionGoal, RackAwareGoal, TopicReplicaDistributionGoal, LeaderReplicaDistributionGoal, ReplicaCapacityGoal]}
AnomalyDetectorState: {selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT], selfHealingDisabled:[], selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, GOAL_VIOLATION=1.0, METRIC_ANOMALY=1.0, TOPIC_ANOMALY=1.0, MAINTENANCE_EVENT=1.0}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], recentTopicAnomalies:[
  {description={Topics with replication factor violations: [{With desired RF 3: [{__KafkaCruiseControlModelTrainingSamples(100.00)}, {__CruiseControlMetrics(100.00)}, {__consumer_offsets(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}]}]}, anomalyId=e8ee6abe-cfb5-42c7-9daa-1e9293e49692, detectionDate=2021-07-05T11:20:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:20:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__KafkaCruiseControlModelTrainingSamples(100.00)}, {__consumer_offsets(100.00)}, {__CruiseControlMetrics(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}]}]}, anomalyId=cb615220-3a97-4670-a320-5a7e66612879, detectionDate=2021-07-05T11:16:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:16:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__CruiseControlMetrics(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}, {__KafkaCruiseControlModelTrainingSamples(100.00)}, {__consumer_offsets(100.00)}]}]}, anomalyId=35dc9570-8750-4baa-a2f3-4c2c641b51e0, detectionDate=2021-07-05T11:32:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:32:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__CruiseControlMetrics(100.00)}, {__KafkaCruiseControlModelTrainingSamples(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}, {__consumer_offsets(100.00)}]}]}, anomalyId=3320092c-2c2b-471e-949f-f7137b580de4, detectionDate=2021-07-05T11:28:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:28:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__CruiseControlMetrics(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}, {__KafkaCruiseControlModelTrainingSamples(100.00)}, {__consumer_offsets(100.00)}]}]}, anomalyId=07ac8022-e5cf-4d1c-9a99-2f47cfa8b476, detectionDate=2021-07-05T11:30:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:30:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__consumer_offsets(100.00)}, {__KafkaCruiseControlModelTrainingSamples(100.00)}, {__CruiseControlMetrics(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}]}]}, anomalyId=28339b49-77db-4ed3-9ba2-31920954b398, detectionDate=2021-07-05T11:34:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:34:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__CruiseControlMetrics(100.00)}, {__KafkaCruiseControlModelTrainingSamples(100.00)}, {__consumer_offsets(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}]}]}, anomalyId=05bda5cc-693e-4072-a91b-05294bbb5e58, detectionDate=2021-07-05T11:22:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:22:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__KafkaCruiseControlModelTrainingSamples(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}, {__CruiseControlMetrics(100.00)}, {__consumer_offsets(100.00)}]}]}, anomalyId=334304d1-1181-4702-827e-ff37a91cd436, detectionDate=2021-07-05T11:18:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:18:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__consumer_offsets(100.00)}, {__CruiseControlMetrics(100.00)}, {__KafkaCruiseControlPartitionMetricSamples(100.00)}, {__KafkaCruiseControlModelTrainingSamples(100.00)}]}]}, anomalyId=1be80a1d-5124-494c-81e3-ed4c038991aa, detectionDate=2021-07-05T11:24:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:24:04Z},
  {description={Topics with replication factor violations: [{With desired RF 3: [{__KafkaCruiseControlPartitionMetricSamples(100.00)}, {__CruiseControlMetrics(100.00)}, {__KafkaCruiseControlModelTrainingSamples(100.00)}, {__consumer_offsets(100.00)}]}]}, anomalyId=b82cd436-d121-4bb1-ac12-1d907939c92a, detectionDate=2021-07-05T11:26:04Z, status=COMPLETENESS_NOT_READY, statusUpdateDate=2021-07-05T11:26:04Z}
], recentMaintenanceEvents:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:0.00 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds, DISK_FAILURE:0.00 milliseconds, TOPIC_ANOMALY:8.29 milliseconds}, meanTimeToStartFix:0.00 milliseconds, numSelfHealingStarted:0, numSelfHealingFailedToStart:0, ongoingAnomalyDuration=47.01 minutes}, ongoingSelfHealingAnomaly:None, balancednessScore:100.000}

[root@kafka-2 kafka]#

efeg commented 3 years ago

WARN Broker 2 is missing 4/4 topics metrics and 39/39 leader partition metrics. Missing leader topics: [__KafkaCruiseControlPartitionMetricSamples, test, __KafkaCruiseControlModelTrainingSamples, __consumer_offsets].

and then

INFO Generated 79(39 skipped by broker {2=39}) partition metric samples and 2(1 skipped) broker metric samples for timestamp 1625484656792.

implies that broker 2 was not configured properly. If broker 2 is configured later, then eventually CC will be able to collect samples from all brokers and will roll out a window -- i.e. MonitorState will show NumValidWindows: (1/1).
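To spot which brokers are being skipped without reading the log by eye, the "skipped by broker" part of that INFO line can be pulled out with a small script. This is only a sketch: the function name is made up for this example, and the multi-broker form `{1=5, 2=39}` is an assumption based on how a Java map is normally printed, since the thread only shows a single broker.

```python
import re
from typing import Dict

# Captures the map inside "skipped by broker {2=39}".
SKIPPED_RE = re.compile(r"skipped by broker \{([^}]*)\}")


def skipped_partitions_by_broker(log_line: str) -> Dict[int, int]:
    """Return {broker_id: skipped_partition_samples} parsed from one log line."""
    match = SKIPPED_RE.search(log_line)
    if match is None or not match.group(1).strip():
        return {}
    skipped: Dict[int, int] = {}
    for pair in match.group(1).split(","):
        # Assumed format per entry: "<brokerId>=<count>".
        broker_id, count = pair.strip().split("=")
        skipped[int(broker_id)] = int(count)
    return skipped


if __name__ == "__main__":
    line = ("INFO Generated 79(39 skipped by broker {2=39}) partition metric samples "
            "and 2(1 skipped) broker metric samples for timestamp 1625484656792.")
    print(skipped_partitions_by_broker(line))
```

Any broker id that keeps showing up in this map is one whose metrics reporter setup is worth rechecking.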

It also shows an RF anomaly for the Cruise Control topics.

This is independent of the issue we discussed above. It says that the "desired replication factor" config is set to 3, but the listed topics have an RF different from the desired RF. You can set the desired replication factor for a cluster using the self.healing.target.topic.replication.factor config.
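To read the desired RF and the affected topics out of an anomaly description like the ones in the `/state` output above, something like the following can help. It is a sketch under the assumption that the description keeps the text format shown in this thread; the function name is made up for this example, and the number in parentheses after each topic is simply ignored.

```python
import re
from typing import Dict, List

# One group per desired RF, e.g. "With desired RF 3: [{topicA(100.00)}, {topicB(100.00)}]".
GROUP_RE = re.compile(r"With desired RF (\d+): \[([^\]]*)\]")
# A topic entry looks like "{test(100.00)}"; capture the name before "(".
TOPIC_RE = re.compile(r"\{([^({}]+)\(")


def rf_violations(description: str) -> Dict[int, List[str]]:
    """Map each desired RF to the topics reported as violating it."""
    violations: Dict[int, List[str]] = {}
    for desired_rf, topics in GROUP_RE.findall(description):
        violations.setdefault(int(desired_rf), []).extend(TOPIC_RE.findall(topics))
    return violations


if __name__ == "__main__":
    description = ("Topics with replication factor violations: "
                   "[{With desired RF 2: [{test(100.00)}]}]")
    print(rf_violations(description))
```

Comparing the desired RF reported here with the actual RF of each listed topic shows whether the topics or the self.healing.target.topic.replication.factor value should change.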