linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Cruise Control getting stuck in Bootstrapping #1342

Closed dmarupov closed 3 years ago

dmarupov commented 4 years ago

Hello, I am trying to install Cruise Control to monitor our Kafka environment. I have a distributed Kafka environment with 3 VMs (Linux), and each VM runs one Kafka broker and one ZooKeeper node. In the cruisecontrol.properties file I have the following, where each my-domain-dev# host is a separate VM:

bootstrap.servers=my-domain-dev01.com:9093,my-domain-dev02.com:9094,my-domain-dev03.com:9095
zookeeper.connect=my-domain-dev01.com:2183,my-domain-dev02.com:2184,my-domain-dev03.com:2185

At this point I am able to see Kafka Cluster State just fine but when it comes to Metrics I am having the following issue:

[2020-10-08 10:07:53,467] INFO Finished sampling in 254 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2020-10-08 10:07:53,468] INFO Kicking off metric sampling for time range [770800000, 771000000], duration 200000 ms with timeout 200000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2020-10-08 10:07:53,495] INFO [Consumer clientId=CruiseControlMetricsReporterSampler-8036022001223885340-consumer-833857043, groupId=CruiseControlMetricsReporterSampler-8036022001223885340] Seeking to offset 22102016 for partition __CruiseControlMetrics-0 (org.apache.kafka.clients.consumer.KafkaConsumer)
[2020-10-08 10:07:53,702] INFO Finished sampling for topic partitions [__CruiseControlMetrics-0] in time range [770800000,771000000]. Collected 0 metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler)
[2020-10-08 10:07:53,702] INFO Collected 0 partition metric samples for 0 partitions. Total partition assigned: 218. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2020-10-08 10:07:53,702] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)

As you can see it is not collecting any metrics and I noticed that the timestamps are way off: time range [770800000, 771000000] = time range [June 5, 1994 7:06:40 AM, June 7, 1994 2:40:00 PM]
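One caveat on the conversion above: the log line reports the duration in ms, and the difference between the two bounds (200000) equals that duration, which suggests the bounds are epoch milliseconds rather than seconds. A minimal Python sketch that converts the range under both readings (the helper names `parse_window` and `as_utc` are made up for illustration and are not part of Cruise Control):

```python
import re
from datetime import datetime, timedelta, timezone

# Log line copied from the output above.
LINE = ("Kicking off metric sampling for time range [770800000, 771000000], "
        "duration 200000 ms with timeout 200000 ms.")

PATTERN = re.compile(r"time range \[(\d+), ?(\d+)\], duration (\d+) ms")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_window(line):
    """Return (start, end, duration_ms) parsed from a sampling log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a metric sampling log line")
    start, end, duration_ms = (int(g) for g in match.groups())
    return start, end, duration_ms


def as_utc(value, unit):
    """Interpret an epoch value as seconds ('s') or milliseconds ('ms')."""
    if unit == "ms":
        return EPOCH + timedelta(milliseconds=value)
    return EPOCH + timedelta(seconds=value)


start, end, duration_ms = parse_window(LINE)
# If the bounds are in milliseconds, their difference equals the logged duration.
bounds_are_ms = (end - start) == duration_ms
print("bounds consistent with milliseconds:", bounds_are_ms)
for unit in ("s", "ms"):
    print(unit, as_utc(start, unit).isoformat(), "->", as_utc(end, unit).isoformat())
```

Read as seconds, the window gives the June 1994 dates quoted above. Read as milliseconds, it falls on January 9, 1970 (22:06:40 to 22:10:00 UTC). Under either reading the window is decades older than the current date.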

Could this be the issue? Is there a way to fix this? What else could I look into?

I can see records being populated in __CruiseControlMetrics continuously, but not in __KafkaCruiseControlPartitionMetricSamples or __KafkaCruiseControlModelTrainingSamples.

I would appreciate any guidance on this.

Thank you.

efeg commented 3 years ago

Hi @dmarupov

> As you can see it is not collecting any metrics and I noticed that the timestamps are way off: time range [770800000, 771000000] = time range [June 5, 1994 7:06:40 AM, June 7, 1994 2:40:00 PM]
>
> Could this be the issue? Is there a way to fix this? What else could I look into?

This indeed seems to be the likely root cause. I suspect that the box / VM that the CC instance is running on might have a bad clock. CC uses this time range to select records by timestamp from __CruiseControlMetrics. Hence, if there are no records in this topic from 1994, it won't be able to read any records. This also explains why __KafkaCruiseControlPartitionMetricSamples and __KafkaCruiseControlModelTrainingSamples are empty: CC cannot read metrics, hence cannot generate samples to back up in these sample store topics.
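To illustrate the reasoning, here is a toy Python model of timestamp-window sampling. It is a sketch under stated assumptions, not Cruise Control's actual code: the function name `sample_window`, the inclusive window bounds, and the sample record timestamps are all hypothetical. List indexes stand in for record offsets.

```python
from bisect import bisect_left
from datetime import datetime, timezone


def sample_window(record_timestamps_ms, start_ms, end_ms):
    """Return (seek_offset, collected) for the window [start_ms, end_ms].

    record_timestamps_ms must be sorted ascending; the list index stands in
    for the record offset. Toy model only, not Cruise Control's implementation.
    """
    # Seek to the first record whose timestamp is >= start_ms.
    seek_offset = bisect_left(record_timestamps_ms, start_ms)
    collected = []
    for ts in record_timestamps_ms[seek_offset:]:
        if ts > end_ms:
            break  # past the end of the window; later records cannot match
        collected.append(ts)
    return seek_offset, collected


# Hypothetical topic contents: one record per minute from 2020-10-08 10:00 UTC.
BASE_MS = int(datetime(2020, 10, 8, 10, 0, tzinfo=timezone.utc).timestamp() * 1000)
records = [BASE_MS + i * 60_000 for i in range(10)]

# A window decades in the past seeks to the first record but collects nothing.
print(sample_window(records, 770_800_000, 771_000_000))  # (0, [])

# A window that overlaps the records collects the three that fall inside it.
seek, hits = sample_window(records, BASE_MS, BASE_MS + 120_000)
print(seek, len(hits))  # 0 3
```

In this model a stale window still produces a valid seek position, but every record read from there is newer than the window end. That is consistent with the log showing a seek followed by "Collected 0 metrics".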

Can you try running the Unix date command on the box / VM that runs CC? Is the returned time accurate?

dmarupov commented 3 years ago

Hi @efeg

Thank you for the response and sorry about the duplicate issue. I did not realize I did that. I ran the date command on my Unix box and I got the correct date and time as shown below:

[image: output of the date command]

I also ran the same command as non root user and I got the correct date and time. Is there anything in the CruiseControl Configs that would make it read records from 1994?

Thank you.

efeg commented 3 years ago

@dmarupov This is a little weird. There should not be a config that would make CC read records from '94. Can you reproduce this on a local CC deployment with local brokers (e.g. 2 brokers)? Can you also share your CC version?

jrevillard commented 3 years ago

I have the exact same issue on a brand-new infrastructure (ZooKeeper, Kafka, CC):

cruise-control_1  | 6165644 [MetricFetcher-0] INFO  nitor.sampling.SamplingFetcher  - Collected 0 partition metric samples for 0 partitions. Total partition assigned: 65.
cruise-control_1  | 6165644 [MetricFetcher-0] INFO  nitor.sampling.SamplingFetcher  - Collected 0 broker metric samples for 0 brokers.
cruise-control_1  | 6165644 [lingScheduler-1] INFO  .sampling.MetricFetcherManager  - Finished sampling in 531 ms.
cruise-control_1  | 6165644 [lingScheduler-1] INFO  .sampling.MetricFetcherManager  - Kicking off metric sampling for time range [986160000, 986280000], duration 120000 ms with timeout 120000 ms.
cruise-control_1  | 6166143 [MetricFetcher-0] INFO  clients.consumer.KafkaConsumer  - [Consumer clientId=CruiseControlMetricsReporterSampler-consumer--8160475407610004265, groupId=null] Seeking to offset 0 for partition __CruiseControlMetrics-0
cruise-control_1  | 6166181 [MetricFetcher-0] INFO  eControlMetricsReporterSampler  - Finished sampling for topic partitions [__CruiseControlMetrics-0] in time range [986160000,986280000]. Collected 0 metrics.

My CC version is 2.5.27.

jrevillard commented 3 years ago

For info, after a restart (and updating CC to v2.5.28), it did it again when I clicked on "Bootstrap":

cruise-control_1  | 299586 [qtp775741122-62] INFO  ler.async.AbstractAsyncRequest  - Processing sync request BootstrapRequest.
cruise-control_1  | 299595 [lingScheduler-1] INFO  rol.monitor.task.BootstrapTask  - Load monitor is bootstrapping since 0
cruise-control_1  | 299603 [lingScheduler-1] INFO  .sampling.MetricFetcherManager  - Kicking off metric sampling for time range [0, 120000], duration 120000 ms with timeout 120000 ms.
cruise-control_1  | 299610 [MetricFetcher-0] INFO  clients.consumer.KafkaConsumer  - [Consumer clientId=CruiseControlMetricsReporterSampler-consumer-2810727836355973195, groupId=null] Seeking to offset 0 for partition __CruiseControlMetrics-0
cruise-control_1  | 299661 [omalyDetector-2] INFO  .detector.AnomalyDetectorUtils  - Skipping anomaly detection because load monitor is in BOOTSTRAPPING state.
cruise-control_1  | 299968 [MetricFetcher-0] INFO  eControlMetricsReporterSampler  - Finished sampling for topic partitions [__CruiseControlMetrics-0] in time range [0,120000]. Collected 0 metrics.
cruise-control_1  | 299968 [MetricFetcher-0] INFO  nitor.sampling.SamplingFetcher  - Collected 0 partition metric samples for 0 partitions. Total partition assigned: 65.
cruise-control_1  | 299968 [MetricFetcher-0] INFO  nitor.sampling.SamplingFetcher  - Collected 0 broker metric samples for 0 brokers.
cruise-control_1  | 299969 [lingScheduler-1] INFO  .sampling.MetricFetcherManager  - Finished sampling in 366 ms.
cruise-control_1  | 299969 [lingScheduler-1] INFO  .sampling.MetricFetcherManager  - Kicking off metric sampling for time range [120000, 240000], duration 120000 ms with timeout 120000 ms.
cruise-control_1  | 299992 [MetricFetcher-0] INFO  clients.consumer.KafkaConsumer  - [Consumer clientId=CruiseControlMetricsReporterSampler-consumer-2810727836355973195, groupId=null] Seeking to offset 0 for partition __CruiseControlMetrics-0

Again, after a restart, everything works fine.

efeg commented 3 years ago

> For info, after restart, (updated CC version to v2.5.28) it did it again when I clicked on "Bootstrap":

@jrevillard What does it mean to click on Bootstrap? Are you using the bootstrap endpoint of Cruise Control (CC)? If so, this endpoint is used only for development purposes and is not really meant to be used for bootstrapping a CC instance. When CC starts, it automatically bootstraps w/o the need for any extra call.

jrevillard commented 3 years ago

@efeg OK, I was using the "Bootstrap" metric button in the CC UI. I wasn't aware that this is the normal behavior.

Thanks