linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

the snapshots are not getting generated #1174

Closed ankurggbpec closed 4 years ago

ankurggbpec commented 4 years ago

ERROR Error processing GET request '/load' due to 'com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1586332981886]'. (com.linkedin.kafka.cruisecontrol.servlet.KafkaCruiseControlServlet)
java.util.concurrent.ExecutionException: com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1586332981886]
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at com.linkedin.kafka.cruisecontrol.servlet.KafkaCruiseControlServlet.getAndMaybeReturnProgress(KafkaCruiseControlServlet.java:994)
    at com.linkedin.kafka.cruisecontrol.servlet.KafkaCruiseControlServlet.getClusterLoad(KafkaCruiseControlServlet.java:487)
    at com.linkedin.kafka.cruisecontrol.servlet.KafkaCruiseControlServlet.doGet(KafkaCruiseControlServlet.java:190)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:841)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:564)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
    at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
    at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:673)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:591)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1586332981886]
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.clusterModel(KafkaCruiseControl.java:376)
    at com.linkedin.kafka.cruisecontrol.async.GetBrokerStatsRunnable.getResult(GetBrokerStatsRunnable.java:37)
    at com.linkedin.kafka.cruisecontrol.async.GetBrokerStatsRunnable.getResult(GetBrokerStatsRunnable.java:19)
    at com.linkedin.kafka.cruisecontrol.async.OperationRunnable.run(OperationRunnable.java:45)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
Caused by: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1586332981886]
    at com.linkedin.cruisecontrol.monitor.sampling.aggregator.MetricSampleAggregator.aggregate(MetricSampleAggregator.java:197)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.aggregator.KafkaPartitionMetricSampleAggregator.aggregate(KafkaPartitionMetricSampleAggregator.java:150)
    at com.linkedin.kafka.cruisecontrol.monitor.LoadMonitor.clusterModel(LoadMonitor.java:423)
    at com.linkedin.kafka.cruisecontrol.monitor.LoadMonitor.clusterModel(LoadMonitor.java:390)
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.clusterModel(KafkaCruiseControl.java:370)
    ... 8 more
[2020-04-08 08:03:01,896] INFO Closing SessionKey SessionKey{_httpSession=org.eclipse.jetty.server.session.Session@33fd5e8c,_requestUrl=GET /kafkacruisecontrol/load,_queryParams={allow_capacity_estimation=[true], json=[true]}} and UserTaskId e448a714-83df-4bf2-94d6-0a728798e7db (com.linkedin.kafka.cruisecontrol.servlet.UserTaskManager)
[2020-04-08 08:03:01,896] INFO Invalidate SessionKey SessionKey{_httpSession=org.eclipse.jetty.server.session.Session@33fd5e8c,_requestUrl=GET /kafkacruisecontrol/load,_queryParams={allow_capacity_estimation=[true], json=[true]}} (com.linkedin.kafka.cruisecontrol.servlet.UserTaskManager)
[2020-04-08 08:03:01,896] INFO Session node01832dn7yfml1jjx9znftl3cjq1 already being invalidated (org.eclipse.jetty.server.session)
[2020-04-08 08:03:03,560] INFO UserTask e448a714-83df-4bf2-94d6-0a728798e7db is complete and removed from active tasks list (com.linkedin.kafka.cruisecontrol.servlet.UserTaskManager)
[2020-04-08 08:03:07,786] INFO Received Request%28GET+%2F%2F10.169.149.11%3A9090%2Fkafkacruisecontrol%2Fkafka_cluster_state%3Fjson%3Dtrue%29%401c28460d, http%3A%2F%2F10.169.149.11%3A9090%2Fkafkacruisecontrol%2Fkafka_cluster_state from 172.16.29.200 (CruiseControlPublicAccessLogger)
[2020-04-08 08:03:23,662] INFO Skipping best proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2020-04-08 08:03:53,663] INFO Skipping best proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2020-04-08 08:04:23,567] INFO Kicking off partition metric sampling for time range [1586332943567, 1586333063567], duration 120000 ms using 1 fetchers with timeout 120000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2020-04-08 08:04:23,663] INFO Skipping best proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2020-04-08 08:04:28,571] INFO Finished sampling for topic partitions [CruiseControlMetrics-7, CruiseControlMetrics-6, CruiseControlMetrics-5, CruiseControlMetrics-4, CruiseControlMetrics-3, CruiseControlMetrics-2, CruiseControlMetrics-1, CruiseControlMetrics-0] in time range [1586332943567,1586333063567]. Collected 0 metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler)
[2020-04-08 08:04:28,572] INFO Collected 0 partition metric samples for 0 partitions. Total partition assigned: 899. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2020-04-08 08:04:28,572] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2020-04-08 08:04:28,572] INFO Finished sampling in 5004 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2020-04-08 08:04:53,664] INFO Skipping best proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2020-04-08 08:05:23,665] INFO Skipping best proposal precomputing because load monit
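
The window range in the exception and the sampling interval in the log above are given in epoch milliseconds, which makes them hard to line up with the bracketed log timestamps. A minimal Python sketch for converting them is shown here; the helper name `millis_to_utc` is hypothetical and not part of Cruise Control, and the `-1` lower bound is not a real timestamp, so it is left unconverted.

```python
from datetime import datetime, timedelta, timezone

# Epoch start, timezone-aware so results are unambiguous UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_utc(ms):
    """Convert epoch milliseconds to an ISO-8601 UTC string (hypothetical helper)."""
    return (EPOCH + timedelta(milliseconds=ms)).isoformat(timespec="milliseconds")


# Values copied from the log lines above.
window_end = 1586332981886
sample_start, sample_end = 1586332943567, 1586333063567

print("window upper bound:", millis_to_utc(window_end))
print("sampling range:", millis_to_utc(sample_start), "->", millis_to_utc(sample_end))
print("sampling duration (ms):", sample_end - sample_start)
```

The converted upper bound falls within a few milliseconds of the log timestamp on the `Closing SessionKey` line, which suggests the log clock is in UTC and that the range simply ends at the time of the request.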

ankurggbpec commented 4 years ago

Cruise_control

ankurggbpec commented 4 years ago

The Kafka version is 0.10 and we are using Cruise Control 0.1.10. We also tried Cruise Control 2.x.

ankurggbpec commented 4 years ago

We are also getting the error below:

[2020-04-08 08:59:07,810] ERROR Error building partition metric sample for TOPIC_NAME-0 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
java.lang.IllegalArgumentException: Broker metric ALL_TOPIC_REPLICATION_BYTES_OUT does not exist.
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor$BrokerLoad.brokerMetric(CruiseControlMetricsProcessor.java:380)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor$BrokerLoad.access$300(CruiseControlMetricsProcessor.java:302)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.buildPartitionMetricSample(CruiseControlMetricsProcessor.java:261)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.addPartitionMetricSamples(CruiseControlMetricsProcessor.java:126)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.process(CruiseControlMetricsProcessor.java:87)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler.getSamples(CruiseControlMetricsReporterSampler.java:126)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchSamples(SamplingFetcher.java:105)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchMetricsForAssignedPartitions(SamplingFetcher.java:85)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:24)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:16)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2020-04-08 08:59:07,810] WARN Skip generating broker metric sample for broker 1002 because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, ALL_TOPIC_REPLICATION_BYTES_OUT]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2020-04-08 08:59:07,810] WARN Skip generating broker metric sample for broker 1001 because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, ALL_TOPIC_REPLICATION_BYTES_OUT]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2020-04-08 08:59:07,810] WARN Skip generating broker metric sample for broker 1003 because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, ALL_TOPIC_REPLICATION_BYTES_OUT]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2020-04-08 08:59:07,810] INFO Generated 0(971 skipped) partition m

jtrenholm-jask commented 4 years ago

I have found that if snapshots aren't being generated it might be due to a broker that isn't generating stats for cruise control. I typically look for this line and will restart the broker when I see it.

[2020-04-08 08:59:07,810] WARN Skip generating broker metric sample for broker 1003 because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, ALL_TOPIC_REPLICATION_BYTES_OUT]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)

Try restarting broker 1003 and the others
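
Spotting that WARN line by eye gets tedious on a busy log, so here is a rough Python sketch that scans log text for it and groups the missing metrics by broker. The function name `brokers_missing_metrics` and the regular expression are assumptions written against the exact wording of the line quoted above; they are not part of Cruise Control, and the wording may differ in other versions.

```python
import re
from collections import defaultdict

# Matches the WARN line quoted above; the wording is assumed to be stable.
SKIP_RE = re.compile(
    r"Skip generating broker metric sample for broker (\d+) "
    r"because the following metrics are missing \[([^\]]*)\]"
)


def brokers_missing_metrics(log_text):
    """Return {broker_id: set of metric names reported missing} (hypothetical helper)."""
    missing = defaultdict(set)
    for match in SKIP_RE.finditer(log_text):
        broker_id = int(match.group(1))
        # The bracketed list is comma-separated; strip whitespace and line wraps.
        names = [name.strip() for name in match.group(2).split(",") if name.strip()]
        missing[broker_id].update(names)
    return dict(missing)


# Sample lines copied from the log in this thread.
SAMPLE = (
    "[2020-04-08 08:59:07,810] WARN Skip generating broker metric sample for broker 1001 "
    "because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, "
    "ALL_TOPIC_REPLICATION_BYTES_OUT]. (...CruiseControlMetricsProcessor)\n"
    "[2020-04-08 08:59:07,810] WARN Skip generating broker metric sample for broker 1003 "
    "because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, "
    "ALL_TOPIC_REPLICATION_BYTES_OUT]. (...CruiseControlMetricsProcessor)\n"
)

for broker, metrics in sorted(brokers_missing_metrics(SAMPLE).items()):
    print(broker, sorted(metrics))
```

If every broker shows up with the same missing metrics, a single stuck broker is less likely to be the cause than something common to the whole cluster.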

ankurggbpec commented 4 years ago

I restarted the broker but am still facing the issue.

Also, when I start Cruise Control using the 2.0 version, the metrics start getting generated. Even if I then stop Cruise Control and start it with the .11 version, the metrics are still generated, but after some time I get the error below:

[2020-04-14 03:42:41,287] ERROR Error building partition metric sample for (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor) java.lang.IllegalArgumentException: Broker metric ALL_TOPIC_REPLICATION_BYTES_OUT does not exist.

This is happening for all the brokers, although in the Monitor tab of the Cruise Control UI I can see the snapshot count as 1.

efeg commented 4 years ago

Hi @ankurggbpec. Unfortunately, the minimum version of Kafka supported by Cruise Control is 0.11 (please see https://github.com/linkedin/cruise-control#environment-requirements). Would you be able to upgrade your Kafka cluster to 0.11+? Kafka 0.10 is a fairly old version.
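
For anyone scripting a pre-flight check, the minimum-version requirement quoted above (0.11) can be expressed as a small Python sketch. The names `parse_version`, `is_supported_kafka`, and `MIN_KAFKA` are hypothetical; the comparison is done on numeric tuples because comparing version strings directly would rank "0.9" above "0.11".

```python
# Minimum Kafka version quoted in this thread (major, minor).
MIN_KAFKA = (0, 11)


def parse_version(version):
    """Split a dotted version string into a tuple of ints (hypothetical helper)."""
    return tuple(int(part) for part in version.split("."))


def is_supported_kafka(version):
    """True if the major.minor part of the version is at least MIN_KAFKA."""
    return parse_version(version)[:2] >= MIN_KAFKA


for candidate in ("0.9", "0.10", "0.11.0.0", "2.4.0"):
    print(candidate, is_supported_kafka(candidate))
```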

efeg commented 4 years ago

Closing the issue. @ankurggbpec please feel free to reopen if you have further questions.