linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Multiple Warning and Error messages in the logs #502

Closed saritago closed 5 years ago

saritago commented 5 years ago

Hi,

I am seeing multiple warning and error messages related to timeouts in the logs. Also, the GET commands for state are ending up in a queued state.

I initially had metric.sampling.interval.ms set to 300000. After seeing this warning I set it to 500000, but I still see these messages.

[2019-01-25 06:10:00,102] WARN Sampling did not finish in 500000 ms, skipping this sampling interval. (com.linkedin.kafka.cruisecontrol.monitor.task.SamplingTask)

In addition to these warnings, I also see the error messages below here and there:

[2019-01-25 06:16:32,212] ERROR Received exception. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher)
org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 305000 ms
[2019-01-25 06:18:20,112] ERROR Sampling scheduler received Unknown exception when waiting for sampling to finish (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager.fetchSamples(MetricFetcherManager.java:251)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager.fetchPartitionMetricSamples(MetricFetcherManager.java:199)
    at com.linkedin.kafka.cruisecontrol.monitor.task.SamplingTask.run(SamplingTask.java:56)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2019-01-25 06:18:20,113] INFO Finished sampling in 107890 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2019-01-25 06:18:20,114] WARN Sampling did not finish in 500000 ms, skipping this sampling interval. (com.linkedin.kafka.cruisecontrol.monitor.task.SamplingTask)
[2019-01-25 06:21:37,384] ERROR Received exception. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher)
org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 305000 ms

efeg commented 5 years ago

Looks like you are using a version of Cruise Control from before PR https://github.com/linkedin/cruise-control/pull/409. I assume you also observe that your executor substate is stuck at STARTING_EXECUTION; you should verify this using the state endpoint with the substates=executor parameter.
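
The check against the state endpoint could be scripted roughly as below. This is a minimal sketch, not part of the original comment: the base URL `http://localhost:9090` and the `/kafkacruisecontrol/state` path are assumed defaults that may differ in your deployment, and the helper names `state_url`, `fetch_state`, and `looks_stuck` are hypothetical.

```python
import urllib.parse
import urllib.request


def state_url(base_url, substates=("executor",)):
    """Build the state-endpoint URL for the given substates.

    The /kafkacruisecontrol/state path is an assumed default; adjust it
    if your Cruise Control instance is configured differently.
    """
    query = urllib.parse.urlencode({"substates": ",".join(substates)})
    return base_url.rstrip("/") + "/kafkacruisecontrol/state?" + query


def fetch_state(base_url, timeout_s=30):
    """Return the raw response body of the executor state request."""
    with urllib.request.urlopen(state_url(base_url), timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


def looks_stuck(state_text):
    """Heuristic: True if the response mentions STARTING_EXECUTION."""
    return "STARTING_EXECUTION" in state_text
```

Calling `looks_stuck(fetch_state("http://localhost:9090"))` a few times, some minutes apart, would show whether the executor stays in STARTING_EXECUTION; a single positive result only means an execution was starting at that moment.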

To pick up the relevant fix, use Cruise Control versions:

saritago commented 5 years ago

But I have the same version running on several other clusters and they work just fine. Does it have anything to do with the volume of data on the Kafka clusters?

kidkun commented 5 years ago

From the exception you pasted, it looks like Cruise Control is unable to get the offset corresponding to a certain timestamp for the metric topic __CruiseControlMetrics. Can you first confirm that this topic exists in the cluster and that the reporters are properly producing messages to it?

Another thing is that, as Efe mentioned, if it is transient, the newer version will make sure that sampling failures do not block operations from executing.
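
One way to run both checks is with the command-line tools that ship with Kafka. The following is a rough sketch, not part of the original comment: it assumes `kafka-topics.sh` and `kafka-console-consumer.sh` are on the PATH, the connection strings are placeholders, and the helper names are hypothetical. Older Kafka releases expect `--zookeeper` for `kafka-topics.sh`, while newer ones expect `--bootstrap-server`, so the sketch lets you pick.

```python
import subprocess

METRICS_TOPIC = "__CruiseControlMetrics"


def describe_topic_cmd(connect, use_zookeeper=False, topic=METRICS_TOPIC):
    """Command that describes the topic; it reports nothing useful if the topic is missing."""
    flag = "--zookeeper" if use_zookeeper else "--bootstrap-server"
    return ["kafka-topics.sh", flag, connect, "--describe", "--topic", topic]


def sample_one_message_cmd(bootstrap_server, topic=METRICS_TOPIC, timeout_ms=60000):
    """Command that waits for a single new message, giving up after timeout_ms."""
    return [
        "kafka-console-consumer.sh",
        "--bootstrap-server", bootstrap_server,
        "--topic", topic,
        "--max-messages", "1",
        "--timeout-ms", str(timeout_ms),
    ]


def run(cmd):
    """Run a command and return the CompletedProcess with captured text output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

For example, `run(describe_topic_cmd("broker1:9092"))` followed by `run(sample_one_message_cmd("broker1:9092"))` covers both questions. The consumed payload is serialized metric data and will not be human-readable; the point is only whether a message arrives before the timeout.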

efeg commented 5 years ago

Closing the issue, as https://github.com/linkedin/cruise-control/issues/502#issuecomment-457866690 provides the solution to a known issue. As discussed in the Gitter channel, this is a concurrency bug; hence, it is possible that you haven't observed this behavior on other clusters so far.

saritago commented 5 years ago

@efeg @kidkun I have downloaded the latest CC code and am still seeing warnings such as:

[2019-02-26 10:18:18,823] WARN Encountered error when loading sample from Kafka. (com.linkedin.kafka.cruisecontrol.monitor.sampling.KafkaSampleStore)
org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 50000 ms
[2019-02-26 10:27:26,226] ERROR Sampling scheduler received Unknown exception when waiting for sampling to finish (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
java.util.concurrent.TimeoutException
[2019-02-26 10:27:26,227] WARN Sampling did not finish in 300000 ms, skipping this sampling interval. (com.linkedin.kafka.cruisecontrol.monitor.task.SamplingTask)

saritago commented 5 years ago

Please ignore the above ping; it was my mistake. I was using an older client.