linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.

Cruise Control stuck in Bootstrapping infinitely when fetching Metrics from __CruiseControlMetrics #1222

Closed rohit-kulk closed 4 years ago

rohit-kulk commented 4 years ago

I am using Cruise Control 2.4.0 with Kafka 2.4.1.

In the Cruise Control UI, under the tab Cruise Control Stage -> Monitor State, the monitor stays stuck on 'Bootstrapping' indefinitely.

The only way to fix this is to restart Cruise Control. After the restart, it starts again from the latest offset for metrics.

In the Cruise Control logs, the following repeats in a loop during bootstrapping:

[2020-06-01 17:51:43,481] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2020-06-01 17:51:43,481] INFO Finished sampling in 200 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2020-06-01 17:51:43,481] INFO Kicking off partition metric sampling for time range [147734280000, 147734400000], duration 120000 ms with timeout 120000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2020-06-01 17:51:43,499] INFO [Consumer clientId=CruiseControlMetricsReporterSampler--2769550922101085340-consumer-1323935322, groupId=CruiseControlMetricsReporterSampler--2769550922101085340] Seeking to offset 0 for partition __CruiseControlMetrics-0 (org.apache.kafka.clients.consumer.KafkaConsumer)
[2020-06-01 17:51:43,499] INFO [Consumer clientId=CruiseControlMetricsReporterSampler--2769550922101085340-consumer-1323935322, groupId=CruiseControlMetricsReporterSampler--2769550922101085340] Seeking to offset 0 for partition __CruiseControlMetrics-2 (org.apache.kafka.clients.consumer.KafkaConsumer)
[2020-06-01 17:51:43,499] INFO [Consumer clientId=CruiseControlMetricsReporterSampler--2769550922101085340-consumer-1323935322, groupId=CruiseControlMetricsReporterSampler--2769550922101085340] Seeking to offset 0 for partition __CruiseControlMetrics-1 (org.apache.kafka.clients.consumer.KafkaConsumer)
[2020-06-01 17:51:43,695] INFO Finished sampling for topic partitions [__CruiseControlMetrics-0, __CruiseControlMetrics-2, __CruiseControlMetrics-1] in time range [147734280000,147734400000]. Collected 0 metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler)
[2020-06-01 17:51:43,695] INFO Collected 0 partition metric samples for 0 partitions. Total partition assigned: 134. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
rohit-kulk commented 4 years ago

The number of snapshots never goes over 1.

Error in Cruise Control UI:

ERROR: Error processing GET request '/proposals' due to: 'com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There are only 0 valid windows when aggregating in range [-1, 1591124626435] for aggregation options (minValidEntityRatio=0.95, minValidEntityGroupRatio=0.00, minValidWindows=1, numEntitiesToInclude=123, granularity=ENTITY)'. 

Adding Debug Logs from Cruise Control:

Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,063] INFO Kicking off partition metric sampling for time range [208320000, 208440000], duration 120000 ms with timeout 120000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,063] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Sending ListOffsetRequest (type=ListOffsetRequest, replicaId=-1, partitionTimestamps={__CruiseControlMetrics-1={timestamp: -1, maxNumOffsets: 1, currentLeaderEpoch: Optional[9]}}, isolationLevel=READ_UNCOMMITTED) to broker kafkastage2.data.nvgrid.net:9092 (id: 1002 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,063] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Sending ListOffsetRequest (type=ListOffsetRequest, replicaId=-1, partitionTimestamps={__CruiseControlMetrics-2={timestamp: -1, maxNumOffsets: 1, currentLeaderEpoch: Optional[7]}, __CruiseControlMetrics-0={timestamp: -1, maxNumOffsets: 1, currentLeaderEpoch: Optional[7]}}, isolationLevel=READ_UNCOMMITTED) to broker kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,071] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Node 1001 sent an incremental fetch response for session 1315893065 with 1 response partition(s), 1 implied partition(s) (org.apache.kafka.clients.FetchSessionHandler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,071] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Fetch READ_UNCOMMITTED at offset 16428 for partition __CruiseControlMetrics-0 returned fetch data (error=NONE, highWaterMark=19668, lastStableOffset = 19668, logStartOffset = 0, preferredReadReplica = absent, abortedTransactions = null, recordsSizeInBytes=206700) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,071] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Handling ListOffsetResponse response for __CruiseControlMetrics-2. Fetched offset 139, timestamp -1 (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,071] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Handling ListOffsetResponse response for __CruiseControlMetrics-0. Fetched offset 19668, timestamp -1 (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,507] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Node 1002 sent an incremental fetch response for session 1916856530 with 0 response partition(s), 1 implied partition(s) (org.apache.kafka.clients.FetchSessionHandler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,507] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Handling ListOffsetResponse response for __CruiseControlMetrics-1. Fetched offset 196, timestamp -1 (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,507] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Sending ListOffsetRequest (type=ListOffsetRequest, replicaId=-1, partitionTimestamps={__CruiseControlMetrics-1={timestamp: 208320000, maxNumOffsets: 1, currentLeaderEpoch: Optional[9]}}, isolationLevel=READ_UNCOMMITTED) to broker kafkastage2.data.nvgrid.net:9092 (id: 1002 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,507] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Sending ListOffsetRequest (type=ListOffsetRequest, replicaId=-1, partitionTimestamps={__CruiseControlMetrics-2={timestamp: 208320000, maxNumOffsets: 1, currentLeaderEpoch: Optional[7]}, __CruiseControlMetrics-0={timestamp: 208320000, maxNumOffsets: 1, currentLeaderEpoch: Optional[7]}}, isolationLevel=READ_UNCOMMITTED) to broker kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Handling ListOffsetResponse response for __CruiseControlMetrics-1. Fetched offset 0, timestamp 1591120847252 (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Handling ListOffsetResponse response for __CruiseControlMetrics-2. Fetched offset 0, timestamp 1591121134540 (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Handling ListOffsetResponse response for __CruiseControlMetrics-0. Fetched offset 0, timestamp 1591119992640 (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] INFO [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Seeking to offset 0 for partition __CruiseControlMetrics-0 (org.apache.kafka.clients.consumer.KafkaConsumer)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] INFO [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Seeking to offset 0 for partition __CruiseControlMetrics-2 (org.apache.kafka.clients.consumer.KafkaConsumer)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] INFO [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Seeking to offset 0 for partition __CruiseControlMetrics-1 (org.apache.kafka.clients.consumer.KafkaConsumer)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Resuming partitions [__CruiseControlMetrics-2, __CruiseControlMetrics-1, __CruiseControlMetrics-0] (org.apache.kafka.clients.consumer.KafkaConsumer)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Discarding stale fetch response for partition __CruiseControlMetrics-0 since its offset 16428 does not match the expected offset FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null), epoch=7}} (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Added READ_UNCOMMITTED fetch request for partition __CruiseControlMetrics-1 at position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=kafkastage2.data.nvgrid.net:9092 (id: 1002 rack: null), epoch=9}} to node kafkastage2.data.nvgrid.net:9092 (id: 1002 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Added READ_UNCOMMITTED fetch request for partition __CruiseControlMetrics-2 at position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null), epoch=7}} to node kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Added READ_UNCOMMITTED fetch request for partition __CruiseControlMetrics-0 at position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null), epoch=7}} to node kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Built incremental fetch (sessionId=1916856530, epoch=3188) for node 1002. Added 0 partition(s), altered 1 partition(s), removed 0 partition(s) out of 1 partition(s) (org.apache.kafka.clients.FetchSessionHandler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Built incremental fetch (sessionId=1315893065, epoch=3472) for node 1001. Added 0 partition(s), altered 2 partition(s), removed 0 partition(s) out of 2 partition(s) (org.apache.kafka.clients.FetchSessionHandler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(__CruiseControlMetrics-1), toForget=(), implied=()) to broker kafkastage2.data.nvgrid.net:9092 (id: 1002 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,508] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(__CruiseControlMetrics-2, __CruiseControlMetrics-0), toForget=(), implied=()) to broker kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,509] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Node 1002 sent an incremental fetch response for session 1916856530 with 1 response partition(s) (org.apache.kafka.clients.FetchSessionHandler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,509] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Fetch READ_UNCOMMITTED at offset 0 for partition __CruiseControlMetrics-1 returned fetch data (error=NONE, highWaterMark=196, lastStableOffset = 196, logStartOffset = 0, preferredReadReplica = absent, abortedTransactions = null, recordsSizeInBytes=17480) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,509] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Added READ_UNCOMMITTED fetch request for partition __CruiseControlMetrics-1 at position FetchPosition{offset=196, offsetEpoch=Optional[9], currentLeader=LeaderAndEpoch{leader=kafkastage2.data.nvgrid.net:9092 (id: 1002 rack: null), epoch=9}} to node kafkastage2.data.nvgrid.net:9092 (id: 1002 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,509] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Built incremental fetch (sessionId=1916856530, epoch=3189) for node 1002. Added 0 partition(s), altered 1 partition(s), removed 0 partition(s) out of 1 partition(s) (org.apache.kafka.clients.FetchSessionHandler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,509] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(__CruiseControlMetrics-1), toForget=(), implied=()) to broker kafkastage2.data.nvgrid.net:9092 (id: 1002 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,509] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Pausing partitions [__CruiseControlMetrics-1] (org.apache.kafka.clients.consumer.KafkaConsumer)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,518] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Node 1001 sent an incremental fetch response for session 1315893065 with 2 response partition(s) (org.apache.kafka.clients.FetchSessionHandler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,518] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Fetch READ_UNCOMMITTED at offset 0 for partition __CruiseControlMetrics-2 returned fetch data (error=NONE, highWaterMark=139, lastStableOffset = 139, logStartOffset = 0, preferredReadReplica = absent, abortedTransactions = null, recordsSizeInBytes=15301) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,518] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Fetch READ_UNCOMMITTED at offset 0 for partition __CruiseControlMetrics-0 returned fetch data (error=NONE, highWaterMark=19668, lastStableOffset = 19668, logStartOffset = 0, preferredReadReplica = absent, abortedTransactions = null, recordsSizeInBytes=1048576) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,530] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Added READ_UNCOMMITTED fetch request for partition __CruiseControlMetrics-2 at position FetchPosition{offset=139, offsetEpoch=Optional[7], currentLeader=LeaderAndEpoch{leader=kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null), epoch=7}} to node kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,530] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Added READ_UNCOMMITTED fetch request for partition __CruiseControlMetrics-0 at position FetchPosition{offset=16428, offsetEpoch=Optional[7], currentLeader=LeaderAndEpoch{leader=kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null), epoch=7}} to node kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,530] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Built incremental fetch (sessionId=1315893065, epoch=3473) for node 1001. Added 0 partition(s), altered 2 partition(s), removed 0 partition(s) out of 2 partition(s) (org.apache.kafka.clients.FetchSessionHandler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,530] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(__CruiseControlMetrics-2, __CruiseControlMetrics-0), toForget=(), implied=()) to broker kafkastage1.data.nvgrid.net:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,535] DEBUG [Consumer clientId=CruiseControlMetricsReporterSampler--1866212688180576773-consumer-1813646130, groupId=CruiseControlMetricsReporterSampler--1866212688180576773] Pausing partitions [__CruiseControlMetrics-2, __CruiseControlMetrics-0] (org.apache.kafka.clients.consumer.KafkaConsumer)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,535] INFO Finished sampling for topic partitions [__CruiseControlMetrics-0, __CruiseControlMetrics-2, __CruiseControlMetrics-1] in time range [208320000,208440000]. Collected 0 metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,535] INFO Collected 0 partition metric samples for 0 partitions. Total partition assigned: 123. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,535] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
Jun 02 19:04:14 cruisecontrolstage2 cruise-control[11824]: [2020-06-02 19:04:14,535] INFO Finished sampling in 472 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
rohit-kulk commented 4 years ago

When I kick off Bootstrap, I see this in Logs:

INFO Kicking off partition metric sampling for time range [208320000, 208440000]

So it seems like it's sampling for the wrong time range during bootstrap?
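To check whether that range is plausible, the values can be read as Unix-epoch milliseconds and printed as dates. This is only a small sketch: the helper name `ms_to_utc` is made up for illustration, and the numbers are copied from the debug logs above (the sampling range and the timestamp the broker returned for `__CruiseControlMetrics-1`).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> str:
    """Render a Unix-epoch millisecond timestamp as an ISO-8601 UTC string.

    Hypothetical helper for this thread, not part of Cruise Control.
    """
    return (EPOCH + timedelta(milliseconds=ms)).isoformat()


# Values copied from the logs in this issue.
sampling_range_ms = (208320000, 208440000)  # "Kicking off partition metric sampling for time range"
first_record_ms = 1591120847252             # timestamp in the ListOffsetResponse for partition 1

print("sampling window start:", ms_to_utc(sampling_range_ms[0]))
print("sampling window end:  ", ms_to_utc(sampling_range_ms[1]))
print("earliest record:      ", ms_to_utc(first_record_ms))
```

Read this way, the sampling window falls in January 1970 while the earliest record in the topic is from June 2020, which would be consistent with every window coming back with 0 metrics.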

When I restart Cruise Control, it works as expected. But when I run bootstrap metrics, it's stuck in an infinite loop.

rohit-kulk commented 4 years ago

I found the issue. The Cruise Control UI automatically starts the bootstrap from time = 0 with no end time, which causes it to get stuck in an infinite loop.
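A rough back-of-the-envelope sketch of why this looks infinite, assuming (this is my reading of the logs, not something confirmed in the Cruise Control source) that bootstrap walks forward from the start time one 120000 ms window at a time. The end value is the upper bound from the `/proposals` error above; `windows_between` is a made-up helper name.

```python
WINDOW_MS = 120_000                 # "duration 120000 ms" from the sampling log lines
DAY_MS = 24 * 60 * 60 * 1000


def windows_between(start_ms: int, end_ms: int, window_ms: int = WINDOW_MS) -> int:
    """Count whole sampling windows between two epoch-millisecond timestamps.

    Hypothetical helper; assumes fixed-size, back-to-back windows.
    """
    return max(0, (end_ms - start_ms) // window_ms)


end_ms = 1591124626435              # upper bound of the range in the /proposals error

from_epoch = windows_between(0, end_ms)
last_day = windows_between(end_ms - DAY_MS, end_ms)

print(f"windows when starting at time 0:   {from_epoch}")
print(f"windows when starting 1 day back:  {last_day}")
```

Under that assumption, starting at 0 means stepping through more than 13 million empty windows before reaching any real data, versus 720 windows for a one-day range.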

Instead, the UI should offer start and end time options for bootstrapping metrics, e.g. start time = now - 1 day and end time = now.

Without this option, the UI is unusable for bootstrapping metrics. I was able to work around it by calling the API manually with hard-coded start and end times:

kafkacruisecontrol/bootstrap?clearmetrics=true&start=1591130000000&end=1591137027000&json=true
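Rather than hard-coding the timestamps, the same call can be built for "now - 1 day" to "now". A minimal sketch: the path and the `clearmetrics`, `start`, `end`, and `json` parameters are taken from the call above, while the base URL `http://localhost:9090` and the function name `bootstrap_url` are assumptions for illustration only. It only builds the URL and does not send the request.

```python
import time
from typing import Optional
from urllib.parse import urlencode

DAY_MS = 24 * 60 * 60 * 1000


def bootstrap_url(base_url: str, now_ms: Optional[int] = None, lookback_ms: int = DAY_MS) -> str:
    """Build a bootstrap URL covering [now - lookback, now] in epoch milliseconds.

    Hypothetical helper; base_url is wherever your Cruise Control instance listens.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    params = {
        "clearmetrics": "true",
        "start": now_ms - lookback_ms,
        "end": now_ms,
        "json": "true",
    }
    return f"{base_url.rstrip('/')}/kafkacruisecontrol/bootstrap?{urlencode(params)}"


# Assumed host and port; replace with your own deployment.
print(bootstrap_url("http://localhost:9090"))
```

Passing an explicit `now_ms` makes the result reproducible, which is handy when comparing against the time ranges that show up in the sampling logs.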

rohit-kulk commented 4 years ago

Opened a separate issue for this: https://github.com/linkedin/cruise-control-ui/issues/46