linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Get cluster load and partition load API fail with NotEnoughValidWindowsException #1885

Open dtakis opened 2 years ago

dtakis commented 2 years ago

After following the AWS instructions for installing and configuring Cruise Control and Cruise Control UI against Kafka MSK, I ended up in the following situation: Cruise Control (2.5.42) is connected to MSK (Kafka 2.7.1), but some API calls throw NotEnoughValidWindowsException. In the Cruise Control UI (0.4.0) I also see the CORS message below, even though I configured Cruise Control following the suggested configuration.

CORS

User-Task-ID header is not found in the response from the server. If you are using [CORS](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS), please add necessary configuration to your Cruise Control as described [in this wiki.](https://github.com/linkedin/cruise-control-ui/wiki/CORS-Method)
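For anyone debugging the same message: a browser only lets a cross-origin page read a response header such as User-Task-ID when the server lists it in Access-Control-Expose-Headers. Below is a minimal sketch of that check in Python. The function name `diagnose_cors` and the example origin are hypothetical, and it assumes the response headers have already been captured (for example from the browser's network tab).

```python
def diagnose_cors(response_headers, ui_origin):
    """Return a list of CORS problems that would hide User-Task-ID from a browser UI."""
    # HTTP header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in response_headers.items()}
    problems = []

    # The browser rejects the response unless the UI's origin is allowed.
    allowed_origin = headers.get("access-control-allow-origin")
    if allowed_origin is None:
        problems.append("Access-Control-Allow-Origin is missing")
    elif allowed_origin not in ("*", ui_origin):
        problems.append(
            f"Access-Control-Allow-Origin is {allowed_origin!r}, not {ui_origin!r}"
        )

    # Non-safelisted headers are readable by scripts only if explicitly exposed.
    raw_exposed = headers.get("access-control-expose-headers", "")
    exposed = {h.strip().lower() for h in raw_exposed.split(",") if h.strip()}
    if "user-task-id" not in exposed and "*" not in exposed:
        problems.append("User-Task-ID is not listed in Access-Control-Expose-Headers")

    return problems


# Hypothetical captured headers from a Cruise Control response.
captured = {
    "Access-Control-Allow-Origin": "http://localhost:8090",
    "User-Task-ID": "example-task-id",
}
for problem in diagnose_cors(captured, "http://localhost:8090"):
    print(problem)
```

An empty result means the CORS headers themselves look fine, so the missing header is more likely a symptom of the request failing for another reason.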

/load and /partition_load API call failures

Error processing GET request '/load' due to: 'com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1660737061113] (index [1, -1]). Window index (current: 0, oldest: 0)

Error processing GET request '/partition_load' due to: 'com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1660736535954] (index [1, -1]). Window index (current: 0, oldest: 0)

I noticed that the monitor state is continuously running and training is stuck at 0.00%, while the __CruiseControlMetrics topic has not been created. I guess Cruise Control never reaches the point where it can create and write to this topic. The topics

__KafkaCruiseControlModelTrainingSamples
__KafkaCruiseControlPartitionMetricSamples 

were successfully created though.
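The topic check described above can be scripted. This is a minimal sketch, assuming the topic names have been saved as plain text, one per line, which is the format printed by `kafka-topics.sh --bootstrap-server <broker> --list`. The helper name `missing_topics` is hypothetical; the three topic names are the ones mentioned in this issue.

```python
# Topics named in this issue; other deployments may configure different names.
EXPECTED_TOPICS = (
    "__CruiseControlMetrics",
    "__KafkaCruiseControlModelTrainingSamples",
    "__KafkaCruiseControlPartitionMetricSamples",
)


def missing_topics(topic_listing, expected=EXPECTED_TOPICS):
    """Return the expected topics that do not appear in a one-topic-per-line listing."""
    present = {line.strip() for line in topic_listing.splitlines() if line.strip()}
    return [topic for topic in expected if topic not in present]


# Listing that mirrors the situation described above.
listing = """
__KafkaCruiseControlModelTrainingSamples
__KafkaCruiseControlPartitionMetricSamples
"""
print(missing_topics(listing))  # → ['__CruiseControlMetrics']
```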

Thank you in advance for your insights. I see that these errors are commonly reported here, but I could not make any of the suggestions work.

dtakis commented 2 years ago

Hello! Has anyone seen this before? I believe the issue affects the creation of the __CruiseControlMetrics topic.

dtakis commented 2 years ago

I see a similar issue, possibly a duplicate of mine: https://github.com/linkedin/cruise-control/issues/1904

aws0 commented 2 years ago

Not sure if this helps, but I recently faced a similar issue on a Docker-based install of CC. It turned out to be an issue with the metrics reporter not bootstrapping to the cluster; we forgot to enter the proper authentication configs in server.properties (we use Kerberos with SASL_SSL).

These are the server.properties configs that we fixed / provided:

cruise.control.metrics.reporter.ssl.truststore.location
cruise.control.metrics.reporter.ssl.truststore.password
cruise.control.metrics.reporter.bootstrap.servers
cruise.control.metrics.reporter.sasl.mechanism
cruise.control.metrics.reporter.sasl.kerberos.service.name
cruise.control.metrics.reporter.sasl.jaas.config

After setting the above to proper values and restarting the Kafka cluster nodes (I also restarted the CC server, but maybe that is not required), we noticed that the "User-Task-ID missing ... NotEnoughValidWindowsException" error disappeared after a while and data started showing up on that tab. So this error seems misleading, IMHO: the real cause is that there is not enough data in the CC topics on the Kafka cluster, which is produced by the metrics sampler client (cruise-control-metrics-reporter.jar). In our case the metrics reporter was not able to bootstrap properly and hence was not producing anything to the CC topics.
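To catch the same omission early, one could scan server.properties for the reporter keys listed above. This is a rough sketch, not a full Java properties parser: it only understands `key=value` lines and comment lines. The function names are hypothetical, and the key list is the one from our Kerberos over SASL_SSL setup, so other authentication setups will likely need a different list.

```python
# Keys from the list above (Kerberos with SASL_SSL); adjust for other setups.
REQUIRED_KEYS = (
    "cruise.control.metrics.reporter.ssl.truststore.location",
    "cruise.control.metrics.reporter.ssl.truststore.password",
    "cruise.control.metrics.reporter.bootstrap.servers",
    "cruise.control.metrics.reporter.sasl.mechanism",
    "cruise.control.metrics.reporter.sasl.kerberos.service.name",
    "cruise.control.metrics.reporter.sasl.jaas.config",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            props[key.strip()] = value.strip()
    return props


def unset_reporter_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    props = parse_properties(text)
    return [key for key in required if not props.get(key)]


# Hypothetical excerpt of a broker's server.properties.
sample = """
# metrics reporter
cruise.control.metrics.reporter.bootstrap.servers=broker1:9093
cruise.control.metrics.reporter.sasl.mechanism=
"""
for key in unset_reporter_keys(sample):
    print("not set:", key)
```

A key that is present but empty is reported as well, since an empty value is as unhelpful to the reporter as a missing one.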

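In addition to the above, note the following:

  • I'm using a RHEL 7 host for the Docker host
  • I'm using a more recent version of CC: tag: 2.5.101
  • upgraded systemd to version 234
  • upgraded kernel to 5.x which has better cgroup capabilities
  • upgraded Docker service on host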

I hope this helps.

felipeavilis commented 8 months ago

> Hello! Anyone who has seen this in the past? I believe that the issue impacts the creation of __CruiseControlMetrics topic

Hi @dtakis. I'm facing the same issue in AWS MSK. Have you solved it? If so, could you please share the solution with us?

felipeavilis commented 8 months ago

> Not sure if this helps, but I recently faced a similar issue on a Docker-based install of CC. It turned out to be an issue with the metrics reporter not bootstrapping to the cluster; we forgot to enter the proper authentication configs in server.properties (we use Kerberos with SASL_SSL).
>
> These are the server.properties configs that we fixed / provided:
>
> cruise.control.metrics.reporter.ssl.truststore.location
> cruise.control.metrics.reporter.ssl.truststore.password
> cruise.control.metrics.reporter.bootstrap.servers
> cruise.control.metrics.reporter.sasl.mechanism
> cruise.control.metrics.reporter.sasl.kerberos.service.name
> cruise.control.metrics.reporter.sasl.jaas.config
>
> After setting the above to proper values and restarting the Kafka cluster nodes (I also restarted the CC server, but maybe that is not required), we noticed that the "User-Task-ID missing ... NotEnoughValidWindowsException" error disappeared after a while and data started showing up on that tab. So this error seems misleading, IMHO: the real cause is that there is not enough data in the CC topics on the Kafka cluster, which is produced by the metrics sampler client (cruise-control-metrics-reporter.jar). In our case the metrics reporter was not able to bootstrap properly and hence was not producing anything to the CC topics.
>
> In addition to the above, note the following:
>
>   • I'm using a RHEL 7 host for the Docker host
>   • I'm using a more recent version of CC: tag: 2.5.101
>   • upgraded systemd to version 234
>   • upgraded kernel to 5.x which has better cgroup capabilities
>   • upgraded Docker service on host
>
> I hope this helps.

Should these configs be placed in the MSK configuration?

dtakis commented 7 months ago

No @felipeavilis, I paused the debugging and never came back to continue :(