strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

Cruise control w/ default configuration won't generate a proposal. #7024

Closed benmathews closed 2 years ago

benmathews commented 2 years ago

Describe the bug

I've deployed a Kafka CR with the default cruiseControl configuration. Cruise Control starts, but doesn't generate proposals.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Kafka object
  2. Deploy KafkaRebalance object
  3. Describe KafkaRebalance object
  4. Observe that Status.Conditions.Type=PendingProposal
  5. Observe that the cruise control pod has errors
    operationLogger:742 - Task [e2feb1d4-e0f7-45df-8c20-c8f971948701] calculation fails, exception:
    java.util.concurrent.ExecutionException: Operation 'Rebalance' received exception. com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1657129787915] (index [1, -1]). Window index (current: 0, oldest: 0).

[report-06-07-2022_15-19-51.zip](https://github.com/strimzi/strimzi-kafka-operator/files/9058096/report-06-07-2022_15-19-51.zip)

Expected behavior

A KafkaRebalance object should have a rebalance proposal generated.
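For reference, the reproduction above boils down to two objects along these lines (a minimal sketch; the names, replica counts, and storage settings are illustrative, not the reporter's actual manifests):

```yaml
# Kafka CR with the default (empty) cruiseControl section,
# as described in the bug report.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: ephemeral
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  cruiseControl: {}     # default configuration
---
# KafkaRebalance with an empty spec requests the default rebalance goals.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
spec: {}
```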

Environment (please complete the following information):

YAML files and logs attached

benmathews commented 2 years ago

report-06-07-2022_15-19-51.zip

scholzj commented 2 years ago

Probably a question rather than a bug?

CC @tomncooper @ppatierno @kyguy

ppatierno commented 2 years ago

When Cruise Control starts, it has to gather enough metrics samples from the cluster in order to build an optimization proposal. If you ask for a rebalance right after the cluster is started, this behaviour is normal; it should go away after 5-10 minutes, and the proposal will become ready.
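One way to wait out that warm-up period is to poll the KafkaRebalance status until a ProposalReady condition shows up. A rough sketch (the resource name, jsonpath, and timeout are illustrative assumptions, not from this thread):

```shell
#!/bin/sh
# Poll until the condition type printed by the given command is
# "ProposalReady", or give up after ~10 minutes (60 x 10s).
wait_for_proposal() {
  # $1 = command that prints the current KafkaRebalance condition type
  i=0
  while [ "$i" -lt 60 ]; do
    state=$($1)
    if [ "$state" = "ProposalReady" ]; then
      echo ready
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo timeout
  return 1
}

# Real usage would look something like (assumes kubectl access and that
# the relevant condition is the first one in the list):
#   wait_for_proposal "kubectl get kafkarebalance my-rebalance \
#     -o jsonpath={.status.conditions[0].type}"
```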

benmathews commented 2 years ago

This issue persists far longer than the initial 5-10 minutes. I tore everything down and set it back up yesterday before writing this ticket and it has been running since then spitting out a steady stream of errors in the cruise control logs. The most recent one:

INFO  AbstractAsyncRequest:50 - Processing async request RebalanceRequest.
INFO  UserTaskManager:263 - Create a new UserTask 33f1c3ea-ce86-411f-8431-9e813043aad8 with SessionKey SessionKey{_httpSession=Session@6f7cd8f8{id=node0w4yqqa4lsj0g1ryks124bu6vh29013,x=node0w4yqqa4lsj0g1ryks124bu6vh29013.node0,req=1,res=true},_requestUrl=POST /kafkacruisecontrol/rebalance,_queryParams={dryrun=[true], skip_hard_goal_check=[false], rebalance_disk=[false], json=[true], verbose=[true]}}
WARN  OperationRunnable:50 - Received exception when trying to execute runnable for "Rebalance"
com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There is no window available in range [-1, 1657206259740] (index [1, -1]). Window index (current: 0, oldest: 0).
....
kyguy commented 2 years ago

Hey @benmathews, looking at the cruise control logs it appears that no metrics samples are being generated because the BROKER_CPU_UTIL metric is missing from the metrics topic:

2022-07-06 20:48:57 INFO  CruiseControlMetricsReporterSampler:116 - Finished sampling from topic strimzi.cruisecontrol.metrics for partitions [0] in time range [1657140477881,1657140537881]. Collected 1139 metrics.
2022-07-06 20:48:57 WARN  SamplingUtils:302 - Skip generating metric sample for broker 2 because the following required metrics are missing [BROKER_CPU_UTIL].
2022-07-06 20:48:57 WARN  SamplingUtils:302 - Skip generating metric sample for broker 1 because the following required metrics are missing [BROKER_CPU_UTIL].
2022-07-06 20:48:57 WARN  SamplingUtils:302 - Skip generating metric sample for broker 0 because the following required metrics are missing [BROKER_CPU_UTIL].
2022-07-06 20:48:57 WARN  SamplingUtils:302 - Skip generating metric sample for broker 3 because the following required metrics are missing [BROKER_CPU_UTIL].
2022-07-06 20:48:57 WARN  SamplingUtils:302 - Skip generating metric sample for broker 5 because the following required metrics are missing [BROKER_CPU_UTIL].
2022-07-06 20:48:57 WARN  SamplingUtils:302 - Skip generating metric sample for broker 4 because the following required metrics are missing [BROKER_CPU_UTIL].
2022-07-06 20:48:57 INFO  CruiseControlMetricsProcessor:150 - Generated 0(310 skipped by broker {0=48, 1=48, 2=55, 3=51, 4=56, 5=52}) partition metric samples and 0(6 skipped) broker metric samples for timestamp 1657140536904.

This could be a problem with how the Cruise Control metrics reporter is gathering/storing metrics. I'm not sure why this is happening, but I will take a deeper look and get back to you.

kyguy commented 2 years ago

Looking at the Kafka logs, it appears to be a problem with how the Cruise Control Metrics Reporter is gathering CPU metrics. [1] For some reason, the CPU usage as measured by the [OperatingSystemMXBean.getProcessCpuLoad()](https://docs.oracle.com/en/java/javase/11/docs/api/jdk.management/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad()) method [2] is not available and therefore returns a negative value, triggering this IOException:

2022-07-06 20:52:46,675 WARN Failed reporting CPU util. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter) [CruiseControlMetricsReporterRunner]
java.io.IOException: Java Virtual Machine recent CPU usage is not available.
    at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.MetricsUtils.getCpuMetric(MetricsUtils.java:413)
    at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.reportCpuUtils(CruiseControlMetricsReporter.java:449)
    at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.run(CruiseControlMetricsReporter.java:367)
    at java.base/java.lang.Thread.run(Thread.java:829)

Still trying to figure out how the CPU usage could not be available

[1] https://github.com/linkedin/cruise-control/blob/migrate_to_kafka_2_5/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/CruiseControlMetricsReporter.java#L446-L454 [2] https://github.com/linkedin/cruise-control/blob/migrate_to_kafka_2_5/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricsUtils.java#L405-L416
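The guard those linked lines perform can be sketched like this (a simplified reconstruction for illustration, not the upstream Cruise Control code; the exception message matches the log above):

```java
import com.sun.management.OperatingSystemMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class CpuUtilCheck {

    // getProcessCpuLoad() returns a negative value when the JVM cannot
    // determine recent CPU usage (e.g. it cannot read the container's
    // cgroup boundaries); the reporter turns that into an IOException.
    static double cpuUtil(double processCpuLoad) throws IOException {
        if (processCpuLoad < 0) {
            throw new IOException(
                "Java Virtual Machine recent CPU usage is not available.");
        }
        return processCpuLoad;
    }

    public static void main(String[] args) throws IOException {
        OperatingSystemMXBean os =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        double load = os.getProcessCpuLoad();
        if (load < 0) {
            System.out.println("recent CPU usage not available on this JVM");
        } else {
            System.out.println("processCpuLoad=" + cpuUtil(load));
        }
    }
}
```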

kyguy commented 2 years ago

@benmathews Could you tell me more about your environment? e.g. operating system, container engine, cgroups version, cgroups driver, etc. The issue may be related, and I want to double check a few things.

benmathews commented 2 years ago
$ uname -a
Linux se-kube-kafka03.vivintsky.com 5.13.0-44-generic #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ runc --version
runc version 1.0.3
commit: v1.0.3-0-gf46b6ba2
spec: 1.0.2-dev
go: go1.17.8
libseccomp: 2.5.1

$ containerd --version
containerd github.com/containerd/containerd v1.5.11 3df54a852345ae127d1fa3092b95168e4a88e2f8

$ grep cgroup /proc/filesystems
nodev   cgroup
nodev   cgroup2

Kubelet is configured with

--runtime-cgroups=/systemd/system.slice

And containerd has this set systemd_cgroup = true

$ systemd --version
systemd 245 (245.4-4ubuntu3.17)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
kyguy commented 2 years ago

Thanks @benmathews! It appears the nodes are using systemd for their cgroup driver; now we just need to find out whether systemd is configured to use cgroups v1 or cgroups v2.

What is the stdout when running the following command on one of the nodes?

[ $(stat -fc %T /sys/fs/cgroup/) = "cgroup2fs" ] && echo "unified" || ( [ -e /sys/fs/cgroup/unified/ ] && echo "hybrid" || echo "legacy")

OpenJDK 11 has only recently been patched to detect cgroups v2 boundaries. [1] So OpenJDK 11 versions before 11.0.16, like the one used by the Strimzi 0.29.0 Kafka image (11.0.15), have not been patched to detect cgroups v2 boundaries and can therefore fail when using methods like [OperatingSystemMXBean.getProcessCpuLoad()](https://docs.oracle.com/en/java/javase/11/docs/api/jdk.management/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad()).

[1] https://bugs.openjdk.org/browse/JDK-8230305
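The one-liner above classifies the hierarchy from the filesystem type of /sys/fs/cgroup plus the presence of a /sys/fs/cgroup/unified mount. The same decision logic, pulled into a function so it can be exercised without a real sysfs (an illustrative sketch, not part of this thread):

```shell
#!/bin/sh
# Classify the cgroup hierarchy mode from the filesystem type reported
# by `stat -fc %T /sys/fs/cgroup/` and whether /sys/fs/cgroup/unified
# exists. Pure string logic, so it is testable anywhere.
cgroup_mode() {
  fstype=$1        # e.g. "cgroup2fs" or "tmpfs"
  has_unified=$2   # "yes" if /sys/fs/cgroup/unified/ exists
  if [ "$fstype" = "cgroup2fs" ]; then
    echo unified   # cgroups v2 only
  elif [ "$has_unified" = "yes" ]; then
    echo hybrid    # v1 controllers plus a v2 "unified" mount
  else
    echo legacy    # cgroups v1 only
  fi
}

# On a real node you would call:
#   cgroup_mode "$(stat -fc %T /sys/fs/cgroup/)" \
#               "$([ -e /sys/fs/cgroup/unified/ ] && echo yes || echo no)"
```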

benmathews commented 2 years ago

hybrid

kyguy commented 2 years ago

Thanks @benmathews, so from what I understand, even if your nodes support a hybrid configuration, Kubernetes only supports using the same cgroup version to manage all of its controllers [1]:

> Even if the kernel supports a hybrid configuration where some controllers are managed by cgroup v1 and some others by cgroup v2, Kubernetes supports only the same cgroup version to manage all the controllers.

So we still need to find out whether the Kubernetes cluster is configured to use cgroups v1 or cgroups v2. I suspect that the cluster is using cgroups v2. Could you confirm which cgroups version your Kubernetes cluster is using?

Does the stdout of the following command on one of the nodes give any information?

grep unified_cgroup_hierarchy /etc/default/grub

Could you find what the following boot option is set to:

systemd.unified_cgroup_hierarchy

[1] https://kubernetes.io/docs/setup/production-environment/_print/#cgroup-v2

benmathews commented 2 years ago

@kyguy based on the link you provided, I had a coworker reboot all the kafka nodes with systemd.unified_cgroup_hierarchy=1 set. That resolved the problem. I can annotate a KafkaRebalance to refresh and approve a proposal. Then, CC executes the proposal properly.

The only lingering question before closing this ticket is whether something could be added to Strimzi to detect and warn of this problem.

kyguy commented 2 years ago

> based on the link you provided, I had a coworker reboot all the kafka nodes with systemd.unified_cgroup_hierarchy=1 set. That resolved the problem. I can annotate a KafkaRebalance to refresh and approve a proposal. Then, CC executes the proposal properly.

Glad it is working now @benmathews! So from what I understand, the solution was to configure the Kubernetes nodes to use the same cgroups version: either cgroups v1 by setting systemd.unified_cgroup_hierarchy=0, or cgroups v2 by setting systemd.unified_cgroup_hierarchy=1.
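On GRUB-based, Ubuntu-style nodes like the ones described above, that boot option is typically pinned via a fragment like this (illustrative only; paths and tooling vary by distro, and the config must be regenerated and the node rebooted for it to take effect):

```shell
# /etc/default/grub (fragment) -- forces all controllers onto
# cgroups v2 ("unified"); use =0 to force cgroups v1 instead.
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"

# Then regenerate the GRUB config and reboot:
#   sudo update-grub && sudo reboot
```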

> The only lingering question before closing this ticket is whether something could be added to Strimzi to detect and warn of this problem.

Unfortunately, environment issues like this are not something that Strimzi can easily detect. That being said, the cgroups v2 fixes coming with OpenJDK 11.0.16 [1] may have been able to mitigate this particular issue, but the cluster would still have been vulnerable to other cgroup issues.

[1] https://bugs.openjdk.org/browse/JDK-8230305

benmathews commented 2 years ago

@kyguy thanks a lot for walking me through diagnosing and resolving this.