I don't think 50m CPU will be enough. Can you try it without it?
If that does not help, you will need to find out why it is restarting ... the reason should normally be in the status, in the events, or in the Kubernetes logs.
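For example, something along these lines could surface the restart reason; the namespace and pod label below are assumptions based on the helm command and the watch loop later in this thread:
# Assumed namespace "kafka" and label "name=strimzi-cluster-operator"; adjust to your install.
kubectl -n kafka describe pod -l name=strimzi-cluster-operator        # Last State, Reason, Exit Code
kubectl -n kafka get events --sort-by=.lastTimestamp | grep -i operator
kubectl -n kafka logs deployment/strimzi-cluster-operator --previous  # logs of the previously crashed container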
I rather suspect the cgroup's memory limit kicks in and thus it periodically gets OOMKilled - at least that's the issue on my installation over here:
State:          Running
  Started:      Wed, 25 May 2022 14:48:13 +0200
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Wed, 25 May 2022 14:33:35 +0200
  Finished:     Wed, 25 May 2022 14:48:11 +0200
Ready:          True
Restart Count:  233
Limits:
  cpu:     1
  memory:  384Mi
Requests:
  cpu:     200m
  memory:  384Mi
However, @mpuch12 had Reason: Error rather than Reason: OOMKilled - despite Exit Code: 137, which usually hints at being OOMKilled.
I've live-watched the memory increasing using:
while [ 0 ]; do kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods | jq -r '.items[] | select(.metadata.labels.name == "strimzi-cluster-operator") | {timestamp, "name": .containers[0].name, "memory": .containers[0].usage.memory} | join("\t")'; sleep 10; done
and it really is slowly but constantly increasing; eventually it showed 390724Ki just before getting killed. The new instance (re-)starts at 142940Ki.
I've just "wildly" raised the memory limit in my deployment and will watch its behavior.
I am seeing exactly the same issue on Strimzi 0.29. The operator is restarting every few minutes. After bumping memory to 1Gi and looking at the memory graph, it looks like there is some memory leak: the memory is slowly increasing and after 30-45 minutes the pod restarts.
@doriath So do you see any OOM issues or not?
You are mixing possibly unrelated issues here, so it is quite hard to separate them. If you see OOM issues, keep in mind that increasing the memory on its own is not always the solution: if you give it more memory, it uses more memory. You might need to tune the JVM settings instead.
I see exactly the same symptoms as @mpuch12.
Environment:
The only customization in values.yaml in the helm chart is watchAnyNamespace: true. I have 1 Kafka CR and 2 KafkaTopic CRs.
The Strimzi operator restarts every 5-10 minutes, and the pod exits with error code 137. The pod restarts the moment the container hits the memory limit.
How can I check the JVM metrics in the operator? Could you recommend JVM settings I can try tweaking?
JVM metrics are part of the operator's Prometheus metrics, so you can just scrape them. You can use the JAVA_OPTS environment variable to pass any Java options, so you can configure your own -Xmx, -XX:MaxRAMPercentage, etc.
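For instance, a rough sketch of both, assuming the operator deployment is named strimzi-cluster-operator in the kafka namespace and serves its metrics on port 8080 (verify the port for your version):
# Pass a fixed heap via JAVA_OPTS; note that a manual change like this may be reverted by the next helm upgrade.
kubectl -n kafka set env deployment/strimzi-cluster-operator JAVA_OPTS="-Xmx256m"
# Scrape the Prometheus endpoint and look at the JVM memory metrics.
kubectl -n kafka port-forward deployment/strimzi-cluster-operator 8080:8080 &
curl -s http://localhost:8080/metrics | grep -i '^jvm_memory'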
It is weird. My installation has a Kafka CR + Connect CR + many KafkaConnector resources but does not restart. So I wonder what is different in your case :-/. I have no idea what Talos is - is that some cloud provider or something? My long-running cluster runs on bare metal.
Thank you for the tips. First, a little more information:
I am using Talos OS v0.14.3 (a minimal OS for Kubernetes), which uses containerd v1.5.10. I run on a VPS with 16 GiB of RAM. I found some issues suggesting that Java sometimes does not detect the container RAM correctly under containerd, so maybe the particular combination of the Java version used by the operator and this containerd version has an issue.
I set -Xmx256m and increased the limits in Kubernetes to 512Mi, and the operator did not restart even once in the last 24h; memory usage is now at 483Mi. I will now try to set -Xmx79m (20% of 394) and see how the operator behaves.
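For reference, a sketch of applying that sizing through the helm chart; the resources.* value paths here mirror the resources.requests.cpu flag used elsewhere in this thread, and whether JAVA_OPTS can be set via chart values depends on the chart version, so it is applied to the deployment directly:
helm upgrade kafka-operator strimzi/strimzi-kafka-operator --namespace kafka --version 0.29.0 \
  --install --reuse-values \
  --set resources.requests.memory=512Mi --set resources.limits.memory=512Mi
kubectl -n kafka set env deployment/strimzi-cluster-operator JAVA_OPTS="-Xmx256m"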
Weird. I'm not sure I'm aware of any issues with Java 11 and detecting the container resources. My cluster uses containerd as well, but I'm traveling so cannot check the exact version. Please keep me posted if your config changes helped.
PS: Just to double-check - you are running on AMD64 and not on Arm64 or s390x, right?
I am using cgroup v2 and seeing the same issue on Strimzi 0.29. I found that an OOM kill occurred and the operator restarted. I searched for the reason and found this article: it seems JDK 11 doesn't support cgroup v2, so *RAMPercentage is computed from the host memory, not the container memory.
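Two quick checks that could confirm this, run inside the operator pod or on the node if the tools are available:
# 'cgroup2fs' means the unified cgroup v2 hierarchy; 'tmpfs' means cgroup v1.
stat -fc %T /sys/fs/cgroup/
# Show the heap the JVM derives from MaxRAMPercentage; without cgroup v2 support
# it is computed from the host memory instead of the container limit.
java -XX:+PrintFlagsFinal -XX:MaxRAMPercentage=20 -version | grep -i maxheapsize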
I guess you can then pass -Xmx as an option to it using the JAVA_OPTS environment variable.
Had the same issue as @mpuch12 and used the fix from above - Update launch_java.sh - works like a charm!
Thx a lot joseacl
You're welcome! FYI, I've updated the PR to do the same with less logic after applying suggestions from @scholzj; here is the last version of launch_java.sh.
I am also affected. To apply the mentioned fixes in launch_java.sh, I guess I need to rebuild the strimzi-operator image with the patched launch_java.sh, right? Will this fix be shipped soon in a release, so I can skip this workaround step?
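In case it helps, a rough sketch of rebuilding the image with the patched script until a release ships the fix; the base image tag, the script path inside the image, and the example registry are assumptions to verify first:
# Build a patched image on top of the stock operator image (paths/tags are assumptions).
cat > Dockerfile <<'EOF'
FROM quay.io/strimzi/operator:0.29.0
COPY launch_java.sh /opt/strimzi/bin/launch_java.sh
EOF
chmod +x launch_java.sh
docker build -t registry.example.com/strimzi-operator:0.29.0-patched .
docker push registry.example.com/strimzi-operator:0.29.0-patched
# Then point the deployment (or the chart's image.* values) at the patched image.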
I have deployed the Strimzi operator with the same image tag and it is working fine on my side. Can you please verify the resource bindings on your side?
This should be (hopefully) fixed in the 0.30.0 release where the support for CGroups v2 was added.
Describe the bug: After deployment, the cluster operator restarts periodically (typically after 15-20 minutes), without any errors in the logs.
Environment:
YAML files and logs
Deploy using:
helm upgrade kafka-operator strimzi/strimzi-kafka-operator --namespace kafka --version 0.29.0 --install --create-namespace --wait --timeout 300s --set resources.requests.cpu=50m
Operator last logs: