Closed ramanenka closed 2 years ago
There is a document with some proposed changes to CRDs: https://docs.google.com/document/d/1WoFcwN5bZfglX1BGZnch2RVpa-VBOsKkzpgPho6x1SQ/edit# ... I think the plan is to discuss it on Thursday next week in the Strimzi Community Meeting (https://docs.google.com/document/d/1V1lMeMwn6d2x1LKxyydhjo2c_IFANveelLD880A6bYc/edit). Maybe this is something that should be taken into account for these changes, since it is probably not easy to change the current implementation without affecting existing users.
CC @tombentley
Any updates on this?
No, we have not made any changes to this as part of the CRD changes. I do not think anyone joined the community call to discuss this particular issue as suggested here at that time.
Ok. In the meantime, just to confirm: there is no way for us to adjust the `DYNAMIC_HEAP_FRACTION`?
I think you should start by explaining your problem with the current settings. If you go through the code, for KafkaConnect, it IMHO uses the values from `-XX:+UseContainerSupport`. From the code it also looks like you can configure your own custom `KAFKA_HEAP_OPTS` environment variable which overrides this.
@scholzj,
Sorry, you're right. The issue we ran into was that we found our KafkaConnect clusters were getting OOMed quite regularly (several times an hour). When we dug into it, we found that the `-Xmx` and `-Xms` settings used inside the containers were set to ~100.2% of the actual available pod memory. No matter what we did to the pod memory, this happened. After a while we tracked it down to the fact that we were not setting our own `Xmx` and `Xms` settings, which triggered the `dynamic_resources.sh` code to take over:
$ k describe pod kafka-connect-s3-sink-connect-9f499898f-p45qc
Name: kafka-connect-s3-sink-connect-9f499898f-p45qc
Namespace: xxx
Priority: 1000000
Priority Class Name: default
...
Limits:
cpu: 1254m
memory: 9148093649
Requests:
cpu: 627m
memory: 9148093649
Now here is the JVM auto-determined memory setting:
[kafka@kafka-connect-s3-sink-connect-9f499898f-p45qc kafka]$ get_heap_size 1.0
9148280340
$ ps -ef | grep Xmx
kafka 66 1 29 13:48 ? 00:06:44 java -Xms9148280340 -Xmx9148280340 -server -XX:+UseG1GC ....
`9148280340` is greater than `9148093649`... which means we're telling the JVM to use more than 100% of the actual available memory.
[kafka@kafka-connect-s3-sink-connect-9f499898f-p45qc kafka]$ cat dynamic_resources.sh
set -e

function get_heap_size {
  FRACTION=$1
  MAX=$2
  # Get the max heap used by a jvm which used all the ram available to the container
  MAX_POSSIBLE_HEAP=$(java -XX:+UnlockExperimentalVMOptions -XX:+UseContainerSupport -XX:MaxRAMFraction=1 -XshowSettings:vm -version \
    |& awk '/Max\. Heap Size \(Estimated\): [0-9KMG]+/{ print $5}' \
    | gawk -f to_bytes.gawk)
  ACTUAL_MAX_HEAP=$(echo "${MAX_POSSIBLE_HEAP} ${FRACTION}" | awk '{ printf "%d", $1 * $2 }')
  if [ "${MAX}" ]; then
    MAX=$(echo "${MAX}" | gawk -f to_bytes.gawk)
    if [ "${MAX}" -lt "${ACTUAL_MAX_HEAP}" ]; then
      ACTUAL_MAX_HEAP=$MAX
    fi
  fi
  echo "$ACTUAL_MAX_HEAP"
}
[kafka@kafka-connect-s3-sink-connect-9f499898f-p45qc kafka]$ cat to_bytes.gawk
BEGIN {
  suffixes[""]=1
  suffixes["K"]=1024
  suffixes["M"]=1024**2
  suffixes["G"]=1024**3
}
match($0, /([0-9.]*)([kKmMgG]?)/, a) {
  printf("%d", a[1] * suffixes[toupper(a[2])])
}
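The over-shoot can be reproduced without a JVM at all. This is a hypothetical illustration using the numbers from the pod above: the JVM rounds the estimate to two decimal places when printing `8.52G`, and converting that rounded string back to bytes lands above the real limit.

```shell
# Hypothetical illustration of the rounding bug, using the numbers from
# the pod above. LIMIT is the container memory limit in bytes.
LIMIT=9148093649

# The true limit expressed in GiB is just below 8.52 ...
awk -v l="$LIMIT" 'BEGIN { printf "%.4fG\n", l / 1024^3 }'   # 8.5198G

# ... but the JVM prints it rounded up to "8.52G", and converting that
# rounded string back to bytes overshoots the real limit:
awk 'BEGIN { printf "%d\n", 8.52 * 1024^3 }'                 # 9148280340
```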
$ java -XX:+UnlockExperimentalVMOptions -XX:+UseContainerSupport -XX:MaxRAMFraction=1 -XshowSettings:vm -version
OpenJDK 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release.
VM settings:
Max. Heap Size (Estimated): 8.52G
Using VM: OpenJDK 64-Bit Server VM
openjdk version "11.0.9.1" 2020-11-04 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.9.1+1-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.9.1+1-LTS, mixed mode, sharing)
Overall I totally support the idea of dynamically allocating the Java settings based on the container size... but setting a JVM to use near 100% of the available container memory seems like a recipe for problems in my experience.
But that is what I assume Java gave us?
CC @tombentley
I guess we should always use min(limit, `get_heap_size 1.0`), because the excuse "this is what Java gave us" will cut no ice with the OOMKiller.
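A minimal sketch of that guard, assuming both values are available as byte counts (the function name and arguments here are made up for illustration):

```shell
# Clamp the estimated heap to the container limit so we never hand the
# JVM more memory than the cgroup allows. Both arguments are byte counts.
min_heap() {
  limit=$1
  estimated=$2
  if [ "$estimated" -lt "$limit" ]; then
    echo "$estimated"
  else
    echo "$limit"
  fi
}

# With the numbers from the pod above, the clamp kicks in:
min_heap 9148093649 9148280340   # prints 9148093649
```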
We need to deal with `OpenJDK 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release.` as well.
@tombentley So, in my experience with Java and containers, you almost never want your max heap size to be anywhere near your container size. Just as an example, we set our `Xmx` to 2GB, and we're using an average of 2.4GB in our 4GB containers. I think the `DYNAMIC_HEAP_FRACTION` setting is great because it can let users use the VerticalPodAutoscaler if they want... but I would recommend a more conservative setting of something like `0.85`, and allow users to set it on their own. I tried setting it in the `env` key of the CR, but it was overridden and set to `1.0` anyway.
@diranged well you never want your max heap size to be your limit anyway, since the JVM can use plenty of other memory apart from its heap. That was the idea behind the `MAX_POSSIBLE_HEAP` calculation (see above): we ask the JVM how big the heap would be if we were to use all the RAM available to the container (`-XX:MaxRAMFraction=1`), then scale that by `FRACTION` to arrive at the actual heap size to request. For some containers we use `FRACTION=1`, but that should still result in an `-Xmx` that's sufficiently less than the container memory limit, assuming `java` is giving accurate numbers. That was the theory anyway, but I guess it was optimistic to assume `Max. Heap Size (Estimated)` should mean "Max. Heap Size (Guaranteed)", which is, I guess, where this bug arises.
I'd be interested to know if you can spot a hole in the theory that it ought to be OK to dedicate all the container memory to `java`, assuming you had the option to actually control the JVM memory accurately.
Practically, I'm not sure whether a more conservative fraction is better than using the `min` I suggested above but with a more conservative value than `container.limit`. I guess it depends on whether `java`'s error in the reported `Max. Heap Size (Estimated)` scales with heap size.
I'd be interested to know if you can spot a hole in the theory that it ought to be OK to dedicate all the container memory to java, assuming you had the option to actually control the JVM memory accurately.
So, if there is a way to ensure that Java never uses more than the available container memory, then that's fine... I certainly agree that we should maximize the available memory for the application and waste as little as possible. I just haven't seen a way to accurately determine how much memory the JVM will use outside of the main heap. Maybe that is something that can be calculated?
About `OpenJDK 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release.`: we need to use `MaxRAMPercentage` instead, apparently. `MaxRAMFraction=n` meant we were using 1/n of the available RAM, so it was a blunt instrument for the sort of thing we're doing here.
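To make the "blunt instrument" point concrete: as an integer divisor, `MaxRAMFraction` can only express 100%, 50%, 33.3%, 25%, ... of RAM, whereas `MaxRAMPercentage` accepts an arbitrary (even fractional) percentage such as 85.

```shell
# The only heap sizes MaxRAMFraction can express are 1/n of RAM; there is
# no integer n that yields something like 85%.
for n in 1 2 3 4; do
  awk -v n="$n" 'BEGIN { printf "MaxRAMFraction=%d -> %.1f%% of RAM\n", n, 100 / n }'
done
```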
Maybe that is something that can be calculated?
When trying to use all of the RAM we're treading a tightrope with the kernel's OOMKiller on one side and the JVM's `OutOfMemoryError` on the other. I guess we could now switch to something like `MaxRAMPercentage=95`, or even 99, given that the observed `get_heap_size 1.0` value of `9148280340` was an over-estimate of only ~0.002%. Wdyt @diranged?
Using `99%` works if you can figure out the "other RAM" that Java needs to do its business. Like I said, in our example case here we're seeing ~20% on top of the `Xmx` setting being used for "other stuff in Java". Whatever "better default" you pick, I'd like to make sure that the operator can easily tune that number to their own workloads.
Well, in theory that's what the `MaxRAMPercentage` and looking at the max heap estimate should account for. Ultimately people can always set `Xmx` explicitly anyway, though I'm sure some people assume that max heap == max RSS and then get OOMKilled.
Can you point me to the documentation where you are reading up on `MaxRAMPercentage`? From what I am reading, `MaxRAMPercentage` is just a different way of expressing how much total memory to use out of the available cgroup memory for the container. I am not seeing anything that talks about taking additional Java overhead into account.
E.g., here are our live running containers. Notice their memory usage is ~300MB above the 2GB mark?
kafka-connect-s3-sink-connect-74ffbf8d5d-2p6nw kafka-connect-s3-sink-connect 762m 2373Mi
kafka-connect-s3-sink-connect-74ffbf8d5d-8k7gk kafka-connect-s3-sink-connect 664m 2368Mi
kafka-connect-s3-sink-connect-74ffbf8d5d-psppp kafka-connect-s3-sink-connect 737m 2381Mi
Now let's look at the config of one container:
$ k describe pod kafka-connect-s3-sink-connect-74ffbf8d5d-psppp
Name: kafka-connect-s3-sink-connect-74ffbf8d5d-psppp
Namespace: lightstream
Priority: 1000000
...
Containers:
kafka-connect-s3-sink-connect:
Container ID: containerd://2feced257e007b6bf9f4b2705a54cffc3b16e232d7947808b4a86dca6da584d6
...
Limits:
cpu: 2
memory: 4Gi
Requests:
cpu: 1
memory: 4Gi
...
Environment:
KAFKA_HEAP_OPTS: -Xms2048m -Xmx2048m
So we have a heap hard-coded at 2GB, yet we're using 2.3GB of memory in the container. When we hop into that container and test the `java ...` commands:
$ k exec -ti kafka-connect-s3-sink-connect-74ffbf8d5d-psppp -- bash
[kafka@kafka-connect-s3-sink-connect-74ffbf8d5d-psppp kafka]$ java -XX:+PrintFlagsFinal -version | grep -Ei "maxheapsize|maxram"
size_t MaxHeapSize = 1073741824 {product} {ergonomic}
uint64_t MaxRAM = 137438953472 {pd product} {default}
uintx MaxRAMFraction = 4 {product} {default}
double MaxRAMPercentage = 25.000000 {product} {default}
You can see the default behavior of `MaxRAMPercentage=25` gives us a `MaxHeapSize` of 1GB... which is exactly 25% of the total container size. If you set `MaxRAMFraction=1` then we end up with a `MaxHeapSize` of 4GB:
[kafka@kafka-connect-s3-sink-connect-74ffbf8d5d-psppp kafka]$ java -XX:+PrintFlagsFinal -XX:MaxRAMFraction=1 -version | grep -Ei "maxheapsize|maxram"
OpenJDK 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release.
size_t MaxHeapSize = 4294967296 {product} {ergonomic}
uint64_t MaxRAM = 137438953472 {pd product} {default}
uintx MaxRAMFraction = 1 {product} {command line}
double MaxRAMPercentage = 100.000000 {product} {default}
Based on what I am seeing here, these numbers include no headroom to help you avoid an OOM situation if you set your max heap size to exactly the same amount of memory as your container's cgroup limit.
PID TID CLS RTPRIO STAT VSZ RSS COMMAND
1 1 TS - Ss 4376 740 tini
43 43 TS - Sl 9535996 2452096 java
592 592 TS - Ss 11848 2996 bash
663 663 TS - R+ 49628 3064 ps
[kafka@kafka-connect-s3-sink-connect-74ffbf8d5d-psppp kafka]$
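Checking the arithmetic against the `PrintFlagsFinal` dumps above confirms that: `MaxHeapSize` is an exact fraction of the 4Gi limit, with nothing subtracted for overhead.

```shell
# The ergonomics just scale the cgroup limit; nothing is reserved for
# metaspace, thread stacks, direct buffers, etc.
LIMIT=$((4 * 1024 * 1024 * 1024))   # 4294967296, the pod's 4Gi limit
echo $((LIMIT * 25 / 100))          # 1073741824 = MaxHeapSize at the 25% default
echo $((LIMIT * 100 / 100))         # 4294967296 = MaxHeapSize at MaxRAMFraction=1
```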
Side Note: Using the raw byte numbers from the java commands I pasted above seems better and less error-prone than trying to convert "human readable" memory sizes (`4.1GB`) into bytes. Rounding errors are the reason we were actually seeing the heap size get set to GREATER than the total container size, rather than at least equaling it.
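A sketch of that approach: parse the exact byte value out of `-XX:+PrintFlagsFinal` output instead of converting the rounded human-readable string (the `SAMPLE` variable here stands in for real `java` output).

```shell
# Parse MaxHeapSize directly in bytes: no unit-suffix conversion, so no
# rounding error. SAMPLE mimics one line of `java -XX:+PrintFlagsFinal -version`.
SAMPLE='   size_t MaxHeapSize = 4294967296   {product} {ergonomic}'
echo "$SAMPLE" | awk '/MaxHeapSize/ { print $4 }'   # prints 4294967296
```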
Can you point me to the documentation where you are reading up on MaxRAMPercentage
I can't. This code was originally written around the time that `-XX:+UseCGroupMemoryLimitForHeap` was working its way into Java 8, and it was all very poorly documented, so it was based on various blog posts etc. Having done a bit of searching, it seems that the "Java 8u131 and Java 9" section of https://merikan.com/2019/04/jvm-in-a-container/ represents the thinking at the time I implemented this. At some point we switched to `-XX:+UseContainerSupport`, and perhaps this explains the difference in behaviour, so the old logic is now flawed, as you observe.
Given that the JVM doesn't really let us do this any more, I guess the simplest thing to do would be something like `Xmx = 85% of the cgroup limit` as you suggest.
Given that the JVM doesn't really let us do this any more, I guess the simplest thing to do would be something like `Xmx = 85% of the cgroup limit` as you suggest.
:+1: .. but please make the setting customizable...
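A rough sketch of what that could look like, with the fraction kept configurable as requested (variable names here are hypothetical; the cgroup path shown is for cgroup v1, while cgroup v2 uses /sys/fs/cgroup/memory.max):

```shell
# Derive -Xmx as a configurable fraction of the cgroup memory limit.
HEAP_FRACTION=0.85   # could come from an env var such as DYNAMIC_HEAP_FRACTION
LIMIT=9148093649     # stand-in value; in a container this would be read with:
                     #   LIMIT=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
XMX=$(awk -v l="$LIMIT" -v f="$HEAP_FRACTION" 'BEGIN { printf "%d", l * f }')
echo "-Xmx$XMX"      # prints -Xmx7775879601
```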
Triaged on 12.4.2022: We now use Java 11 which has the container support enabled by default. The default percentage of used memory was changed to 75% in #6602 - this can be closed now.
Is your feature request related to a problem? Please describe.
Java 10 introduced `-XX:+UseContainerSupport` (enabled by default), which makes the JVM use sane defaults in a container environment. This feature has been backported to Java 8 since 8u191. These defaults can be further adjusted with the `-XX:{Min|Max}RAMPercentage` flags. For this to work, the `-Xms` and `-Xmx` flags should not be specified, otherwise `-XX:{Min|Max}RAMPercentage` is not taken into account.

In our case we'd like to specify the memory limit once, on the KafkaConnect custom resource level, and have the JVM calculate the heap size according to the `-XX:{Min|Max}RAMPercentage` flags. This is currently impossible because even if `-Xms` and `-Xmx` are omitted on the KafkaConnect custom resource, these flags are set anyway upon container start-up. This only happens when the `DYNAMIC_HEAP_FRACTION` env variable is set, but that variable is always set when `-Xmx` is not provided, and it looks like it's impossible to customize this behavior.

Describe the solution you'd like
Provide a way to unset the `DYNAMIC_HEAP_FRACTION` env variable when `-Xmx` is not provided. This would make `kafka_connect_run.sh` skip the dynamic `-Xms`/`-Xmx` calculation and allow the JVM to calculate the heap size according to `-XX:{Min|Max}RAMPercentage`, allowing us to specify the memory limit in a single place.

Describe alternatives you've considered
Currently we have to specify memory limits on both levels: the KafkaConnect custom resource level and the JVM level.
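For illustration, the requested end state might look something like this hypothetical KafkaConnect resource: the memory limit is stated once, and the heap is derived from it via `-XX:MaxRAMPercentage` rather than an explicit `-Xmx` (field names follow the Strimzi `jvmOptions` schema; treat the exact values as placeholders):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect          # hypothetical name
spec:
  resources:
    limits:
      memory: 4Gi           # the single place the limit is specified
  jvmOptions:
    # no -Xms/-Xmx here, so the JVM's own ergonomics apply
    "-XX":
      MaxRAMPercentage: "75"
```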