strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Enhancement] Support -XX:+UseContainerSupport for KafkaConnect pods #4094

Closed ramanenka closed 2 years ago

ramanenka commented 3 years ago

Is your feature request related to a problem? Please describe. Java 10 introduced -XX:+UseContainerSupport (enabled by default), which makes the JVM use sane defaults in a container environment; the feature was backported to Java 8 in 8u191. These defaults can be further adjusted with the -XX:{Min|Max}RAMPercentage flags. For this to work, the -Xms and -Xmx flags must not be specified; otherwise -XX:{Min|Max}RAMPercentage is not taken into account.
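
For illustration, the precedence between these flags can be checked on any Java 11 (or 8u191+) image; the 50% and 256m values below are just example numbers:

# MaxRAMPercentage is honoured when no -Xmx is given:
$ java -XX:MaxRAMPercentage=50 -XshowSettings:vm -version
#   -> Max. Heap Size (Estimated) is roughly 50% of the container memory limit

# as soon as -Xmx is set, the percentage flags are ignored:
$ java -XX:MaxRAMPercentage=50 -Xmx256m -XshowSettings:vm -version
#   -> Max. Heap Size: 256.00M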

In our case we'd like to specify the memory limit once, at the KafkaConnect custom resource level, and have the JVM calculate the heap size according to the -XX:{Min|Max}RAMPercentage flags. This is currently impossible: even if -Xms and -Xmx are omitted from the KafkaConnect custom resource, these flags are set anyway when the container starts up. That only happens when the DYNAMIC_HEAP_FRACTION env variable is set, but it is always set when -Xmx is not provided, and it looks like it's impossible to customize this behavior.

Describe the solution you'd like Provide a way to unset the DYNAMIC_HEAP_FRACTION env variable when -Xmx is not provided. This would make kafka_connect_run.sh skip the dynamic -Xms/-Xmx calculation and let the JVM calculate the heap size according to -XX:{Min|Max}RAMPercentage, allowing us to specify the memory limit in a single place.

Describe alternatives you've considered Currently we have to specify memory limits at both levels: the KafkaConnect custom resource level and the JVM level.

scholzj commented 3 years ago

There is a document with some proposed changes to CRDs: https://docs.google.com/document/d/1WoFcwN5bZfglX1BGZnch2RVpa-VBOsKkzpgPho6x1SQ/edit# ... I think the plan is to discuss it on Thursday next week in the Strimzi Community Meeting (https://docs.google.com/document/d/1V1lMeMwn6d2x1LKxyydhjo2c_IFANveelLD880A6bYc/edit). Maybe this is something that should be taken into account for those changes, since it is probably not easy to change the current implementation without affecting existing users.

CC @tombentley

diranged commented 3 years ago

Any updates on this?

scholzj commented 3 years ago

No, we have not made any changes to this as part of the CRD changes. I do not think anyone joined the community call to discuss this particular issue, as was suggested here at the time.

diranged commented 3 years ago

Ok. In the meantime, just to confirm: there is no way for us to adjust the DYNAMIC_HEAP_FRACTION?

scholzj commented 3 years ago

I think you should start by explaining your problem with the current settings. If you go through the code for KafkaConnect, IMHO it uses the values from -XX:+UseContainerSupport. From the code it also looks like you can configure your own custom KAFKA_HEAP_OPTS environment variable, which overrides this.
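
One way to see what a running pod actually resolved is to check the VM settings and the heap-related environment inside it (the pod name is just a placeholder):

$ kubectl exec -ti <connect-pod> -- java -XshowSettings:vm -version
$ kubectl exec -ti <connect-pod> -- env | grep KAFKA_HEAP_OPTS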

diranged commented 3 years ago

@scholzj, Sorry, you're right. The issue we ran into was that our KafkaConnect clusters were getting OOMed quite regularly (several times an hour). When we dug into it, we found that the -Xmx and -Xms settings used inside the containers were set to ~100.2% of the actual available pod memory. No matter what we did to the pod memory, this happened. After a while we tracked it down to the fact that we were not setting our own Xmx and Xms settings, which triggered the dynamic_resources.sh code to take over:

$ k describe pod kafka-connect-s3-sink-connect-9f499898f-p45qc
Name:                 kafka-connect-s3-sink-connect-9f499898f-p45qc
Namespace:            xxx
Priority:             1000000
Priority Class Name:  default
...
    Limits:
      cpu:     1254m
      memory:  9148093649
    Requests:
      cpu:      627m
      memory:   9148093649

Now here is the JVM auto-determined memory setting:

[kafka@kafka-connect-s3-sink-connect-9f499898f-p45qc kafka]$ get_heap_size  1.0
9148280340
$  ps -ef | grep Xmx
kafka         66       1 29 13:48 ?        00:06:44 java -Xms9148280340 -Xmx9148280340 -server -XX:+UseG1GC ....

9148280340 is greater than 9148093649... which means we're telling the JVM to use more than 100% of the actual available memory.

[kafka@kafka-connect-s3-sink-connect-9f499898f-p45qc kafka]$ cat dynamic_resources.sh
set -e

function get_heap_size {
  FRACTION=$1
  MAX=$2
  # Get the max heap used by a jvm which used all the ram available to the container
  MAX_POSSIBLE_HEAP=$(java -XX:+UnlockExperimentalVMOptions -XX:+UseContainerSupport -XX:MaxRAMFraction=1 -XshowSettings:vm -version \
    |& awk '/Max\. Heap Size \(Estimated\): [0-9KMG]+/{ print $5}' \
    | gawk -f to_bytes.gawk)

  ACTUAL_MAX_HEAP=$(echo "${MAX_POSSIBLE_HEAP} ${FRACTION}" | awk '{ printf "%d", $1 * $2 }')

  if [ "${MAX}" ]; then
    MAX=$(echo "${MAX}" | gawk -f to_bytes.gawk)
    if [ "${MAX}" -lt "${ACTUAL_MAX_HEAP}" ]; then
      ACTUAL_MAX_HEAP=$MAX
    fi
  fi
  echo "$ACTUAL_MAX_HEAP"
}
[kafka@kafka-connect-s3-sink-connect-9f499898f-p45qc kafka]$ cat to_bytes.gawk
BEGIN {
  suffixes[""]=1
  suffixes["K"]=1024
  suffixes["M"]=1024**2
  suffixes["G"]=1024**3
}

match($0, /([0-9.]*)([kKmMgG]?)/, a) {
  printf("%d", a[1] * suffixes[toupper(a[2])])
}
$ java -XX:+UnlockExperimentalVMOptions -XX:+UseContainerSupport -XX:MaxRAMFraction=1 -XshowSettings:vm -version
OpenJDK 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release.
VM settings:
    Max. Heap Size (Estimated): 8.52G
    Using VM: OpenJDK 64-Bit Server VM

openjdk version "11.0.9.1" 2020-11-04 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.9.1+1-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.9.1+1-LTS, mixed mode, sharing)

Overall I totally support the idea of dynamically allocating the Java settings based on the container size... but setting a JVM to use anywhere near 100% of the available container memory seems like a recipe for problems, in my experience.

scholzj commented 3 years ago

But that is what I assume Java gave us?

CC @tombentley

tombentley commented 3 years ago

I guess we should always use min(limit, `get_heap_size 1.0`), because the excuse "this is what Java gave us" will cut no ice with the OOMKiller.

We also need to deal with the warning "OpenJDK 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release."
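
A minimal sketch of that clamping, reusing get_heap_size from dynamic_resources.sh and assuming the limit can be read from the cgroup v1 filesystem:

# container memory limit in bytes (cgroup v1 path; cgroup v2 uses /sys/fs/cgroup/memory.max)
CONTAINER_LIMIT=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
HEAP=$(get_heap_size 1.0)
# never ask for more heap than the cgroup allows
if [ "${HEAP}" -gt "${CONTAINER_LIMIT}" ]; then
  HEAP=${CONTAINER_LIMIT}
fi
export KAFKA_HEAP_OPTS="-Xms${HEAP} -Xmx${HEAP}"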

diranged commented 3 years ago

@tombentley So - in my experience with Java and containers, you almost never want your max heap size to be anywhere near your container size. Just as an example, we set our Xmx to 2GB, and we're using an average of 2.4GB in our 4GB containers. I think the DYNAMIC_HEAP_FRACTION setting is great because it lets users use the VerticalPodAutoscaler if they want... but I would recommend a more conservative setting of something like 0.85, and allow users to set it on their own. I tried setting it in the env key of the CR, but it was overridden and set to 1.0 anyway.

tombentley commented 3 years ago

@diranged well, you never want your max heap size to be your limit anyway, since the JVM can use plenty of other memory apart from its heap. That was the idea behind the MAX_POSSIBLE_HEAP calculation (see above): we ask the JVM how big the heap would be if we were to use all the RAM available to the container (-XX:MaxRAMFraction=1), then scale that by FRACTION to arrive at the actual heap size to request. For some containers we use FRACTION=1, but that should still result in an -Xmx that is sufficiently less than the container memory limit, assuming java is giving accurate numbers. That was the theory anyway, but it was optimistic to assume Max. Heap Size (Estimated) means "Max. Heap Size (Guaranteed)", which is, I guess, where this bug arises.

I'd be interested to know if you can spot a hole in the theory that it ought to be OK to dedicate all the container memory to Java, assuming you had the option to actually control the JVM memory accurately.

Practically, I'm not sure whether a more conservative fraction is better than using the min I suggested above but with a more conservative value than container.limit. I guess it depends on whether Java's error in the reported Max. Heap Size (Estimated) scales with the heap size.

diranged commented 3 years ago

I'd be interested to know if you can spot a hole in the theory that it ought to be OK to dedicate all the container memory to Java, assuming you had the option to actually control the JVM memory accurately.

So - if there is a way to ensure that Java never uses more than the available container memory, then that's fine... I certainly agree that we should maximize the available memory for the application and waste as little as possible. I just haven't seen a situation where you can accurately determine the total amount of memory the JVM will use outside of the main application heap. Maybe that is something that can be calculated?

tombentley commented 3 years ago

About the warning "OpenJDK 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release." - we need to use MaxRAMPercentage instead, apparently. MaxRAMFraction=n meant using 1/nth of the available RAM, so it was a blunt instrument for the sort of thing we're doing here.
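
As a rough illustration of the difference (the values are only examples):

# MaxRAMFraction only takes whole numbers: 1 -> 100%, 2 -> 50%, 4 -> 25% of the available RAM
$ java -XX:MaxRAMFraction=2 -XshowSettings:vm -version 2>&1 | grep 'Max. Heap'
# MaxRAMPercentage accepts any percentage, e.g. 95% of the container memory
$ java -XX:MaxRAMPercentage=95 -XshowSettings:vm -version 2>&1 | grep 'Max. Heap'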

Maybe that is something that can be calculated?

When trying to use all of the RAM we're treading a tightrope with the kernel's OOMKiller on the one side and the JVM's OutOfMemoryError on the other. I guess we could now switch to something like MaxRAMPercentage=95, or even 99, given that the observed get_heap_size 1.0 = 9148280340 was an over-estimate of only about 0.002%. Wdyt @diranged?

diranged commented 3 years ago

Using 99% works if you can figure out the "other RAM" that Java needs to do its business. Like I said, in our example case here we're seeing ~20% on top of the Xmx setting being used for "other stuff in Java". Whatever "better default" you pick, I'd like to make sure that the operator can easily tune that number to their own workloads.

tombentley commented 3 years ago

Well, in theory that's what MaxRAMPercentage and looking at the max heap estimate should account for. Ultimately people can always set Xmx explicitly anyway, though I'm sure some people assume that max heap == max RSS and then get OOMKilled.

diranged commented 3 years ago

Can you point me to the documentation where you are reading up on MaxRAMPercentage? From what I am reading, MaxRAMPercentage is just a different way of expressing how much of the available cgroup memory from the container to use. I am not seeing anything that says it takes additional Java overhead into account.

E.g., here are our live running containers. Notice their memory usage is ~300MB above the 2GB mark?

kafka-connect-s3-sink-connect-74ffbf8d5d-2p6nw             kafka-connect-s3-sink-connect   762m         2373Mi          
kafka-connect-s3-sink-connect-74ffbf8d5d-8k7gk             kafka-connect-s3-sink-connect   664m         2368Mi          
kafka-connect-s3-sink-connect-74ffbf8d5d-psppp             kafka-connect-s3-sink-connect   737m         2381Mi          

Now let's look at the config of one container:

$ k describe pod kafka-connect-s3-sink-connect-74ffbf8d5d-psppp
Name:                 kafka-connect-s3-sink-connect-74ffbf8d5d-psppp
Namespace:            lightstream
Priority:             1000000
...
Containers:
  kafka-connect-s3-sink-connect:
    Container ID:  containerd://2feced257e007b6bf9f4b2705a54cffc3b16e232d7947808b4a86dca6da584d6
...
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:      1
      memory:   4Gi
...
    Environment:
      KAFKA_HEAP_OPTS:                  -Xms2048m -Xmx2048m

So we have a heap hard-coded at 2GB, yet we're using 2.3GB of memory in the container. When we hop into that container and test the java ... commands:

$ k exec -ti kafka-connect-s3-sink-connect-74ffbf8d5d-psppp -- bash
[kafka@kafka-connect-s3-sink-connect-74ffbf8d5d-psppp kafka]$ java -XX:+PrintFlagsFinal -version | grep -Ei "maxheapsize|maxram"
   size_t MaxHeapSize                              = 1073741824                                {product} {ergonomic}
 uint64_t MaxRAM                                   = 137438953472                           {pd product} {default}
    uintx MaxRAMFraction                           = 4                                         {product} {default}
   double MaxRAMPercentage                         = 25.000000                                 {product} {default}

You can see the default MaxRAMPercentage of 25 gives us a MaxHeapSize of 1GB, which is exactly 25% of the total container size. If you set MaxRAMFraction=1 (i.e. 100%) then we end up with a MaxHeapSize of 4GB:

[kafka@kafka-connect-s3-sink-connect-74ffbf8d5d-psppp kafka]$ java -XX:+PrintFlagsFinal -XX:MaxRAMFraction=1 -version | grep -Ei "maxheapsize|maxram"
OpenJDK 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release.
   size_t MaxHeapSize                              = 4294967296                                {product} {ergonomic}
 uint64_t MaxRAM                                   = 137438953472                           {pd product} {default}
    uintx MaxRAMFraction                           = 1                                         {product} {command line}
   double MaxRAMPercentage                         = 100.000000                                {product} {default}

Based on what I am seeing here, nothing in these numbers helps you avoid an OOM situation if you set your max heap size to exactly the amount of memory your container's cgroup limit allows.

    PID     TID CLS RTPRIO STAT    VSZ   RSS COMMAND
      1       1 TS       - Ss     4376   740 tini
     43      43 TS       - Sl   9535996 2452096 java
    592     592 TS       - Ss    11848  2996 bash
    663     663 TS       - R+    49628  3064 ps
[kafka@kafka-connect-s3-sink-connect-74ffbf8d5d-psppp kafka]$ 

Side note: using the raw byte numbers from the java commands I pasted above seems better and less error-prone than trying to convert "human readable" memory sizes (e.g. 4.1GB) into bytes. Rounding errors are the reason we were actually seeing the heap size get set to GREATER than the total container size, rather than at least equaling it.
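
The rounding can be reproduced with the numbers from the pod above: the cgroup limit of 9148093649 bytes is roughly 8.5198 GiB, which -XshowSettings prints as "8.52G", and converting that string back to bytes overshoots the real limit:

$ echo "8.52" | awk '{ printf "%d\n", $1 * 1024 * 1024 * 1024 }'
9148280340

# reading MaxHeapSize straight from the flag dump avoids the lossy string conversion
$ java -XX:MaxRAMPercentage=100 -XX:+PrintFlagsFinal -version 2>/dev/null | awk '$2 == "MaxHeapSize" { print $4 }'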

tombentley commented 3 years ago

Can you point me to the documentation where you are reading up on MaxRAMPercentage

I can't. This code was originally written around the time that -XX:+UseCGroupMemoryLimitForHeap was working its way into Java 8, and it was all very poorly documented, so it was based on various blog posts etc. Having done a bit of searching, it seems that the "Java 8u131 and Java 9" section of https://merikan.com/2019/04/jvm-in-a-container/ represents the thinking at the time I implemented this. At some point we switched to -XX:+UseContainerSupport, and perhaps this explains the difference in behaviour, so the old logic is now flawed, as you observe.

tombentley commented 3 years ago

Given that the JVM doesn't really let us do this any more, I guess the simplest thing to do would be something like Xmx = 85% of the cgroup limit as you suggest.
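
A sketch of what that could look like when computed in the container startup script, reading the cgroup v1 limit (the 85 is just the example value from this discussion):

# container memory limit in bytes (cgroup v1 path)
LIMIT_BYTES=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
HEAP_BYTES=$(( LIMIT_BYTES * 85 / 100 ))
export KAFKA_HEAP_OPTS="-Xms${HEAP_BYTES} -Xmx${HEAP_BYTES}"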

diranged commented 3 years ago

Given that the JVM doesn't really let us do this any more, I guess the simplest thing to do would be something like Xmx = 85% of the cgroup limit as you suggest.

:+1: .. but please make the setting customizable...

scholzj commented 2 years ago

Triaged on 12.4.2022: We now use Java 11, which has container support enabled by default. The default percentage of memory used for the heap was changed to 75% in #6602 - this can be closed now.