ahus1 opened this issue 4 weeks ago
I think I was able to reproduce the issue. I tested this with a 3-pod KC 26.0.5 cluster with 8 vCPUs and 2GB RAM per pod.
I increased memory pressure by disabling the persistent-user-sessions feature (keeping sessions in memory). This resulted in the whole KC cluster being restarted due to health probe failures.
GC overhead reached peaks of about 2.6%, JVM memory after GC was around 40-50%, and pod memory utilization reached over 90%.
However, there still weren't any "G1 Evacuation Pause" to-space exhausted or to-space overflow messages in the logs which would indicate that the GC is running out of memory (as mentioned in the G1 GC tuning guide).
I haven't found any straightforward way to trigger an OOME yet. I need to read up a bit on how G1 handles these situations.
This is pretty much what we also saw. @ahus1 mentioned a setting where you could tell the GC to cause an OOME if it is not able to free enough memory or takes too much time (-XX:GCTimeLimit and -XX:GCHeapFreeLimit), but it seems these are for ParallelGC only.
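For reference, a quick way to confirm those flags and that they only take effect with ParallelGC is shown below (a sketch; app.jar is just a placeholder application, not part of this report):

# Print the effective values of the overhead-limit flags (ParallelGC defaults: 98 / 2):
java -XX:+UseParallelGC -XX:+PrintFlagsFinal -version | grep -E 'GCTimeLimit|GCHeapFreeLimit'
# Sketch of tightening them so ParallelGC throws
# "java.lang.OutOfMemoryError: GC overhead limit exceeded" earlier:
java -XX:+UseParallelGC -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=10 -jar app.jar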
@tkyjovsk - the "connection refused" might be the message after the Pod restart, when the monitoring port is not yet available. Did you also see a "request timeout" or similar, indicating a slow response which would be caused by a high GC overhead? Please also confirm that there was no OOM kill from Kubernetes.
To decide on the next steps, it would be good to collect all information about the behavior we see in this issue. If the behavior is good enough (Pods eventually restart and the service becomes available again), there might be no action items. If we think it is not good enough (recovery takes too long, all Pods restart, etc.), then we can discuss what next steps we want to take.
Can you sum up this information for one of our next team meetings? Today might be a bit short notice - does Monday sound feasible?
@ahus1 Looks like there actually have been OOM kills from Kubernetes. I can see these for all KC pods:
status:
  ...
  containerStatuses:
  - containerID: cri-o://37aefee3497283caafe46d2e145bfc284d1ad0992f49a4d9506fc7a685365ae0
    image: quay.io/keycloak/keycloak:26.0.5
    imageID: quay.io/keycloak/keycloak@sha256:089b5898cb0ba151224c83aef3806538582880029eb5ea71c2afd00b627d4907
    lastState:
      terminated:
        containerID: cri-o://ff70d7b7174a6b676df209d46e790c8d1246a64fd1b6c5b6f0c9378e3567eed8
        exitCode: 137
        finishedAt: "2024-11-05T17:13:01Z"
        reason: OOMKilled
        startedAt: "2024-11-05T16:47:04Z"
This would mean that the JVM is not keeping within the OS/container memory limit, and the probe failures are only a secondary symptom.
@tkyjovsk are you setting the memory request/limit in the Keycloak CR?
@tkyjovsk How much non-heap, non-metaspace memory do you give to the JVM, i.e. how big is the difference between the pod memory limit and the Java heap+metaspace max?
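For reference, a minimal sketch of how the memory request/limit would be set in the Keycloak CR, assuming the operator's spec.resources field (the name and values are placeholders, not a recommendation):

apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: example-kc
spec:
  instances: 3
  resources:
    requests:
      cpu: "6"
      memory: 3G
    limits:
      memory: 4G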
@tkyjovsk - when running in a container, we set
-XX:MaxRAMPercentage=70 -XX:MinRAMPercentage=70 -XX:InitialRAMPercentage=50
which will allocate a percentage of the container's RAM to the heap, see this code for the details.
The "RAM Percentage" sizes the heap memory of the JVM as a percentage, but the JVM will continue to use additional pieces of memory for the Java byte code, threads and other non-heap memory elements.
The more memory you allocate to your Pod, the more memory the JVM will have as a left-over 30% from the setting above. I remember we tested this but it might have either changed over time, or we didn't push the test to its limits.
I suggest you set the memory request/limit to a higher value, configure -XmsXXm -XmxXXm explicitly, and then re-run the test. I'm curious how much non-heap memory Keycloak then uses in your test.
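One way to set the heap explicitly with the Keycloak container image is to override the default heap-percentage options through the JAVA_OPTS_KC_HEAP environment variable (a sketch; the variable name is taken from the image's startup script, and the sizes are placeholders only):

  env:
  - name: JAVA_OPTS_KC_HEAP
    value: "-Xms1400m -Xmx1400m"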
@tkyjovsk are you setting the memory request/limit in the Keycloak CR?
My theory when I wrote this comment was that the pod has unlimited memory. The JVM will then compute a high max heap size, which leads to Kubernetes killing the pod when the worker node runs out of memory.
You can check the max heap size computed by running the following command in the pod:
java -XX:+PrintFlagsFinal -XX:MaxRAMPercentage=70 -XX:MinRAMPercentage=70 -XX:InitialRAMPercentage=50 -version | grep HeapSize
In our deployment, for the following limits:
containers:
- resources:
    limits:
      memory: 4G
    requests:
      cpu: '6'
      memory: 3G
it returns the following (max heap size of ~2.8GB, i.e. roughly 70% of the 4G limit):
size_t ErgoHeapSizeLimit = 0 {product} {default}
size_t HeapSizePerGCThread = 43620760 {product} {default}
size_t InitialHeapSize = 2000683008 {product} {ergonomic}
size_t LargePageHeapSizeThreshold = 134217728 {product} {default}
size_t MaxHeapSize = 2801795072 {product} {ergonomic}
size_t MinHeapSize = 8388608 {product} {ergonomic}
uintx NonNMethodCodeHeapSize = 5839564 {pd product} {ergonomic}
uintx NonProfiledCodeHeapSize = 122909338 {pd product} {ergonomic}
uintx ProfiledCodeHeapSize = 122909338 {pd product} {ergonomic}
size_t SoftMaxHeapSize = 2801795072 {manageable} {ergonomic}
openjdk version "21.0.5" 2024-10-15 LTS
OpenJDK Runtime Environment (Red_Hat-21.0.5.0.10-1) (build 21.0.5+10-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-21.0.5.0.10-1) (build 21.0.5+10-LTS, mixed mode, sharing)
I don't think computing the heap memory via MaxRAMPercentage is a good way to go. In my experience, the amount of non-heap, non-metaspace memory you need is somewhat constant. We use 700MB for it and we know that with this setting we never get an OOMKill. With MaxRAMPercentage, the size of this area depends on the pod memory limit, which means that with a small pod memory limit you get OOMKills, while with a large enough pod memory limit you won't see them.
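A sketch of that sizing approach, assuming the pod memory limit is injected in MiB via the Downward API (resourceFieldRef: limits.memory) and reusing the 700MB non-heap budget mentioned above:

# Sketch: size the heap as "pod memory limit minus a fixed non-heap budget".
# POD_MEMORY_LIMIT_MB is assumed to come from the Downward API; 700 is the
# non-heap budget quoted above, not a universal constant.
NON_HEAP_MB=700
HEAP_MB=$(( POD_MEMORY_LIMIT_MB - NON_HEAP_MB ))
export JAVA_OPTS_KC_HEAP="-Xms${HEAP_MB}m -Xmx${HEAP_MB}m"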
Thanks for the feedback. Let me try and emulate the second situation then (trying to avoid the OOM kill).
I think I was able to reproduce the issue. I bumped the memory limits for pods to 4G while keeping the JVM heap limit at the original 1.4G (70% of the original 2G pod memory). This setting avoided the OOM kill and eventually ran the JVM into 70-90% GC overhead, showing messages like:
[2699.249s][info][gc] GC(6940) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1409M->1368M(1434M) 3.549ms
[2699.339s][info][gc] GC(6941) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1420M->1381M(1434M) 2.888ms
[2699.439s][info][gc] GC(6942) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1423M->1387M(1434M) 2.329ms
[2699.488s][info][gc] GC(6943) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1425M->1421M(1434M) 6.626ms
[2699.491s][info][gc] GC(6944) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1426M->1429M(1434M) 2.728ms
This resulted in a large percentage of failed requests. The liveness and readiness probe failures eventually restarted the pods, but only after several minutes of failures.
The GCTimeRatio option was set to its default value of 12, which should correspond to a target of ~8% GC overhead (1/(1+12) ≈ 7.7%). It is clear from the test results that the actual overhead was much larger, and for a considerable amount of time.
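The effective value can be double-checked in the pod with the same PrintFlagsFinal approach used earlier (12 is the G1 default):

java -XX:+UseG1GC -XX:+PrintFlagsFinal -version | grep GCTimeRatio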
In a state like this I would expect the GC to throw an OutOfMemoryError: GC overhead limit exceeded. I'm wondering whether this can be considered faulty behavior in the GC itself. Should we maybe report this as a possible JVM bug? @ahus1
@tkyjovsk - thank you for this detailed analysis and for capturing all these details.
Yes, please create a bug issue for the OpenJDK team at Red Hat in their JIRA. It might then turn out that we are missing a required configuration on our side, or we might get a recommendation to use a different GC (ZGC).
Let's see what ideas and suggestions our friends in the OpenJDK team at Red Hat come up with.
The community is reporting that with G1GC there is no out-of-memory exception / JVM shutdown when the Keycloak heap usage increases - originally tested with Keycloak 24. The scenario was that a lot of sessions were used. For KC26, maybe creating a lot of authentication sessions can be used, or the heap size can be set to a very small value (see the sketch below).
Originally reported by @sschu
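A minimal way to reproduce the small-heap variant could look like this (a sketch; JAVA_OPTS_KC_HEAP overrides the image's default heap sizing, and 64m is an arbitrary small value chosen only to exhaust the heap quickly, not a recommendation):

# Sketch: start Keycloak with a deliberately tiny heap to provoke heap exhaustion.
docker run --rm -p 8080:8080 \
  -e JAVA_OPTS_KC_HEAP="-Xms64m -Xmx64m" \
  quay.io/keycloak/keycloak:26.0.5 start-dev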