vaadin / kubernetes-kit


Memory Leak Causing Pod Shutdowns #115

Open MichaelPluessErni opened 5 months ago

MichaelPluessErni commented 5 months ago

Error description:

Since introducing session serialization, we are experiencing multiple pod kills in our environment. The Vaadin application is running in production, and the pods it runs on regularly reach the memory limit of 3 GB, causing them to be shut down. Sometimes they reach the limit in as little as 4 hours.

This has only happened since we started using session serialization with the Kubernetes Kit. Before that, memory usage would go down during the afternoon and evening, when we have less traffic.

It looks as though there is a memory leak somewhere, as the memory is not cleared properly.

We tried using a different garbage collector. This helped a bit, but did not solve the problem.

Expected behaviour:

Memory usage should not be permanently affected by session serialization. While it is clear that serialization itself temporarily uses more memory, it should not cause lasting memory leaks. Errors during serialization or deserialization should not cause memory leaks either.

Details:

A comparison of the memory usage of our pods before and after the introduction of session serialization (introduced on the 3rd of March):

[screenshot: pod memory usage before and after 3 March]

It is visible from the logs that the frequency with which pods are killed has increased drastically:

[screenshot: pod kill frequency from the logs]

The memory leaks seem to happen in "jumps".

[screenshot: memory usage increasing in jumps]

This pod will be killed after one more memory-leak event. As expected, more memory-leak events occur during times of higher usage (in our case, during working hours).

[screenshot: pod approaching the memory limit]

The new garbage collector does not solve the problem:

[screenshot: memory usage with the new garbage collector]

heruan commented 5 months ago

Thanks for reporting! We are investigating this. Can you provide an estimate of the session size when serialization happens, or a project replicating the issue?

anderslauri commented 5 months ago

> Thanks for reporting! We are investigating this. Can you provide an estimate of the session size when serialization happens, or a project replicating the issue?

Hi,

I work at the same client as @MichaelPluessErni - let me add some details here. The first image is a time series of jvm_classes_unloaded_classes_total - the difference between running with the Kubernetes Kit and before is clearly visible. The second image below shows jvm_classes_loaded_classes. I would assume that with serialization to Redis classes are discarded, and with deserialization from Redis new classes are loaded, forcing old gen to grow and the GC to work harder. We have stabilized memory with G1 after some tuning; however, this data suggests something is not optimal. Perhaps a class pool could be used to dampen these numbers and the GC pressure.

[screenshots: jvm_classes_unloaded_classes_total and jvm_classes_loaded_classes over time]
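For reference, the counters behind these metrics are typically the ones exposed by the JVM's standard ClassLoadingMXBean, so they can also be read directly from inside the application. A minimal sketch (class name is just an example):

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadingStats {
    public static void main(String[] args) {
        // The platform MXBean exposes the loaded/unloaded class counters that
        // metrics libraries usually export as jvm_classes_* time series.
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        System.out.println("Currently loaded classes: " + classLoading.getLoadedClassCount());
        System.out.println("Total loaded classes:     " + classLoading.getTotalLoadedClassCount());
        System.out.println("Unloaded classes:         " + classLoading.getUnloadedClassCount());
    }
}
```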

MichaelPluessErni commented 5 months ago

@heruan Based on my rough estimates, a single VaadinSession seems to be about 7 kB. Sadly, it is difficult to provide a sample project with this error, as our application is rather large and I do not know what is causing the issue, so I cannot replicate it in a sample project. However, if this is a general issue, it should appear in any project that uses Redis + session replication.
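An estimate of this kind can be obtained by serializing the session (or individual session attributes) with plain Java serialization and counting the bytes. A minimal sketch, with a hypothetical helper class not taken from the Kubernetes Kit:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public final class SerializedSizeEstimator {

    /** Serializes the given object with plain Java serialization and returns the byte count. */
    public static int sizeInBytes(Object sessionOrAttribute) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(sessionOrAttribute);
        }
        return buffer.size();
    }

    private SerializedSizeEstimator() {
    }
}
```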

mcollovati commented 5 months ago

@MichaelPluessErni @anderslauri would you be able to take a couple of heap dumps and compare them to check which objects are actually making the memory usage grow? This would help a lot in the investigation.

anderslauri commented 5 months ago

> @MichaelPluessErni @anderslauri would you be able to take a couple of heap dumps and compare them to check which objects are actually making the memory usage grow? This would help a lot in the investigation.

Yes, this is possible. Let us do this.
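For reference, a heap dump can also be triggered from inside the running JVM through the HotSpot diagnostic MXBean; a minimal sketch (the output path is just an example):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws IOException {
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // live = true dumps only objects that are still reachable.
        diagnostic.dumpHeap("/tmp/before-leak.hprof", true);
    }
}
```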

mcollovati commented 5 months ago

@MichaelPluessErni @anderslauri an additional question: which driver are you using to connect to Redis, Lettuce or Jedis? Did you perhaps try changing the driver to verify that the leak is independent of it?

EDIT: looking at the other issues, it looks like Lettuce is in use

MichaelPluessErni commented 5 months ago

@mcollovati We're using:

- redis.clients.jedis 5.0.2
- io.lettuce.lettuce-core 6.3.1.RELEASE

It is not easy to test other versions, as the bug only appears on the production system.
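Since both drivers are on the classpath, the connection factory actually in use may not be obvious. A minimal sketch of how the driver could be pinned explicitly with Spring Data Redis (host, port and bean names are examples, not taken from the project):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Configuration
public class RedisDriverConfig {

    @Bean
    public RedisConnectionFactory redisConnectionFactory() {
        RedisStandaloneConfiguration redis = new RedisStandaloneConfiguration("redis-host", 6379);
        // Swap to new JedisConnectionFactory(redis) to compare behaviour with the Jedis driver.
        return new LettuceConnectionFactory(redis);
    }
}
```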

MichaelPluessErni commented 5 months ago

@mcollovati I'm now able to produce heap dumps and analyze them.

A heap dump from one of our production pods: 477 MB (hprof file size)

Summary: [screenshot]

Classes by size of instances: [screenshot]

Dominators by retained size: [screenshot]

I hope this helps already. Otherwise I'm available for more specific analyses on heap dumps.

mcollovati commented 5 months ago

@MichaelPluessErni thank you very much!

Is this dump taken when memory is already leaked? If so, I would also take a dump before the memory grows to compare them.

Memory Analyzer (MAT) is a great tool to inspect heap dumps. It also provides a Leak Suspects report that may help in the investigation (although I don't remember if the report can be exported).

Otherwise, if you can privately share the dump with me, I can do further analysis.

MichaelPluessErni commented 5 months ago

@mcollovati this dump is pre-leak, meaning from a "healthy" pod.

MichaelPluessErni commented 5 months ago

@mcollovati Using MAT proves difficult, as I'm not able to download it on the company laptop. We're investigating whether it is possible to send you the dump.

Meanwhile, we've found a GC configuration that helps ameliorate the memory leak:

// These settings define the following:
// InitiatingHeapOccupancyPercent  = 30     (default 45). Once old gen heap occupancy rises above 30%, Java begins the concurrent marking cycle for GC.
// G1MixedGCLiveThresholdPercent   = 85     (default 85). Only old gen regions with a live occupancy below the configured value are collected in the space-reclamation phase.
// G1OldCSetRegionThresholdPercent = 25     (default 10). Upper limit on the old gen regions added to the collection set per GC cycle, as a percentage of the heap.
"application.JDK_JAVA_OPTIONS": "\"-XX:+UnlockExperimentalVMOptions -XX:InitiatingHeapOccupancyPercent=30 -XX:G1MixedGCLiveThresholdPercent=85 -XX:G1OldCSetRegionThresholdPercent=25\"",
// Reduce MAX_RAM_PERCENTAGE from 80% to 60%. Given 3100 MB, this represents 1860 MB, which should be more than enough.
"application.MAX_RAM_PERCENTAGE": "60",
"application.INITIAL_RAM_PERCENTAGE": "60",

[screenshot: memory usage after GC tuning]
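To check whether tuning like this actually keeps old gen under control, the per-pool heap usage can be read from the JVM; a minimal sketch (with G1 the relevant pool is usually reported as "G1 Old Gen"):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class HeapPoolUsage {
    public static void main(String[] args) {
        // Print used/max bytes for every heap pool; with G1 the old generation
        // typically appears as "G1 Old Gen".
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                System.out.printf("%-20s %d / %d bytes%n",
                        pool.getName(), pool.getUsage().getUsed(), pool.getUsage().getMax());
            }
        }
    }
}
```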