MichalPopielski opened 2 years ago
We are testing JAVA_TOOL_OPTIONS
set to -XX:+UseG1GC -XX:+AlwaysActAsServerClassMachine
for AKHQ hosted on our AKS environment.
So far it has been working well for a few days straight.
We have used this trick on other pods running Java apps that faced similar OOM issues, and it seems it might help in this case as well.
Root cause: if there are fewer than 2 CPUs available (in our case the limit is less than 1 CPU), the JVM selects the Serial GC by default. It does not seem to cope well and ends up with OOM issues under heavy memory usage.
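To confirm this root cause in your own pod, you can check how many CPUs the JVM actually sees and which collector its ergonomics selected. A minimal sketch (class name `GcCheck` is mine, not from AKHQ):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
    public static void main(String[] args) {
        // With fewer than 2 visible CPUs, JVM ergonomics default to Serial GC.
        System.out.println("CPUs visible to JVM: " + Runtime.getRuntime().availableProcessors());
        // List the collectors actually in use (e.g. "G1 Young Generation").
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("Active collector: " + gc.getName());
        }
    }
}
```

Run this inside the container with the same CPU limit as AKHQ; if the collector names mention "Copy"/"MarkSweepCompact", you are on Serial GC.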
@MichalPopielski : thanks for the information, it can be a good clue!
We also hit this issue frequently. G1GC seems to be a better option for AKHQ, and Java 17 improves things greatly, with a 20% reduction in memory footprint. @tchiotludo, would you merge a PR that upgrades to Java 17? I can work on it.
That said, the issue is in direct memory (outside the heap), so it's strange that using G1GC solves it. More memory should be allocated overall, but not more heap, to leave room for the direct memory buffers.
Since the container image is JRE-based, it may be hard to investigate, as we are missing the JDK tools needed to see what's going on. I'd also like to propose switching to JDK images instead of JRE ones; again, I can do this in the same PR migrating to Java 17.
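For context on the direct-memory point above: direct buffers are allocated outside the Java heap and are capped by -XX:MaxDirectMemorySize (which defaults to the max heap size), so heap tuning alone doesn't govern them. A small illustration:

```java
import java.nio.ByteBuffer;

public class DirectMemoryDemo {
    public static void main(String[] args) {
        // Direct buffers live outside the Java heap; their total size is capped
        // by -XX:MaxDirectMemorySize, not by -Xmx-style heap tuning.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024); // 1 MiB off-heap
        System.out.println("Direct buffer capacity: " + buf.capacity());
    }
}
```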
By the way, @MichalPopielski, -XX:+AlwaysActAsServerClassMachine
should not be needed if the heap size is more than 1GB, as the JVM will switch to server-class machine mode automatically in that case.
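A quick way to see what the ergonomics actually chose is to print the max heap the JVM settled on; if it already matches your intended sizing, the extra flag adds nothing. A minimal check (class name `HeapInfo` is mine):

```java
public class HeapInfo {
    public static void main(String[] args) {
        // The max heap the JVM ergonomics (or explicit -Xmx) selected.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap (MB): " + maxMb);
    }
}
```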
@loicmathieu java 17 could make sense I think, go for it. For the jdk one, it will make the image really heavy, why not published on alternative image with jdk that we will ask people to use on debugging purpose ?
Dear all, a new OOM error occurred once again. To resolve it we had to restart the whole pod; otherwise users received only a grey background. All Java options were set as in my previous post.
A health metric monitoring such a case could help automate the restart of an AKHQ instance that has run out of memory.
We are also seeing OOM exceptions that kill the background processes but do not cause a health check to fail. This makes our monitoring think AKHQ is still available and blocks auto-healing of the pod in Kubernetes. It would be great if the app crashed, or at least had a failing health check, when only the frontend can be served.
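As a workaround until AKHQ handles this itself: the JVM flag -XX:+ExitOnOutOfMemoryError makes the process exit on the first OutOfMemoryError, so Kubernetes restart policies kick in instead of the pod limping along. The same fail-fast behavior can be sketched in application code (this is my own sketch, not AKHQ's actual handling; the exit code is arbitrary):

```java
public class OomFailFast {
    public static void main(String[] args) {
        // Exit the whole JVM if any thread dies of OutOfMemoryError, so the
        // orchestrator sees a dead container and restarts the pod.
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (error instanceof OutOfMemoryError) {
                System.err.println("OOM in thread " + thread.getName() + ", exiting for pod restart");
                Runtime.getRuntime().halt(137); // bypass shutdown hooks, fail fast
            }
        });
        System.out.println("OOM handler installed");
        // ... application startup would follow here ...
    }
}
```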
Whenever a user starts looking at a topic, its records are tailed. Recently one of our users started checking data from a topic with really high throughput. My assumption is that AKHQ reaches the memory limit, after which users are no longer able to use AKHQ as it throws an OutOfMemoryError:
I tried increasing the memory and heap size (up to 14GB), but it didn't help. Any ideas how I can limit the infinite tailing of a topic?
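On the Kafka-client side, the knobs that bound how much data a single poll can pull back are max.poll.records, fetch.max.bytes, and max.partition.fetch.bytes (all standard Kafka consumer configs). Whether and where AKHQ exposes them is a separate question; this sketch just shows the properties and illustrative values:

```java
import java.util.Properties;

public class BoundedConsumerProps {
    public static void main(String[] args) {
        // Standard Kafka consumer configs that cap per-poll volume, so a
        // high-throughput topic cannot exhaust memory while being tailed.
        Properties props = new Properties();
        props.put("max.poll.records", "500");              // records per poll() call
        props.put("fetch.max.bytes", "10485760");          // 10 MiB total per fetch
        props.put("max.partition.fetch.bytes", "1048576"); // 1 MiB per partition
        System.out.println(props);
    }
}
```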