stellio-hub / stellio-context-broker

Stellio is an NGSI-LD compatible context broker
https://stellio.readthedocs.io
Apache License 2.0

How to change JVM memory options to "-Xms1024m", "-Xmx1024m" from a Kubernetes deployment? #1020

Closed agaldemas-eridanis closed 7 months ago

agaldemas-eridanis commented 11 months ago

We are getting out-of-memory (Java heap space) errors in subscription-service when it is under load:

WARN  c.e.s.s.s.EntityEventListenerService     - dispatchEntityEvent$suspendImpl - Entity event processing has failed: Java heap space
 java.lang.OutOfMemoryError: Java heap space
WARN  i.n.c.AbstractChannelHandlerContext      - invokeExceptionCaught - An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
 java.lang.OutOfMemoryError: Java heap space

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "DefaultDispatcher-worker-6"

We have to start more pods to absorb the load, and do it manually because we have no HPA...

So I'd like to give more memory to the JVM. The only place I have seen such a configuration is in the build.gradle.kts file of the Spring-based services: jib.container.jvmFlags = listOf("-Xms256m", "-Xmx768m") (for subscription-service)

Is it possible to set a variable in the deployment to change this JVM option without rebuilding the image?

I have tried setting DEFAULT_JVM_OPTS and JVM_OPTS, without success!

It seems there is no way other than changing build.gradle.kts or gradlew and rebuilding...

Any idea for a workaround that lets us change the JVM options before the container starts?

bobeal commented 11 months ago

Hi @agaldemas-eridanis,

Can you try using JAVA_TOOL_OPTIONS instead? (see https://github.com/GoogleContainerTools/jib/blob/master/docs/faq.md#how-do-i-set-parameters-for-my-image-at-runtime)

agaldemas-eridanis commented 11 months ago

Hello @bobeal, I have tried with JAVA_TOOL_OPTIONS and JDK_JAVA_OPTIONS,

but, bad luck, no effect on the JVM memory :( ...

What is strange is that, in the pod:

root@subscription-service-545759b588-ghxmv:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 142 1.4 6101056 466244 ? Ssl 16:50 0:27 java -Xms256m -Xmx768m -cp @/app/jib-classpath-file com.egm.stellio.subscription.SubscriptionServiceApplic

=> still "-Xms256m -Xmx768m"

But :

root@subscription-service-545759b588-ghxmv:/# java -version
NOTE: Picked up JDK_JAVA_OPTIONS: -Xms512m -Xmx1024m
Picked up JAVA_TOOL_OPTIONS: -Xms512m -Xmx1024m
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)

And the variables are indeed set:

root@subscription-service-545759b588-ghxmv:/# env|grep JAVA
JAVA_HOME=/opt/java/openjdk
JAVA_TOOL_OPTIONS=-Xms512m -Xmx1024m
JDK_JAVA_OPTIONS=-Xms512m -Xmx1024m
JAVA_VERSION=jdk-17.0.6+10

I give up ;O) ... or is it jib?

bobeal commented 11 months ago

Hmm, I think it actually worked.

What you see with a ps aux is the startup command as defined in the Docker image, not the actual (dynamic) memory configuration.

For instance, I launched api-gateway with JAVA_TOOL_OPTIONS="-Xmx2048m" and:

➜  docker run --rm -ti -p 8080:8080 -e JAVA_TOOL_OPTIONS="-Xmx2048m" stellio/stellio-api-gateway:latest-dev
Picked up JAVA_TOOL_OPTIONS: -Xmx2048m
[...]

If I do a ps aux in the container, I still see the "original" command:

root@db71ca92b68e:/# ps auxww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1 19.4  2.2 3905496 234016 pts/0  Ssl+ 06:42   0:09 java -Xms64m -Xmx128m -cp @/app/jib-classpath-file com.egm.stellio.apigateway.ApiGatewayApplicationKt
root        43  0.0  0.0   7144  3584 pts/1    Ss   06:43   0:00 bash
root        51  0.0  0.0   9420  2816 pts/1    R+   06:43   0:00 ps auxww
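As noted above, `ps` only shows the command baked into the image. One way to see what a JVM actually ended up with is to ask it from inside the process, for example by logging the values at application startup. The sketch below is hypothetical (the class `EffectiveHeapReport` is not part of Stellio) and uses only standard JDK calls. Keep in mind that a separate `java -version` started in the same container is a different JVM process, launched without the image's `-Xms`/`-Xmx` arguments, so its output does not necessarily reflect PID 1.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper: reports the heap settings the current JVM is really using. */
final class EffectiveHeapReport {

    /** Maximum heap the current JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** JVM arguments as seen by the runtime (main class and program arguments excluded). */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** One-line summary suitable for a startup log entry. */
    static String describe() {
        return "max heap = " + maxHeapMiB() + " MiB, JVM arguments = " + jvmArguments();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Calling `EffectiveHeapReport.describe()` from the service's own startup code (an assumption about where it would be wired in) would report the settings of the process that matters, rather than those of a throwaway JVM.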

agaldemas-eridanis commented 11 months ago

Hello @bobeal,

Hmm, I'd like to believe you, but we still get "java.lang.OutOfMemoryError: Java heap space"

ps auxww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 310 3.7 6621312 1234716 ? Ssl 07:33 61:06 java -Xms256m -Xmx768m -cp @/app/jib-classpath-file com.egm.stellio.subscription.SubscriptionServiceApplicationKt

and with java -version :

NOTE: Picked up JDK_JAVA_OPTIONS: -Xmx2048m
Picked up JAVA_TOOL_OPTIONS: -Xmx2048m
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)

and top :

top - 07:56:22 up 74 days, 13:07,  0 users,  load average: 9.51, 11.32, 12.01
Tasks:   7 total,   1 running,   6 sleeping,   0 stopped,   0 zombie
%Cpu(s): 79.2 us,  7.2 sy,  0.0 ni, 13.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
MiB Mem :  32100.3 total,   1294.4 free,  17831.2 used,  12974.7 buff/cache
MiB Swap:   4096.0 total,    788.5 free,   3307.5 used.  14017.0 avail Mem 

PID USER      PR  NI    VIRT          RES    SHR    S  %CPU  %MEM     TIME+ COMMAND                                                         
1  root    20   0     6621312   1.2g    26780 S 556.7   3.8             80:43.63 java

After the memory error occurs, the Java application no longer works...

So I don't know what to do on my side, except trying to activate the probes and HPA to see if we can work around the problem.

But I noticed that the JVM memory configuration in build.gradle.kts for subscription-service ("-Xms256m", "-Xmx768m") is lower than for search-service ("-Xms512m", "-Xmx1024m"). Would it be possible to align the configurations to: jib.container.jvmFlags = listOf("-Xms512m", "-Xmx1024m")?

If you can make this change for subscription-service, that would be nice, thanks. Or tell me if you want a PR.

agaldemas-eridanis commented 11 months ago

Hello, if we put the probes back in, the subscription-service pods are restarted periodically by Kubernetes, and even with 6 replicas to absorb the traffic, the heap space error always comes back...

OK, it works, but it is a bit dirty!

I think that aligning the subscription-service memory config with search-service would be the best first solution.

bobeal commented 11 months ago

Until we find a more reliable solution, I published a new Docker image for subscription-service with the same memory settings as search-service: https://hub.docker.com/layers/stellio/stellio-subscription-service/latest-dev-memory/images/sha256-2805578c9babb83a04d0b3ab2a712e58b905aba06a3cb3e84ea5f173a18f1a70?context=explore

agaldemas-eridanis commented 11 months ago

OK, thanks @bobeal, I'll try it now!

agaldemas-eridanis commented 11 months ago

Bad luck, it will take a bit of time, since I have to upgrade to 2.7.0 or 2.8.0 (we are still on 2.3.0). I'll come back soon.

bobeal commented 11 months ago

You can also easily generate your own image:

./gradlew jib -Djib.to.image=organization/repository:2.3.0-memory -p subscription-service

agaldemas-eridanis commented 11 months ago

Thanks for the suggestion, it will avoid the need for a Timescale upgrade...

agaldemas-eridanis commented 11 months ago

By the way, I'm having a problem upgrading Timescale; there is an error:

2023-10-17 14:25:24.288 UTC [250] ERROR:  could not access file "$libdir/timescaledb-2.9.1": No such file or directory
2023-10-17 14:25:24.288 UTC [250] CONTEXT:  SQL statement "update cron.job_run_details set status = 'failed', return_message = 'server restarted' where status in ('starting','running')"
2023-10-17 14:25:24.291 UTC [236] LOG:  background worker "pg_cron launcher" (PID 250) exited with exit code 1

As I wanted to keep the database, there is a pg_cron statement on "cron.job_run_details" that still tries to load the old timescaledb-2.9.1 lib, so I think a separate issue is needed for this!

bobeal commented 11 months ago

> Thanks for the suggestion, it will avoid the need of timescale upgrade...

Let me know if it fixes the memory problem.

bobeal commented 11 months ago

> Thanks for the suggestion, it will avoid the need of timescale upgrade...
>
> by the way, I'm having a problem upgrading Timescale there is an error :
>
> 2023-10-17 14:25:24.288 UTC [250] ERROR:  could not access file "$libdir/timescaledb-2.9.1": No such file or directory
> 2023-10-17 14:25:24.288 UTC [250] CONTEXT:  SQL statement "update cron.job_run_details set status = 'failed', return_message = 'server restarted' where status in ('starting','running')"
> 2023-10-17 14:25:24.291 UTC [236] LOG:  background worker "pg_cron launcher" (PID 250) exited with exit code 1
>
> As I wanted too keep the database, there is a "cron.job_run_details" job which wants to use the old timescaledb-2.9.1 lib, so I think a special issue is needed for this !.

Did you follow the instructions from https://stellio.readthedocs.io/en/latest/admin/upgrading_to_2.7.0.html? (and yes please open a new issue if you have a problem upgrading)

agaldemas-eridanis commented 10 months ago

Hello, yes, but only afterwards... I did it manually in psql, from the pod, and the 2 databases were upgraded OK! I wanted to go further by doing it through kubectl exec, but I ran into another issue when I started the search-service. I'll open a separate issue about these migration problems.

The out-of-memory errors in subscription-service were caused by several hundred entities being updated every minute after a database read, which generated a huge number of notifications to post; the database also grew at an incredible rate...

Since then we have added a cache to avoid useless updates of entities that have not changed, and things are much better now.

With the cache, the subscription-service works fine in 2.3.0 with the initial memory configuration.

agaldemas-eridanis commented 10 months ago

The issue is not solved at all: Stellio 2.7.0 rejects '@' characters in @type and @value, so we cannot use it for the moment and we are still on 2.3.0.

We reproduced the issue this morning: with a lot of notification requests queued in Kafka, the pods fail with a heap space error when the subscription service starts. And when we add too many pods with 2.3.0, we get database connection issues...

So we first need to change the memory configuration of the subscription service by building a new image for 2.3!

We may also have to add more connections to the pool, but the connection issues may be due to deadlocks caused by the subscription service crashing...

I'll keep you informed...

agaldemas-eridanis commented 10 months ago

OK, I managed to build a new image for 2.3.0, on my personal Docker Hub repo:

archietuttle/subscription-service:2.3.0-memory

The memory configuration is now: jib.container.jvmFlags = listOf("-Xms512m", "-Xmx1536m")

We'll retry tomorrow to recreate the same conditions: we need to fill the Kafka queue while subscription-service is stopped, then restart it when many notifications are waiting to be consumed, to check that the out-of-memory error no longer occurs.

bobeal commented 10 months ago

> the issue is not solved at all, since there's an issue with stellio 2.7.0, which reject '@' characters in @type and @value, we cannot use it, for the moment. then we are still using 2.3.0

We added some checks (from the specification) related to supported entity content and name. It may have introduced some unwanted side effect. Do you have a sample payload that is rejected?

bobeal commented 10 months ago

> The fact that the subscription-service was making out-of memory errors, was because of updating several hundreds of entities each minute after reading in a database, generating a huge quantity of notifications to post, then also the size of the database grew up at incredible speed...
>
> We have added a cache, since, to avoid unuseful update on entities that have not changed, now it's far much better....

I don't really get this. Do you mean you were sending entity updates even though the updated entities were identical to the existing ones?

agaldemas-eridanis commented 10 months ago

Hello, I didn't understand it either, but there was a NiFi flow (a kind of death pipeline) that was reading entities every second, for nothing... Now it's OK, the traffic is reasonable.

> the issue is not solved at all, since there's an issue with stellio 2.7.0, which reject '@' characters in @type and @value, we cannot use it, for the moment. then we are still using 2.3.0
>
> We added some checks (from the specification) related to supported entity content and name. It may have introduced some unwanted side effect. Do you have a sample payload that is rejected?

My colleague @JTrullier will open an issue for this problem we discovered.

agaldemas-eridanis commented 10 months ago

As we do not have a working solution that avoids building a specific image, I propose to keep this issue open!

Maybe it's an issue with the jib version or configuration?

bobeal commented 10 months ago

Yes, I'll search for a solution that allows dynamic configuration of the memory.

We are already using the latest version of Jib, so there is probably something wrong or missing in the configuration.
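One possible explanation, offered here as an assumption rather than something confirmed in this thread: as far as I understand HotSpot, `JAVA_TOOL_OPTIONS` is read before the command-line arguments, the launcher prepends `JDK_JAVA_OPTIONS` to them, and when a flag such as `-Xmx` appears several times the last occurrence wins. Under that model, the `-Xms256m -Xmx768m` that jib puts on the command line would override the values coming from the environment, while a fresh `java -version` (no command-line heap flags) would show the environment values. The sketch below is a simplified model of that rule, not JVM code; `HeapFlagPrecedence` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of "last -Xmx wins" across environment and command line. */
final class HeapFlagPrecedence {

    /** Parses a size such as "768m" or "2g" into bytes (k/m/g suffixes or plain bytes). */
    static long parseSize(String value) {
        String v = value.trim().toLowerCase();
        char unit = v.charAt(v.length() - 1);
        long multiplier;
        switch (unit) {
            case 'k': multiplier = 1024L; break;
            case 'm': multiplier = 1024L * 1024L; break;
            case 'g': multiplier = 1024L * 1024L * 1024L; break;
            default: return Long.parseLong(v); // no suffix: value is already in bytes
        }
        return Long.parseLong(v.substring(0, v.length() - 1)) * multiplier;
    }

    /**
     * Returns the effective -Xmx in bytes, or -1 if no -Xmx is present.
     * Assumption: environment options are applied first, command-line flags afterwards.
     */
    static long effectiveMaxHeap(String envOptions, List<String> commandLineFlags) {
        List<String> ordered = new ArrayList<>();
        if (envOptions != null && !envOptions.isBlank()) {
            ordered.addAll(List.of(envOptions.trim().split("\\s+")));
        }
        ordered.addAll(commandLineFlags);

        long result = -1;
        for (String flag : ordered) {
            if (flag.startsWith("-Xmx")) {
                result = parseSize(flag.substring(4)); // later occurrences overwrite earlier ones
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Values taken from the thread: environment variable vs. flags baked into the image.
        long bytes = effectiveMaxHeap("-Xms512m -Xmx1024m", List.of("-Xms256m", "-Xmx768m"));
        System.out.println("effective max heap: " + bytes / (1024L * 1024L) + " MiB");
    }
}
```

If this model holds, the environment variables can only take effect once the image no longer hard-codes `-Xmx` on its command line.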

bobeal commented 10 months ago

Was doing some other tests.

Launched one service with this additional environment variable:

      - JDK_JAVA_OPTIONS="-Xmx2048m"

If I connect to the container and check the JVM settings:

root@5d64df458ef1:/# java -XX:+PrintFlagsFinal -version | grep HeapSize
NOTE: Picked up JDK_JAVA_OPTIONS: "-Xmx2048m"
   size_t ErgoHeapSizeLimit                        = 0                                         {product} {default}
   size_t HeapSizePerGCThread                      = 43620760                                  {product} {default}
   size_t InitialHeapSize                          = 163577856                                 {product} {ergonomic}
   size_t LargePageHeapSizeThreshold               = 134217728                                 {product} {default}
   size_t MaxHeapSize                              = 2147483648                                {product} {command line}
   size_t MinHeapSize                              = 8388608                                   {product} {ergonomic}
    uintx NonNMethodCodeHeapSize                   = 5839564                                {pd product} {ergonomic}
    uintx NonProfiledCodeHeapSize                  = 122909338                              {pd product} {ergonomic}
    uintx ProfiledCodeHeapSize                     = 122909338                              {pd product} {ergonomic}
   size_t SoftMaxHeapSize                          = 2147483648                             {manageable} {ergonomic}
openjdk version "17.0.8.1" 2023-08-24
OpenJDK Runtime Environment Temurin-17.0.8.1+1 (build 17.0.8.1+1)
OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (build 17.0.8.1+1, mixed mode, sharing)

Also tried the other way (reducing max heap size to 512m) and seems it is also OK:

root@8b4990b79b72:/# java -XX:+PrintFlagsFinal -version | grep HeapSize
NOTE: Picked up JDK_JAVA_OPTIONS: -Xms128m -Xmx512m
   size_t ErgoHeapSizeLimit                        = 0                                         {product} {default}
   size_t HeapSizePerGCThread                      = 43620760                                  {product} {default}
   size_t InitialHeapSize                          = 134217728                                 {product} {command line}
   size_t LargePageHeapSizeThreshold               = 134217728                                 {product} {default}
   size_t MaxHeapSize                              = 536870912                                 {product} {command line}
   size_t MinHeapSize                              = 134217728                                 {product} {command line}
    uintx NonNMethodCodeHeapSize                   = 5839564                                {pd product} {ergonomic}
    uintx NonProfiledCodeHeapSize                  = 122909338                              {pd product} {ergonomic}
    uintx ProfiledCodeHeapSize                     = 122909338                              {pd product} {ergonomic}
   size_t SoftMaxHeapSize                          = 536870912                              {manageable} {ergonomic}
openjdk version "17.0.8.1" 2023-08-24
OpenJDK Runtime Environment Temurin-17.0.8.1+1 (build 17.0.8.1+1)
OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (build 17.0.8.1+1, mixed mode, sharing)

That being said, I'm thinking about totally removing the default memory config for heap size in the build files. Then containers will use default settings for the JVM (see https://learn.microsoft.com/en-us/azure/developer/java/containers/overview#understand-jvm-default-ergonomics for some numbers that are valid for Eclipse Temurin 17). And users will then override according to their use-cases. Seems sensible?
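To get a feel for what removing the defaults would mean, the sketch below estimates the maximum heap a container would get from JVM ergonomics. It assumes the HotSpot default of `MaxRAMPercentage=25` and ignores the special cases for very small containers described on the linked page; `DefaultHeapEstimate` is a hypothetical name, and the real values should be checked with `-XX:+PrintFlagsFinal` as in the output above.

```java
/** Hypothetical estimate of the ergonomic default max heap for a given container memory limit. */
final class DefaultHeapEstimate {

    /** Heap estimate in MiB for a container limit in MiB and a MaxRAMPercentage value. */
    static long defaultMaxHeapMiB(long containerMemoryMiB, double maxRamPercentage) {
        return (long) (containerMemoryMiB * maxRamPercentage / 100.0);
    }

    public static void main(String[] args) {
        long formerSubscriptionServiceMaxMiB = 768; // -Xmx768m from the jib configuration quoted in the thread
        long[] containerLimitsMiB = {1024, 2048, 4096}; // illustrative limits, not taken from the thread
        for (long limit : containerLimitsMiB) {
            long estimate = defaultMaxHeapMiB(limit, 25.0);
            String comparison = estimate < formerSubscriptionServiceMaxMiB ? "lower than" : "at least";
            System.out.println("limit " + limit + " MiB: default max heap about " + estimate
                    + " MiB (" + comparison + " the former 768 MiB)");
        }
    }
}
```

Under these assumptions, a pod with a small memory limit could end up with less heap than the previous hard-coded value, so users relying on the defaults may still want to set `JDK_JAVA_OPTIONS` explicitly.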

bobeal commented 7 months ago

Finally about to publish Docker images without any JVM memory settings in #1086.

Did an experiment on a DB filled with entities having a lot of history and doing a retrieve temporal entity asking for the whole history (in parallel with a configurable number of users). Without any specific memory configuration, it crashed at some point with OutOfMemoryError errors. Added JDK_JAVA_OPTIONS=-Xms4g -Xmx6g and it passed without any errors. Seems it is well taken into account.