Closed: agaldemas-eridanis closed this issue 9 months ago
Hi @agaldemas-eridanis,
Can you try using JAVA_TOOL_OPTIONS
instead? (see https://github.com/GoogleContainerTools/jib/blob/master/docs/faq.md#how-do-i-set-parameters-for-my-image-at-runtime)
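For a Kubernetes deployment, something along these lines should be enough to inject the variable (just a sketch, assuming the deployment is really named subscription-service, and with example values):
kubectl set env deployment/subscription-service JAVA_TOOL_OPTIONS="-Xms512m -Xmx1024m"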
Hello @bobeal, I have tried with JAVA_TOOL_OPTIONS,
and JDK_JAVA_OPTIONS
but, bad luck, no effect on the JVM memory :( ...
What is strange is that, in the pod:
root@subscription-service-545759b588-ghxmv:/# ps aux
USER         PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
root           1  142  1.4 6101056 466244 ?       Ssl  16:50   0:27 java -Xms256m -Xmx768m -cp @/app/jib-classpath-file com.egm.stellio.subscription.SubscriptionServiceApplic
=> still "-Xms256m -Xmx768m"
But :
root@subscription-service-545759b588-ghxmv:/# java -version
NOTE: Picked up JDK_JAVA_OPTIONS: -Xms512m -Xmx1024m
Picked up JAVA_TOOL_OPTIONS: -Xms512m -Xmx1024m
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)
However, the variables are set:
root@subscription-service-545759b588-ghxmv:/# env | grep JAVA
JAVA_HOME=/opt/java/openjdk
JAVA_TOOL_OPTIONS=-Xms512m -Xmx1024m
JDK_JAVA_OPTIONS=-Xms512m -Xmx1024m
JAVA_VERSION=jdk-17.0.6+10
I give up ;O) ... or is it jib's fault?
Hmm, I think it actually worked.
What you see with a ps aux
is the startup command as defined in the Docker image, not the actual (dynamic) memory configuration.
For instance, I launched api-gateway with JAVA_TOOL_OPTIONS="-Xmx2048m"
and:
➜ docker run --rm -ti -p 8080:8080 -e JAVA_TOOL_OPTIONS="-Xmx2048m" stellio/stellio-api-gateway:latest-dev
Picked up JAVA_TOOL_OPTIONS: -Xmx2048m
[...]
If I do a ps aux
in the container, I still see the "original" command:
root@db71ca92b68e:/# ps auxww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 19.4 2.2 3905496 234016 pts/0 Ssl+ 06:42 0:09 java -Xms64m -Xmx128m -cp @/app/jib-classpath-file com.egm.stellio.apigateway.ApiGatewayApplicationKt
root 43 0.0 0.0 7144 3584 pts/1 Ss 06:43 0:00 bash
root 51 0.0 0.0 9420 2816 pts/1 R+ 06:43 0:00 ps auxww
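To see what the running JVM (PID 1) really ended up with, rather than what the entrypoint shows, something like this should work, assuming the jcmd tool is available in the image (I haven't checked that it ships in the JRE base image):
jcmd 1 VM.flags | tr ' ' '\n' | grep -i heapsize
Note also that, if I read the java launcher documentation correctly, options coming from JAVA_TOOL_OPTIONS / JDK_JAVA_OPTIONS are placed before the options already present on the command line, so an explicit -Xmx baked into the image entrypoint could still take precedence; inspecting the live process is the only way to be sure.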
Hello @bobeal,
Hmm, I'd like to trust you, but we still face "java.lang.OutOfMemoryError: Java heap space"
ps auxww
USER         PID %CPU %MEM    VSZ     RSS TTY      STAT START   TIME COMMAND
root           1  310  3.7 6621312 1234716 ?       Ssl  07:33  61:06 java -Xms256m -Xmx768m -cp @/app/jib-classpath-file com.egm.stellio.subscription.SubscriptionServiceApplicationKt
and with java -version :
NOTE: Picked up JDK_JAVA_OPTIONS: -Xmx2048m
Picked up JAVA_TOOL_OPTIONS: -Xmx2048m
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)
and top :
top - 07:56:22 up 74 days, 13:07, 0 users, load average: 9.51, 11.32, 12.01
Tasks: 7 total, 1 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 79.2 us, 7.2 sy, 0.0 ni, 13.3 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 32100.3 total, 1294.4 free, 17831.2 used, 12974.7 buff/cache
MiB Swap: 4096.0 total, 788.5 free, 3307.5 used. 14017.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 6621312 1.2g 26780 S 556.7 3.8 80:43.63 java
after the memory error occurs, the java application does not work anymore...
So I don't know what to do on my side, except trying to activate probes and HPA, to see if we can work around the problem.
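For the HPA part, what I have in mind is simply something like this (a sketch only: the thresholds are examples and it assumes a metrics server is installed in the cluster):
kubectl autoscale deployment subscription-service --min=2 --max=6 --cpu-percent=80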
But I noticed that the JVM memory configuration in build.gradle.kts
for subscription-service
("-Xms256m", "-Xmx768m") is lower than the one for search-service
("-Xms512m", "-Xmx1024m"). Is it possible to align the configurations to:
jib.container.jvmFlags = listOf("-Xms512m", "-Xmx1024m")
If you can make this change, it would be nice, thanks. Or tell me if you want a PR?
Hello, if you put the probes back in, the subscription-service pods are restarted periodically by Kubernetes, and even with 6 replicas to absorb the traffic, the heap size error always comes back...
OK, it works, but it's a bit dirty!
I think that if you can make the change to align the subscription-service memory config with the search-service one, it would be the best first solution.
Until we find a more reliable solution, I published a new Docker image for subscription-service with the same memory settings as search-service: https://hub.docker.com/layers/stellio/stellio-subscription-service/latest-dev-memory/images/sha256-2805578c9babb83a04d0b3ab2a712e58b905aba06a3cb3e84ea5f173a18f1a70?context=explore
OK, thanks @bobeal, I'll try it now!
Bad luck, it will take a bit of time, since I have to upgrade to 2.7.0 or 2.8.0 (as we are still on 2.3.0). I'll come back soon.
You can also easily generate your own image:
subscription-service/build.gradle.kts
./gradlew jib -Djib.to.image=organization/repository:2.3.0-memory -p subscription-service
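If I remember the Jib docs correctly, list parameters can also be passed as comma-separated system properties, so you might even be able to override the memory flags without editing the build file (not tested on my side, so treat it as a sketch):
./gradlew jib -Djib.to.image=organization/repository:2.3.0-memory -Djib.container.jvmFlags=-Xms512m,-Xmx1024m -p subscription-service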
Thanks for the suggestion, it will avoid the need of timescale upgrade...
By the way, I'm having a problem upgrading Timescale, there is an error:
2023-10-17 14:25:24.288 UTC [250] ERROR: could not access file "$libdir/timescaledb-2.9.1": No such file or directory
2023-10-17 14:25:24.288 UTC [250] CONTEXT: SQL statement "update cron.job_run_details set status = 'failed', return_message = 'server restarted' where status in ('starting','running')"
2023-10-17 14:25:24.291 UTC [236] LOG: background worker "pg_cron launcher" (PID 250) exited with exit code 1
As I wanted to keep the database, there is a "cron.job_run_details" job which wants to use the old timescaledb-2.9.1
lib, so I think a dedicated issue is needed for this!
Let me know if it fixes the memory problem.
Did you follow the instructions from https://stellio.readthedocs.io/en/latest/admin/upgrading_to_2.7.0.html? (and yes please open a new issue if you have a problem upgrading)
Hello, yes, but only afterwards... I did it manually in psql, from the pod, and the 2 databases migrated OK! I wanted to go further by doing it through kubectl exec, but I faced another issue when I started the search-service. I will open another issue to discuss those migration problems:
The fact that the subscription-service was throwing out-of-memory errors was because several hundred entities were being updated each minute after being read from a database, generating a huge quantity of notifications to post; the size of the database also grew at incredible speed...
We have since added a cache to avoid useless updates on entities that have not changed, and now it's much better...
With the cache, the subscription-service works fine in 2.3.0 with the initial memory configuration.
The issue is not solved at all: there's a problem with Stellio 2.7.0, which rejects '@' characters in @type and @value, so we cannot use it for the moment.
So we are still using 2.3.0.
We reproduced the issue this morning: with a lot of queued notification requests in Kafka, when you start the subscription-service, the pods fail with a heap size memory error... When we add too many pods with 2.3.0, we get database connection issues...
So we first need to change the memory configuration of the subscription-service, by building a new image for 2.3!
And maybe we'll have to add more connections to the pool??? But maybe the issues are due to deadlocks generated by the crash of the subscription-service...
I'll keep you informed...
OK, I managed to build a new image in 2.3.0, on my personal dockerhub repo
archietuttle/subscription-service:2.3.0-memory
memory configuration is now:
jib.container.jvmFlags = listOf("-Xms512m", "-Xmx1536m")
We'll retry tomorrow to recreate the same conditions: we need to fill the Kafka queue while subscription-service is stopped, and restart it when many notifications are waiting to be consumed... to check that the out-of-memory error doesn't occur anymore...
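(To stop and restart the service while the Kafka queue fills up, I simply plan to scale the deployment down and back up; a sketch, assuming the deployment is named subscription-service:
kubectl scale deployment/subscription-service --replicas=0
kubectl scale deployment/subscription-service --replicas=1
)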
The issue is not solved at all: there's a problem with Stellio 2.7.0, which rejects '@' characters in @type and @value, so we cannot use it for the moment. We are still using 2.3.0.
We added some checks (from the specification) related to supported entity content and name. It may have introduced some unwanted side effect. Do you have a sample payload that is rejected?
The fact that the subscription-service was throwing out-of-memory errors was because several hundred entities were being updated each minute after being read from a database, generating a huge quantity of notifications to post; the size of the database also grew at incredible speed... We have since added a cache to avoid useless updates on entities that have not changed, and now it's much better...
I don't really get it here. You mean you were sending updates of entities, but the updated entities had no difference from the existing ones?
Hello, even I didn't understand it, but there was a NiFi flow file (a kind of dead pipeline) which was reading entities every second for nothing... Now it's OK, the traffic is reasonable.
The issue is not solved at all: there's a problem with Stellio 2.7.0, which rejects '@' characters in @type and @value, so we cannot use it for the moment. We are still using 2.3.0.
We added some checks (from the specification) related to supported entity content and name. It may have introduced some unwanted side effect. Do you have a sample payload that is rejected?
My colleague @JTrullier will open an issue with this problem we discovered.
As we don't have a working solution without building a specific image, I propose to keep this issue open!
Maybe it's an issue with the jib version or configuration?
Yes, I'll search for a solution that allows dynamic configuration of the memory.
We are already using the latest version of Jib, so it's probably something wrong or missing in the configuration.
I was doing some other tests.
I launched one service with this additional environment variable:
- JDK_JAVA_OPTIONS="-Xmx2048m"
If I connect to the container and check the JVM settings:
root@5d64df458ef1:/# java -XX:+PrintFlagsFinal -version | grep HeapSize
NOTE: Picked up JDK_JAVA_OPTIONS: "-Xmx2048m"
size_t ErgoHeapSizeLimit = 0 {product} {default}
size_t HeapSizePerGCThread = 43620760 {product} {default}
size_t InitialHeapSize = 163577856 {product} {ergonomic}
size_t LargePageHeapSizeThreshold = 134217728 {product} {default}
size_t MaxHeapSize = 2147483648 {product} {command line}
size_t MinHeapSize = 8388608 {product} {ergonomic}
uintx NonNMethodCodeHeapSize = 5839564 {pd product} {ergonomic}
uintx NonProfiledCodeHeapSize = 122909338 {pd product} {ergonomic}
uintx ProfiledCodeHeapSize = 122909338 {pd product} {ergonomic}
size_t SoftMaxHeapSize = 2147483648 {manageable} {ergonomic}
openjdk version "17.0.8.1" 2023-08-24
OpenJDK Runtime Environment Temurin-17.0.8.1+1 (build 17.0.8.1+1)
OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (build 17.0.8.1+1, mixed mode, sharing)
I also tried the other way (reducing the max heap size to 512m) and it seems to be OK as well:
root@8b4990b79b72:/# java -XX:+PrintFlagsFinal -version | grep HeapSize
NOTE: Picked up JDK_JAVA_OPTIONS: -Xms128m -Xmx512m
size_t ErgoHeapSizeLimit = 0 {product} {default}
size_t HeapSizePerGCThread = 43620760 {product} {default}
size_t InitialHeapSize = 134217728 {product} {command line}
size_t LargePageHeapSizeThreshold = 134217728 {product} {default}
size_t MaxHeapSize = 536870912 {product} {command line}
size_t MinHeapSize = 134217728 {product} {command line}
uintx NonNMethodCodeHeapSize = 5839564 {pd product} {ergonomic}
uintx NonProfiledCodeHeapSize = 122909338 {pd product} {ergonomic}
uintx ProfiledCodeHeapSize = 122909338 {pd product} {ergonomic}
size_t SoftMaxHeapSize = 536870912 {manageable} {ergonomic}
openjdk version "17.0.8.1" 2023-08-24
OpenJDK Runtime Environment Temurin-17.0.8.1+1 (build 17.0.8.1+1)
OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (build 17.0.8.1+1, mixed mode, sharing)
That being said, I'm thinking about totally removing the default memory config for heap size in the build files. Then containers will use default settings for the JVM (see https://learn.microsoft.com/en-us/azure/developer/java/containers/overview#understand-jvm-default-ergonomics for some numbers that are valid for Eclipse Temurin 17). And users will then override according to their use-cases. Seems sensible?
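For reference, a quick way to see what the default ergonomics would pick inside a memory-limited container (a sketch: the image and the limit are only examples, and the exact percentages depend on the JVM version, so treat the numbers as indicative):
docker run --rm -m 1g eclipse-temurin:17-jre java -XX:+PrintFlagsFinal -version | grep MaxHeapSize
And instead of absolute -Xmx values, users could set a percentage-based flag such as -XX:MaxRAMPercentage=75 in JAVA_TOOL_OPTIONS so that the heap follows the container memory limit.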
Finally about to publish Docker images without any JVM memory settings in #1086.
I did an experiment on a DB filled with entities having a lot of history, doing a retrieve temporal entity query asking for the whole history (in parallel, with a configurable number of users). Without any specific memory configuration, it crashed at some point with OutOfMemoryError
errors. I added JDK_JAVA_OPTIONS=-Xms4g -Xmx6g
and it passed without any errors. It seems it is well taken into account.
We face out-of-memory heap size errors in subscription-service when under load:
We have to start more pods to absorb the load, but manually, because of the lack of HPA...
So I'd like to give more memory to the JVM. The only place I saw such a configuration is in the
build.gradle.kts
file for the Spring-based services: jib.container.jvmFlags = listOf("-Xms256m", "-Xmx768m") (for subscription-service).
I wonder if it's possible to set a variable in the deployment to change this JVM option, without rebuilding the image?
I have tried to set DEFAULT_JVM_OPTS and JVM_OPTS, without success!
It seems that there is no way except changing
build.gradle.kts
or gradlew
and rebuilding... Any idea to work around this and be able to change JVM options before starting the container?