quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0

RSS usage increase between Quarkus 3.11 and 3.13 in MP application in native mode #42506

Closed rsvoboda closed 1 month ago

rsvoboda commented 3 months ago

Describe the bug

After a recent update from Quarkus 3.11 to 3.13 in QE StartStopTS (https://github.com/quarkus-qe/quarkus-startstop) we noticed an RSS usage increase / failures due to hitting the threshold in native mode for an application using MicroProfile / SmallRye extensions.

I ran my experiments on a local Apple MBP M3 Pro, limiting the running applications to a minimum, using 21.0.2-graalce. JVM mode runs didn't reveal a noticeable difference.

I did checks on more versions, but these are the key ones:

3.13.2
AVG RSS (kB) without min and max values: 65446

3.12.3
AVG RSS (kB) without min and max values: 61893

3.11.3
AVG RSS (kB) without min and max values: 60969

These numbers are specific to my system; you may see slightly different values when running the reproducer steps. The key thing is the upward trend in RSS usage with the same app after the first successful request. The issue manifests in native mode, but IMHO it's not related to native-image tooling. Just FYI, I noticed that RSS was slightly smaller when using Mandrel (23.1.4.r21-mandrel) to build the binary.

There is a slight increase between 3.11.3 and 3.12.3; the difference between 3.11.3 / 3.12.3 and 3.13.2 is more noticeable, ~5-7 %.
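For reference, the reported metric can be sketched as follows; the drop-min-and-max semantics are my assumption about what the test suite computes, not code taken from StartStopTS:

```java
// Sketch (assumed semantics, not the StartStopTS source): average RSS across
// iterations with the single smallest and largest samples dropped, so one
// outlier run does not skew the comparison between Quarkus versions.
final class RssStats {
    static long avgWithoutMinMax(long[] rssKb) {
        if (rssKb.length < 3) {
            throw new IllegalArgumentException("need at least 3 samples");
        }
        long sum = 0, min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : rssKb) {
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return (sum - min - max) / (rssKb.length - 2);
    }
}
```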

Some details from the native-image command output:

3.13.2

  29.65MB (46.02%) for code area:    51,790 compilation units
  33.88MB (52.58%) for image heap:  377,317 objects and 78 resources
 921.53kB ( 1.40%) for other data
  64.43MB in total
------------------------------------------------------------------------------------------------------------------------
Top 10 origins of code area:                                Top 10 object types in image heap:
  12.10MB java.base                                           10.03MB byte[] for code metadata
   1.86MB quarkus-runner.jar                                   5.29MB byte[] for java.lang.String
   1.81MB c.f.jackson.core.jackson-databind-2.17.2.jar         4.07MB java.lang.Class
   1.34MB svm.jar (Native Image)                               3.49MB java.lang.String
   1.26MB modified-io.vertx.vertx-core-4.5.9.jar               1.39MB com.oracle.svm.core.hub.DynamicHubCompanion
 637.69kB io.netty.netty-buffer-4.1.111.Final.jar           1005.64kB byte[] for general heap data
 589.17kB com.fasterxml.jackson.core.jackson-core-2.17.2.jar 908.41kB byte[] for reflection metadata
 509.34kB io.netty.netty-common-4.1.111.Final.jar            704.97kB java.lang.String[]
 440.35kB io.netty.netty-codec-http-4.1.111.Final.jar        597.46kB c.o.svm.core.hub.DynamicHub$ReflectionMetadata
 401.02kB io.netty.netty-transport-4.1.111.Final.jar         510.38kB java.util.HashMap$Node
   8.47MB for 148 more packages                                5.97MB for 4166 more object types

3.12.3

  29.33MB (45.84%) for code area:    51,075 compilation units
  33.77MB (52.77%) for image heap:  374,656 objects and 76 resources
 910.64kB ( 1.39%) for other data
  63.99MB in total
------------------------------------------------------------------------------------------------------------------------
Top 10 origins of code area:                                Top 10 object types in image heap:
  12.09MB java.base                                            9.93MB byte[] for code metadata
   1.81MB c.f.jackson.core.jackson-databind-2.17.2.jar         5.25MB byte[] for java.lang.String
   1.73MB quarkus-runner.jar                                   4.02MB java.lang.Class
   1.41MB svm.jar (Native Image)                               3.47MB java.lang.String
   1.25MB modified-io.vertx.vertx-core-4.5.7.jar               1.37MB com.oracle.svm.core.hub.DynamicHubCompanion
 589.17kB com.fasterxml.jackson.core.jackson-core-2.17.2.jar 1002.65kB byte[] for general heap data
 588.80kB io.netty.netty-buffer-4.1.108.Final.jar            896.98kB byte[] for reflection metadata
 494.40kB io.netty.netty-common-4.1.108.Final.jar            699.94kB java.lang.String[]
 447.51kB io.netty.netty-codec-http-4.1.108.Final.jar        591.60kB c.o.svm.core.hub.DynamicHub$ReflectionMetadata
 399.07kB io.netty.netty-transport-4.1.108.Final.jar         510.56kB java.util.HashMap$Node
   8.28MB for 147 more packages                                6.12MB for 4139 more object types

3.11.3

  29.06MB (45.68%) for code area:    50,844 compilation units
  33.67MB (52.92%) for image heap:  373,003 objects and 76 resources
 912.34kB ( 1.40%) for other data
  63.62MB in total
------------------------------------------------------------------------------------------------------------------------
Top 10 origins of code area:                                Top 10 object types in image heap:
  12.09MB java.base                                            9.84MB byte[] for code metadata
   1.81MB c.f.jackson.core.jackson-databind-2.17.1.jar         5.21MB byte[] for java.lang.String
   1.61MB quarkus-runner.jar                                   4.01MB java.lang.Class
   1.40MB svm.jar (Native Image)                               3.45MB java.lang.String
   1.25MB modified-io.vertx.vertx-core-4.5.7.jar               1.37MB com.oracle.svm.core.hub.DynamicHubCompanion
 589.09kB com.fasterxml.jackson.core.jackson-core-2.17.1.jar 1001.41kB byte[] for general heap data
 588.80kB io.netty.netty-buffer-4.1.108.Final.jar            888.38kB byte[] for reflection metadata
 494.98kB io.netty.netty-common-4.1.108.Final.jar            696.34kB java.lang.String[]
 447.51kB io.netty.netty-codec-http-4.1.108.Final.jar        589.49kB c.o.svm.core.hub.DynamicHub$ReflectionMetadata
 399.07kB io.netty.netty-transport-4.1.108.Final.jar         510.19kB java.util.HashMap$Node
   8.16MB for 145 more packages                                6.20MB for 4134 more object types

Expected behavior

RSS usage stays similar between Quarkus 3.11 and 3.13 in MP application

Actual behavior

RSS usage increase between Quarkus 3.11 and 3.13 in MP application

How to Reproduce?

git clone https://github.com/quarkus-qe/quarkus-startstop
cd quarkus-startstop

sdk use java 21.0.2-graalce

mvn clean verify -pl testsuite -Dtest=StartStopTest#fullMicroProfileNative -Dquarkus.version=3.13.2 -Dstart-stop.iterations=25 | grep "AVG RSS"
mvn clean verify -pl testsuite -Dtest=StartStopTest#fullMicroProfileNative -Dquarkus.version=3.12.3 -Dstart-stop.iterations=25 | grep "AVG RSS"
mvn clean verify -pl testsuite -Dtest=StartStopTest#fullMicroProfileNative -Dquarkus.version=3.11.3 -Dstart-stop.iterations=25 | grep "AVG RSS"

Output of uname -a or ver

macOS

Output of java -version

Java 21

Quarkus version or git rev

3.13.2

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

quarkus-bot[bot] commented 3 months ago

/cc @Karm (mandrel), @galderz (mandrel), @zakkak (mandrel,native-image)

rsvoboda commented 3 months ago

I did some investigation on top of Quarkus main; my initial suspects (OpenTelemetry bump, SmallRye Config bump) didn't prove to be the cause. After that I tried several custom builds, but I wasn't able to pinpoint a concrete commit causing the issue. It's more like an accumulation of multiple changes.

Here is the dump of my notes (the goal is to get from 65MB to 59MB). I intentionally didn't add it to the description, which is already quite heavy:

mvn clean verify -pl testsuite -Dtest=StartStopTest#fullMicroProfileNative -Dquarkus.version=999-SNAPSHOT -Dstart-stop.iterations=25 | grep "AVG rssKb"

Quarkus main
2024-08-12 17:57:25.841 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 65532

Quarkus main without otel bump - https://github.com/quarkusio/quarkus/commit/b6f54993c43f955f2b20855b9c4d694af7c210f2
2024-08-12 18:02:20.956 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 65246

before MP Config update (HEAD detached at d7b84f6d01c)                                                  ~15.7.
  https://github.com/quarkusio/quarkus/commit/78045ea67413518f7c77271d08a70d03d1fffbbd
2024-08-12 18:17:15.291 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 64582

https://github.com/quarkusio/quarkus/commits/main/?after=00b31c814afe25e3d15e0a99d639f7e4383a0341+609   ~14.7.
git checkout a8dfd70cc8c916d0f1e7e5c0eea21832e90c8785
HEAD is now at a8dfd70cc8c resources, not resource for path
2024-08-12 19:49:55.790 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 64613

https://github.com/quarkusio/quarkus/commits/main/?after=00b31c814afe25e3d15e0a99d639f7e4383a0341+609   ~10.7.
git checkout aaf2960b388115bfd3e6fd903c71df8f6c1a8ba1
HEAD is now at aaf2960b388 Merge pull request #41793 from turing85/feature/parmeters-in-compile-plugin
2024-08-12 19:53:24.538 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 63120

https://github.com/quarkusio/quarkus/commits/main/?after=00b31c814afe25e3d15e0a99d639f7e4383a0341+679   ~9.7.
git checkout 37b1fb98322230a9165105b146c978e36afa93bd
HEAD is now at 37b1fb98322 WebSockets Next: Dev UI fixes
2024-08-12 19:59:05.640 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 62891

https://github.com/quarkusio/quarkus/commits/main/?after=00b31c814afe25e3d15e0a99d639f7e4383a0341+714   ~8.7.
git checkout eecc1ae55a45042e1fcb1bca36d70f8ad845a085
Previous HEAD position was 37b1fb98322 WebSockets Next: Dev UI fixes
2024-08-12 20:03:20.918 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 61257

https://github.com/quarkusio/quarkus/commits/main/?after=00b31c814afe25e3d15e0a99d639f7e4383a0341+709   ~6.7.
git checkout 72d8bf6840b5aaf30392a69c03e460c2670b9c96
    Update telemetry-micrometer.adoc
2024-08-12 18:25:57.942 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 61243
2024-08-12 20:06:16.126 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 61376

https://github.com/quarkusio/quarkus/commits/main/?after=00b31c814afe25e3d15e0a99d639f7e4383a0341+779   ~3.7.
git checkout 0ae63d59354cfca3a32b4006be3ca4dcabeb9870
HEAD is now at 0ae63d59354 Support multiple recipe artifacts in quarkus update
2024-08-12 20:11:35.490 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 61022

https://github.com/quarkusio/quarkus/commits/main/?after=00b31c814afe25e3d15e0a99d639f7e4383a0341+884   ~30.6.
git checkout b310b2699f2113276cf2f6a0ded72f7587f13693
HEAD is now at b310b2699f2 Bump ubi image to 1.19 for jib
2024-08-12 20:18:43.963 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 61079

https://github.com/quarkusio/quarkus/commits/main/?after=00b31c814afe25e3d15e0a99d639f7e4383a0341+954   ~26.6.
git checkout 872ffdcc60ac75a86c7e22258c9d3d1eb1ba2dba
HEAD is now at 872ffdcc60a Bump io.quarkus:quarkus-platform-bom-maven-plugin
2024-08-12 20:23:01.314 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG rssKb without min and max values: 60974

RSS is decreasing slowly, which leads me to the conclusion that it's more like an accumulation of multiple changes that results in the increased RSS I see between 3.11 and 3.13.

quarkus-bot[bot] commented 3 months ago

/cc @EricWittmann (openapi), @Karm (mandrel), @Ladicek (fault-tolerance,smallrye), @MikeEdgar (openapi), @Sanne (core), @aloubyansky (core), @brunobat (tracing), @cescoffier (mutiny,reactive-messaging), @ebullient (metrics), @galderz (mandrel), @gsmet (core), @jmartisk (graphql,health,metrics,smallrye), @jponge (mutiny), @ozangunalp (reactive-messaging), @phillip-kruger (graphql,openapi,smallrye), @radcortez (core,smallrye,tracing), @sberyozkin (jwt), @stuartwdouglas (core), @xstefank (health), @zakkak (mandrel,native-image)

geoand commented 3 months ago

cc @franz1981

franz1981 commented 3 months ago

Is this same trend confirmed on the CI with stable Linux machines as well? IIRC the start-stop test runs wild, without any max heap settings: what happens if you cap it to some value (e.g. twice the max heap used while running loose)?

In case it is, and the native-image command doesn't reveal anything relevant (@zakkak could help there, I think): if the additional weight comes from a progressively increasing set of features, it's OK. But if we believe we could make things better, or it is unacceptable, we can:

rsvoboda commented 3 months ago

Is this same trend confirmed on the CI with stable Linux machines as well?

That's perf lab only, we do not have bare metal machines in our lab.

What about JVM mode?

As mentioned in the description, JVM mode runs didn't reveal a noticeable difference. It's native mode where the diff manifests. It could be that, as JVM mode uses more memory, the diff is not detected ...

franz1981 commented 3 months ago

@rsvoboda I mean: the description here made me think it was a macOS run, but the difference is still the same when moving to Linux, right?

rsvoboda commented 3 months ago

Yes, @mjurc tried the reproducer commands on his Fedora machine and he saw a similar trend; the absolute numbers were of course a bit different. Michal confirmed that RSS was higher for 3.13 compared to 3.11.

galderz commented 3 months ago

@franz1981 I don't know if JFR alloc events are in place (@roberttoyonaga?). I worked on the old object event, but that's more for tracking leaks, and this doesn't smell like a leak right now.

@rsvoboda @mjurc if you run the test manually, can you see whether the RSS increase is due to startup and/or the first request? Again, if you run the test manually, you could take heap dumps after startup and after the first request with the different versions and see if we can spot some differences there.

One more thing to maybe try here is @roberttoyonaga's NMT work, in case the increases are not spotted in the heap contents.

galderz commented 3 months ago

@franz1981 Ah yes we now have ObjectAllocationInNewTLAB, definitely want to jfr that @rsvoboda @mjurc

Finally, @rsvoboda, run through the steps in https://quarkus.io/blog/native-startup-rss-troubleshooting and see if you can get a better understanding of the cause. As we highlighted there, the way native handles memory can lead to bigger increases from small changes compared to JVM mode.

roberttoyonaga commented 3 months ago

Yes, Native Image JFR has both the ObjectAllocationInNewTLAB event and the ObjectAllocationSample event. These events sample object allocations that result in requiring a new TLAB. They record stack traces, so you can generate flamegraphs in JMC too. ObjectAllocationSample is basically the same as ObjectAllocationInNewTLAB, but it's throttled to reduce overhead. It should be safe to enable that event in production, but ObjectAllocationInNewTLAB has very high overhead.

Yes, like Galder mentioned, you could also try "native memory tracking" to see where native allocations are happening off the "Java" heap. There's a blog post here: https://developers.redhat.com/articles/2024/05/21/native-memory-tracking-graalvm-native-image#

rsvoboda commented 3 months ago

Thanks for the pointers. I will try to look more into it, but my week is quite packed before I go on PTO next week.

jerboaa commented 3 months ago

@roberttoyonaga @rsvoboda We should mention that not all upstream features in master or 24.1 are in a released version of Mandrel yet, so feature availability may vary. AFAIK, the main Mandrel version in Quarkus is 23.1 (Mandrel for JDK 21).

roberttoyonaga commented 3 months ago

Oh yes, that's true. ObjectAllocationSample and native memory tracking are in the GraalVM EA build. But heap dumping and the ObjectAllocationInNewTLAB JFR event will be available in mandrel for JDK 21.

jerboaa commented 3 months ago

Oh yes, that's true. ObjectAllocationSample and native memory tracking are in the GraalVM EA build.

FWIW, the CE (GraalVM community) early access builds, which we should be testing, are here: https://github.com/graalvm/graalvm-ce-dev-builds/releases. They're based on OpenJDK and open-source code.

zakkak commented 3 months ago

I have tried reproducing this on my Fedora setup with both GraalVM CE 21.0.2 (from SDKMAN!) and the default builder image (Mandrel 23.1.4.0), and the results are:

3.11.3.graal.txt:2024-08-20 16:26:28.538 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 72581

3.11.3.txt:2024-08-20 16:13:38.929 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 72692

3.13.2.graal.txt:2024-08-20 16:28:15.357 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 73427

3.13.2.txt:2024-08-20 16:11:12.395 INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 73282

Given that the image size increased by ~1MB, I don't find the above increase in RSS usage surprising, but the RSS increase reported by @rsvoboda is quite a bit larger than that :/

@mjurc when reproducing this, did you notice an increase similar to what @rsvoboda reports, or a smaller one like in my measurements?

@rsvoboda what are the thresholds set to in the CI? Can you also share the actual numbers from those runs? The default in the repository seems quite high at 90000 kB.

franz1981 commented 3 months ago

Thanks @zakkak for checking; IIRC that benchmark does NOT have any limit on memory (xms/xmx) or the number of cores, both of which could affect RSS through the way allocations happen vs GC cycles (vs the malloc arenas used). Maybe it would be beneficial to have both pieces of information while trying to reproduce it.

radcortez commented 3 months ago

@franz1981, how is the config-quickstart RSS?

franz1981 commented 3 months ago

@radcortez still on holiday and with no VPN access, @johnaohara can answer you...but last time I checked there was a super small increase

zakkak commented 3 months ago

Thanks @zakkak for checking; IIRC that benchmark does NOT have any limit on memory (xms/xmx) or the number of cores, both of which could affect RSS through the way allocations happen vs GC cycles (vs the malloc arenas used). Maybe it would be beneficial to have both pieces of information while trying to reproduce it.

@franz1981 If I get it right, it sets xmx to 96m (see https://github.com/quarkus-qe/quarkus-startstop/blob/643dadc30f810333f8a0c8ef846e7aaac53f9a7e/testsuite/src/it/java/io/quarkus/ts/startstop/utils/MvnCmds.java#L24); regarding the rest, you are right, it doesn't control them.

franz1981 commented 3 months ago

Uh, I had missed it (my memory is failing me, sorry :"/ I should have checked) - thanks for verifying it.

rsvoboda commented 3 months ago

I did some experiments with JFR and NMT recording with the Quarkus MP app. As https://developers.redhat.com/articles/2024/05/21/native-memory-tracking-graalvm-native-image suggests, I used an EA build of GraalVM.

Here is recording-nmt-3.11-3.13.zip with recording-nmt-3.11.3.jfr and recording-nmt-3.13.2.jfr. GitHub allows only "known file extensions" and thus I had to zip the files.

Native binaries were produced from the https://github.com/quarkus-qe/quarkus-startstop repo using this procedure:

export PATH=/Users/rsvoboda/Downloads/graalvm-jdk-24+10.1/Contents/Home/bin:$PATH

mvn clean package -f app-full-microprofile -Dnative -Dquarkus.native.additional-build-args='--enable-monitoring=heapdump\,jfr\,nmt' -Dquarkus.version=3.13.2
mv app-full-microprofile/target/quarkus-runner ~/Downloads/app-full-microprofile-nmt-3.13.2 

mvn clean package -f app-full-microprofile -Dnative -Dquarkus.native.additional-build-args='--enable-monitoring=heapdump\,jfr\,nmt' -Dquarkus.version=3.11.3
mv app-full-microprofile/target/quarkus-runner ~/Downloads/app-full-microprofile-nmt-3.11.3 

JFR was produced using:

get_jfr () {
  $1 -XX:+FlightRecorder -XX:StartFlightRecording=filename=$2 &
  APPLICATION_PID=$!

  counter=0
  until curl -s -f -o /dev/null "http://127.0.0.1:8080/q/health/ready"
  do
    sleep 0.001
    counter=$((counter+1))
  done
  echo "Cycles: $counter"
  echo "RSS: `ps -p $APPLICATION_PID -o rss=`"
  sleep 0.5
  echo "RSS after 0.5s: `ps -p $APPLICATION_PID -o rss=`"

  kill -SIGTERM "$APPLICATION_PID"
  wait "$APPLICATION_PID"
}

get_jfr ${HOME}/Downloads/app-full-microprofile-nmt-3.13.2 ${HOME}/Downloads/recording-nmt-3.13.2.jfr
get_jfr ${HOME}/Downloads/app-full-microprofile-nmt-3.11.3 ${HOME}/Downloads/recording-nmt-3.11.3.jfr

I noticed that the flame graph under Event Browser - Java Application has some additional bars for 3.13 compared to 3.11.

Screenshot 2024-08-22 at 16 57 34

At this stage I think I'm at the limits of my knowledge and investigation; any insight would be welcome. I shared my reproducer steps so anybody can play with this.

roberttoyonaga commented 3 months ago

Hi @rsvoboda! I was able to reproduce JFR files similar to yours and saw a similar RSS increase of ~4MB.

I changed your script slightly by adding -XX:+PrintNMTStatistics to dump a native memory usage report. I also supplied a custom JFR configuration (.jfc file) where I bumped the ObjectAllocationSample event throttling rate up to 100/s and turned on the ObjectAllocationInNewTLAB event.

So the command now became: $1 -XX:+FlightRecorder -XX:StartFlightRecording=filename=$2,settings=quarkus-demo.jfc -XX:+PrintNMTStatistics
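For reference, a .jfc along those lines could look like the fragment below. The event names are the standard JFR event names; the file itself is my reconstruction, not the actual quarkus-demo.jfc used in this thread:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative fragment, not the actual quarkus-demo.jfc -->
<configuration version="2.0" label="quarkus-demo">
  <event name="jdk.ObjectAllocationSample">
    <setting name="enabled">true</setting>
    <setting name="throttle">100/s</setting>
  </event>
  <event name="jdk.ObjectAllocationInNewTLAB">
    <setting name="enabled">true</setting>
    <setting name="stackTrace">true</setting>
  </event>
</configuration>
```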

NMT results: 3.13, 3.11. Only 64KB more memory is used in 3.13, mainly due to JFR. So native memory usage is probably not the culprit.

Next, I looked at the ThreadAllocationStatistics JFR event. I noticed that the "main" thread is the biggest allocator by several orders of magnitude. It also looks like the "main" thread allocates 3-4 MB more in 3.13 than in 3.11. This approximately matches the overall RSS increase we're seeing.

Next, I looked at the ObjectAllocationSample JFR event. If I limit the flamegraphs to only show ObjectAllocationSample events emitted from the "main" thread (to reduce noise a bit), I notice that SmallRyeConfigBuilder.build() has significantly more samples in 3.13 than in 3.11. Each of those samples corresponds to a new TLAB of size 512KB (this is due to where the allocation sampling instrumentation is placed in SubstrateVM).

3.13: Screenshot from 2024-08-22 17-17-04
3.11: Screenshot from 2024-08-22 17-18-18

When I turn on the ObjectAllocationInNewTLAB event I see that ~7 more TLABs are allocated in 3.13, than in 3.11.

Additionally, in the flamegraphs you show in your comment above, I think you are displaying all events, not just allocations. This means all events with stack traces are represented in those graphs (ThreadStart, MonitorEnter, etc.), so they don't give a good picture of what's happening with allocations specifically.

So maybe the RSS increase has something to do with SmallRyeConfigBuilder?

franz1981 commented 3 months ago

Thanks @roberttoyonaga for the analysis

So, a few action points:

I will come back from vacation next week, but thanks for proceeding in this ❤️

gsmet commented 3 months ago

I started having a quick look at that and I can add some more data.

From a CPU standpoint, the main contributors are:

From an allocation perspective, the story is relatively similar (except for the UUID part).

gsmet commented 3 months ago

So these are the two flamegraphs of where SmallRyeConfig.getPropertyNames() is involved.

Here you have it very visible:

Screenshot from 2024-08-24 17-18-00

Here it's at the very left:

Screenshot from 2024-08-24 17-21-57

gsmet commented 3 months ago

Then I had also a few other comments about SmallRye Config init:

Screenshot from 2024-08-24 17-25-19

Screenshot from 2024-08-24 17-29-27

I wonder if we should have a build time phase for SmallRye Config that would initialize information about:

Basically everything that we could get from the class path scanning given we are in a closed world. And the SmallRye Config bootstrap would take it into account and avoid some work when in the context of Quarkus.

@radcortez you would have to confirm if it makes some sense or if it's totally crazy.

From my perspective, the config is a critical part of infra and we have seen it being an important contributor to the bootstrap so it would be worth investing in optimizing this if we think it makes sense.

I would be willing to help.

geoand commented 3 months ago

This is very useful information, thanks!

I've long thought that we need to put some effort into optimizing the Observability stack. A low-footprint framework like Quarkus should try hard to make production-ready features like observability as low footprint as possible. @brunobat do you think we are in a good place in terms of features, such that it makes sense to start looking in the optimization direction?

franz1981 commented 3 months ago

Nice analysis @gsmet. Did you use the "new" async-profiler support I added to the start-stop test? Or was it captured manually?

They look like JVM mode ones.

franz1981 commented 3 months ago

@radcortez @gsmet I'm sending a small patch (I got some cycles this weekend to have fun)

The simplest thing I see that is not right there is the huge number of HyphenateEnumConverter instances allocated for the same enum class, i.e. io.smallrye.openapi.api.OpenApiConfig$OperationIdStrategy.

Just dropping them will reduce the allocation pressure significantly, see https://github.com/quarkusio/quarkus/pull/42745

franz1981 commented 3 months ago

@rsvoboda https://github.com/quarkusio/quarkus/pull/42745 should have improved things enough - let me know if you can check whether it's enough to close this. In my tests that expensive converter was allocated more than 200 times, using a wrongly sized HashMap and StringBuilder, which made things even worse.
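The gist of dropping the repeated converter allocations is to keep one converter per enum type instead of creating a fresh instance on every config lookup. A minimal sketch (all names below are hypothetical, not the actual SmallRye Config or Quarkus code):

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of the fix's idea: cache one converter per enum type so
// repeated lookups reuse the same instance instead of allocating a new
// HyphenateEnumConverter-style object each time.
final class EnumConverterCache {
    private static final Map<Class<?>, Function<String, ?>> CACHE = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    static <E extends Enum<E>> Function<String, E> forEnum(Class<E> type) {
        Function<String, ?> cached = CACHE.computeIfAbsent(type, t -> {
            Function<String, E> converter = value -> Enum.valueOf(type, hyphenate(value));
            return converter;
        });
        return (Function<String, E>) cached;
    }

    // Minimal stand-in for hyphenated-name normalization: "class-method" -> "CLASS_METHOD"
    static String hyphenate(String value) {
        return value.trim().replace('-', '_').toUpperCase(Locale.ROOT);
    }

    // Example enum mirroring the shape of OpenApiConfig$OperationIdStrategy
    enum OperationIdStrategy { METHOD, CLASS_METHOD, PACKAGE_CLASS_METHOD }
}
```

With this shape, requesting the converter for the same enum class twice returns the identical instance, so the 200+ allocations observed above collapse to one per type.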

rsvoboda commented 2 months ago

@franz1981 I will check Quarkus main after (or during ;) ) the calls I have today.

franz1981 commented 2 months ago

@rsvoboda @gsmet things now seem a bit better, but we still have a few converters allocated for the same class; I hope to send another one-liner PR today to fix that, but it's just a minor issue.

brunobat commented 2 months ago

This is very useful information, thanks!

I've long thought that we need to put some effort into optimizing the Observability stack. A low-footprint framework like Quarkus should try hard to make production-ready features like observability as low footprint as possible. @brunobat do you think we are in a good place in terms of features, such that it makes sense to start looking in the optimization direction?

I think we should burn some cycles doing performance work even if we are not feature complete, just for the sake of measuring the impact of upgrades and new features...

We already have performance work done on the OTel tracing stack; however, the results were not conclusive and no performance red flags were found in OTel... But Luis Barreiro did fix a bug in an ActiveMQ client because of it. That said, I've started a new repository to perform more targeted tests. See: https://github.com/brunobat/quarkus-observability-perf

I already have a meeting scheduled with @franz1981 to go more in-depth into it.

Cheers!

geoand commented 2 months ago

Cool, feel free to include me as well, as I am very interested in this.

franz1981 commented 2 months ago

FYI @gsmet @brunobat https://github.com/quarkusio/quarkus/pull/42758

Not sure it's "right" - but looking at the impl it seems correct

rsvoboda commented 2 months ago

I ran my local experiments with Quarkus main, Quarkus 3.13, and Quarkus 3.13 + the https://github.com/quarkusio/quarkus/pull/42745 PR.

tl;dr: we got back to good (3.11-like) numbers on my machine

Quarkus 3.11.3: 61122 kB
Quarkus main: 61623 kB
Quarkus 3.13: 65646 kB
Quarkus 3.13 + https://github.com/quarkusio/quarkus/pull/42745 PR: 61465 kB

https://github.com/quarkusio/quarkus/pull/42745 removes the line with the unused converter allocation. That line was introduced already in 3.9 with https://github.com/quarkusio/quarkus/pull/39196.

So we have that line in 3.11 (https://github.com/quarkusio/quarkus/blame/3.11/extensions/smallrye-openapi/runtime/src/main/java/io/quarkus/smallrye/openapi/runtime/OpenApiConfigMapping.java#L31), which is the base version used to identify the RSS difference reported in this issue.

So I went ahead and tried the 3.11 branch without that one line in OpenApiConfigMapping, and I got better results:

Quarkus 3.13 + https://github.com/quarkusio/quarkus/pull/42745 PR: 61465 kB
Quarkus 3.11 + https://github.com/quarkusio/quarkus/pull/42745 PR: 58714 kB

That's about 4-5% better than the original 3.11.3 baseline. So there is definitely something else to look into ...

franz1981 commented 2 months ago

So I went ahead and tried 3.11 branch without that one line in OpenApiConfigMapping

Nice and well done @rsvoboda! So it seems I've improved things enough to be on par, although without really fixing the actual cause (between 3.11 and 3.13) :+1:

radcortez commented 2 months ago

So these are the two flamegraphs of where SmallRyeConfig.getPropertyNames() is involved.

Yes, this is expected. One is for the static init config, and the other is for the runtime config.

Unfortunately, I was already aware that it is an expensive operation. It was worse in the past; the names are now cached. The issue is with https://github.com/smallrye/smallrye-config/blob/df6ac6251e757f90eaf814c9c4f0044c9435849b/implementation/src/main/java/io/smallrye/config/ConfigSourceInterceptor.java#L54.

This allows each interceptor to rewrite the list of names, which is required for relocations, fallbacks, profiles, etc. Not only are there a lot of String operations happening here, but also List and Iterator allocations.
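The name-rewriting cost can be illustrated with a small sketch (the interface and class below are hypothetical, not SmallRye Config's actual API): each interceptor in the chain copies the incoming names into a fresh list while applying String transformations, so a chain of N interceptors pays N list copies plus the transformed names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch (not SmallRye Config's actual API): an interceptor that
// rewrites the full list of property names to add fallback aliases. Every pass
// allocates a fresh List plus transformed Strings, which is the allocation
// cost described above.
interface NameInterceptor {
    Iterator<String> iterateNames(Iterator<String> names);
}

final class FallbackNames implements NameInterceptor {
    private final String from;
    private final String to;

    FallbackNames(String from, String to) {
        this.from = from;
        this.to = to;
    }

    @Override
    public Iterator<String> iterateNames(Iterator<String> names) {
        List<String> rewritten = new ArrayList<>(); // a fresh list per interceptor pass
        while (names.hasNext()) {
            String name = names.next();
            rewritten.add(name);
            if (name.startsWith(from)) {            // String work per name
                rewritten.add(to + name.substring(from.length()));
            }
        }
        return rewritten.iterator();
    }
}
```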

I have some ideas on how to improve this, but I have been avoiding this until it becomes a real problem because it will require breaking the API :)

radcortez commented 2 months ago

I wonder if we should have a build time phase for SmallRye Config that would initialize information about:

We have been slowly doing that work since version 3.3, when we made some of the significant Config improvement changes:

And many other minor changes. The strategy has mostly been to go for the low-hanging fruit / bigger bottlenecks.

  • the property names around

Not sure if we can do something here. The list of property names is always going to depend on the sources available at each phase. The real bottleneck is not reading the property (it may be, if we are reading properties from a slow source, a database for instance), but the name transformation and the collection allocations in the interceptors.

  • the converters around - with their priorities
  • the config sources

This is mostly done already: https://github.com/quarkusio/quarkus/blob/3877f161524d23217a328cd5f2cc777daefa275e/core/deployment/src/main/java/io/quarkus/deployment/steps/ConfigGenerationBuildStep.java#L200-L285

There are still a few leftovers, like the Converter @Priority annotation. Also, there are cases where the user / library creates their own SmallRyeConfig instance (for numerous reasons). I'm not sure if that is the case here, but when it happens there is not much we can do about it.

@radcortez you would have to confirm if it makes some sense or if it's totally crazy.

Yes, it makes sense. We have been doing that work. Let me run some flamegraphs and I'll try to come up with a list of things to improve by priority / effort.

radcortez commented 2 months ago

@rsvoboda #42745 should have improved things enough - let me know if you can check whether it's enough to close this. In my tests that expensive converter was allocated more than 200 times - using a wrongly sized HashMap and StringBuilder, which made things even worse.

Can you elaborate on the HashMap and StringBuilder problem?

gsmet commented 2 months ago

The list of property names are always going to be dependent of the sources available at each phase. The real bottleneck is not reading the property (it may be, if we are reading properties from a slow source, database for instance), but the name transformation and the collection allocation in the interceptors.

So I agree with all this but my question is: don't you know which sources are available for each phase at build time?

Note that I know nothing about SmallRye Config so I'm asking dumb questions :).

There are still a few leftovers, like the Converter @Priority annotation.

That's cool we made progress on this and I know you did a lot of work on this already. Would still be cool to take a step back and see if we could tackle this once and for all. The best way to optimize code is to not execute it :).

Can you elaborate on the HashMap and StringBuilder problem?

Dunno about the HashMap but I saw a wrongly sized StringBuilder in StringUtil#skewer(). I actually tried to improve it and did some JMH benchmarks, and it was slower with the right size... And consistently slower. I wanted to ping @franz1981 about this as it went against everything I would have expected.

We cannot always guess the size of it but we could at least have a better approximation than the default 16.
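For illustration, here is a simplified camelCase-to-kebab-case conversion in the spirit of the discussion. This is not the actual io.quarkus.runtime.util.StringUtil#skewer implementation, and the presizing heuristic is an assumption; it only shows the idea of approximating capacity instead of relying on StringBuilder's default of 16:

```java
// Simplified sketch, NOT the real StringUtil#skewer: converts camelCase to
// kebab-case. The output grows by one '-' per upper-case letter, so presizing
// the StringBuilder close to the final length avoids the default capacity of
// 16 and the array growth/copy it causes for longer property names.
public class SkewerSketch {
    static String skewer(String name) {
        // Approximation, not an exact size: input length plus a little slack
        // for the inserted dashes.
        StringBuilder sb = new StringBuilder(name.length() + 8);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isUpperCase(c)) {
                if (i > 0) {
                    sb.append('-');
                }
                sb.append(Character.toLowerCase(c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(skewer("enableReflectionFreeSerializers"));
    }
}
```

As the JMH results quoted above suggest, whether presizing actually wins depends on allocation and copy behavior in the JIT/native runtime, so any change here should be benchmarked rather than assumed.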

Yes, it makes sense. We have been doing that work. Let me run some flamegraphs and I'll try to come up with a list of things to improve by priority / effort.

The sense of my message was that maybe we shouldn't spend too much time doing micro-optimizations but take a step back if we could completely skip some work. Note that I have no idea if it's actually doable, just throwing crazy ideas at you. I'm very willing to discuss it with you if you come up with some ideas.

gsmet commented 2 months ago

@rsvoboda can you easily drop some of the extensions in your app? Maybe that would be a good method to try to pinpoint what's going on. Because the OpenTelemetry stuff for instance has changed a lot and could also be the culprit.

If it's easy, I would try to drop some extension, see how it goes, then reintroduce it and drop the next.

geoand commented 2 months ago

I'm all for taking a step back and looking at more grand solutions, let's just keep in mind however that SR Config is super sensitive to regressions. Changing something in one area often leads to issues no one had foreseen, simply because SR Config is so central to both the build-time and runtime phases of Quarkus.

rsvoboda commented 2 months ago

The app is here: https://github.com/quarkus-qe/quarkus-startstop/tree/main/app-full-microprofile

I can try to trim the extensions to see if things can be more visible.

Btw, I already looked at the otel bump, see https://github.com/quarkusio/quarkus/issues/42506#issuecomment-2286015021. There is a "Quarkus main without otel bump" line there - https://github.com/quarkusio/quarkus/commit/b6f54993c43f955f2b20855b9c4d694af7c210f2 - and it didn't change RSS much.

franz1981 commented 2 months ago

@radcortez

Can you elaborate on the HashMap and StringBuilder problem?

Yep, the enum converter often requires just a single-element map, which could use Map::of, avoiding the default entry size of HashMap. But https://github.com/quarkusio/quarkus/pull/42745 already improved things enough that not that many enum converters are created anymore (likely just 2 or something similar)...
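A small self-contained sketch of the allocation difference mentioned above (an illustrative enum and lookup, not the real SmallRye converter code): a default-constructed HashMap allocates a 16-bucket table on first insert, while Map.of returns a compact immutable map sized for exactly one entry.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a converter-style one-entry name -> constant lookup.
public class EnumConverterSketch {
    enum Mode { FAST }

    // Default-capacity HashMap: a 16-slot bucket array for a single mapping.
    static final Map<String, Mode> HASH_MAP_LOOKUP = new HashMap<>();
    static {
        HASH_MAP_LOOKUP.put("fast", Mode.FAST);
    }

    // Compact alternative: immutable map sized for exactly one entry.
    static final Map<String, Mode> MAP_OF_LOOKUP = Map.of("fast", Mode.FAST);

    public static void main(String[] args) {
        // Both lookups behave identically; only the allocation footprint
        // differs. Multiplied by hundreds of converter instances (as observed
        // before #42745), the waste becomes measurable in RSS.
        System.out.println(HASH_MAP_LOOKUP.get("fast"));
        System.out.println(MAP_OF_LOOKUP.get("fast"));
    }
}
```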

rsvoboda commented 2 months ago

I did some experiments and the biggest diff I noticed is with resteasy-reactive-jackson

I have prepared a branch for you: https://github.com/rsvoboda/quarkus-startstop/tree/reactive-jackson-and-health. The Quarkus main I use is based on 60994bfa8d3.

Command:

mvn clean verify -pl testsuite -Dtest=StartStopTest#fullMicroProfileNative -Dquarkus.version=999-SNAPSHOT -Dstart-stop.iterations=25 | grep "AVG RSS"
mvn clean verify -pl testsuite -Dtest=StartStopTest#fullMicroProfileNative -Dquarkus.version=3.11.3 -Dstart-stop.iterations=25 | grep "AVG RSS"

Results (first row is main, second row is 3.11.3)

quarkus-resteasy-reactive-jackson + quarkus-smallrye-health
INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 39653
INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 37689

quarkus-resteasy-reactive-jackson
INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 38055
INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 36221

We have a simple REST app without a Jackson dependency - https://github.com/rsvoboda/quarkus-startstop/blob/reactive-jackson-and-health/app-jakarta-rest-minimal/src/main/java/com/example/quarkus/HelloController.java. There is also a noticeable diff:

mvn clean verify -pl testsuite -Dtest=StartStopTest#jakartaRESTMinimalNative -Dquarkus.version=999-SNAPSHOT -Dstart-stop.iterations=25 | grep "AVG RSS"
mvn clean verify -pl testsuite -Dtest=StartStopTest#jakartaRESTMinimalNative -Dquarkus.version=3.11.3 -Dstart-stop.iterations=25 | grep "AVG RSS"

INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 37493
INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 36109

I did more experiments with the MP application and various extension combinations, but unfortunately I didn't eliminate quarkus-smallrye-openapi from the mix: Quarkus main contains https://github.com/quarkusio/quarkus/pull/42745 but 3.11 still does those unnecessary initializations. I will do new checks without the openapi dependency and post them tomorrow.

rsvoboda commented 2 months ago

RSS is measured after invocation of the first array entry from https://github.com/rsvoboda/quarkus-startstop/blob/reactive-jackson-and-health/testsuite/src/it/java/io/quarkus/ts/startstop/utils/URLContent.java#L14

Adding new String[]{"http://localhost:8080/data/serialization/json/complex-dto", "Vive la résistance!"} as the first line of the FULL_MICROPROFILE enum triggers measurement with Jackson; the diff is again ~2MB.

quarkus-resteasy-reactive-jackson + quarkus-smallrye-health + /data/serialization/json/complex-dto invocation
INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 41492
INFO  [i.q.t.s.StartStopTest] (testRuntime) AVG RSS (kB) without min and max values: 39454

franz1981 commented 2 months ago

OT: now in main we could use quarkus.rest.jackson.optimization.enable-reflection-free-serializers=true

And see what happens (in theory it should help).

@rsvoboda I see in the pom that the dep still uses the old name: https://github.com/rsvoboda/quarkus-startstop/blob/reactive-jackson-and-health/app-full-microprofile/pom.xml

Why not (as per https://quarkus.io/guides/rest-migration) rename it (as with any other dep which needs it) to quarkus-rest-jackson?

I remember that there was something in the Quarkus startup which takes care of correctly detecting the previous naming convention (see https://github.com/quarkusio/quarkus/pull/39444 for an improvement I sent in that part).

rsvoboda commented 2 months ago

I did that "switch back" intentionally a few days ago - https://github.com/rsvoboda/quarkus-startstop/commit/414a99e454fa8248ac824c041c489552d9084780