3.2.4.Final is slower compared to 2.15.1.Final

Karthikrao121 commented 1 year ago

Description

We recently migrated to the latest version of quarkus. After the upgrade to the latest version we are seeing slower startup times compared to the 2.15.1.Final version.

native (powered by Quarkus 2.15.1.Final) started in 0.308s. native (powered by Quarkus 3.2.4.Final) started in 0.423s.

Need migration guide for performance enhancements on 3.2.4.Final.

Implementation ideas

No response

gsmet commented 1 year ago

Need migration guide for performance enhancements on 3.2.4.Final.

I think we would first need a lot more data :).

The list of extensions would be a good start. For instance, we know that Hibernate ORM 6 comes with more classes that need to be initialized and causes a slow down.

Then I would continue by checking what's happening in JVM mode and if it is significantly slower as well, you could have a look at https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md to profile it.

Then we might want to actually profile the native executable but I'm not sure how to do that. @zakkak would know if it's possible.

zakkak commented 1 year ago

Then we might want to actually profile the native executable but I'm not sure how to do that. @zakkak would know if it's possible.

That's definitely possible, but as you noted it should be the last step :)

You can see some examples in https://quarkus.io/guides/native-reference#inspecting-and-debugging

Furthermore, @galderz and @roberttoyonaga recently wrote a couple of great guides related to profiling native executables:

Karthikrao121 commented 1 year ago

These are from 2.15.1.Final Installed features: [amazon-dynamodb, amazon-lambda, amazon-s3, amazon-secretsmanager, amazon-ssm, cdi, logging-json, rest-client-reactive, rest-client-reactive-jackson, resteasy-reactive, resteasy-reactive-jackson, security, smallrye-context-propagation, smallrye-openapi, vertx]" These are from 3.2.4.Final "Installed features: [amazon-dynamodb, amazon-lambda, amazon-s3, amazon-secretsmanager, amazon-ssm, cdi, logging-json, rest-client-reactive, rest-client-reactive-jackson, resteasy-reactive, resteasy-reactive-jackson, security, smallrye-context-propagation, vertx]"

maxandersen commented 1 year ago

@Karthikrao121 have you tried some of suggestions above to trouble shoot where changes in performance occur? are you only seeing startup changes or also changes in memory and throughput runtime performance?

Karthikrao121 commented 1 year ago

@maxandersen memory consumption also increased on 3.2.4.Final. on 2.15.Final Memory Size: 128 MB Max Memory Used: 116 MB startup times are around 400 ms to 500 ms on 3.2.4.Final Memory Size: 256 MB Max Memory Used: 154 MB even with memory increase startup times are consistently over 600ms

gsmet commented 1 year ago

@Karthikrao121

For the startup time, have a look at https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md , you should be able to get a CPU profile that would be a good start. Get one for both versions so that we can compare.

For the memory, it would be helpful if you could get a heap dump.

There's a good chance you won't want to make these public but if you're comfortable sharing them with me (and they don't contain sensitive information), you can send me an email with a download link at gsmet at redhat dot com.

The other option is that you have a look yourself and share your insights.

Thanks.

Karthikrao121 commented 1 year ago

@gsmet please provide your insights on the shared CPU profile details.

geoand commented 1 year ago

@franz1981 do we have any resources to help @Karthikrao121 answer the question above? I can obviously point to the AsyncProfiler repo, but I am wondering if we have any of our own resources about using it :)

franz1981 commented 1 year ago

Sadly we don't have any resources re quarkus for async profiler, because it's a kind of "moving target".

Before using async profiler we need to understand first if they can observe the same behaviour in JVM mode. Regardless, If they are running in a container or not, and how many resources they dedicate to it. The latter part is very important, because an increased allocation pressure can pass nearly unnoticed with enough available memory (GCs works like that); in short: if they have lessen constrains (cpu/mem) things get better? If yes, which resource impact the most? That would narrow If is just a memory related problem (as it seems).

As suggest by @gsmet If they could collect profiling data and/or use the method from @galderz to check if more (native) allocations are happening and where, on native mode, it would be great/perfect. If they can provide collapsed/folded profiling data of the 2 versions, flamegraphs allow to easily create diff flamegraphs for comparison, which are pretty handy (see https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html).

If they cannot troubleshoot in native we have 2 other ways:

the same behaviour is reproducible with JVM mode (in an unconstrained env, CPU and memory-wise)? If yes, and it's related allocation, they can use https://github.com/quarkusio/quarkus/issues/35406#issuecomment-1692824446 to provide data about allocations happening at startup in the 2 cases. It would help to trivially find and fix this, given that the difference in memory used was noticiable
Occam rule: exclude the easier things to check between the versions which can impact startup time. Context propagation should have changed a lot; they really need it? What happen if they tune it to reduce its overhead? @mkouba told me about a property to do it (maybe @geoand remember as well). It's just an example, I don't know if ctx prop can impact startup time, but I hope it's clear what I meant

Karthikrao121 commented 1 year ago

I did provide the async cpu profiles from both versions to @gsmet. what we observed is if we increase the memory and cpu the start up times for 3.2.4.Final are as same as 2.15.1.Final. But 2.15.1.Final is doing faster start up times at 128 MB where as 3.2.4.Final is doing the same start up times at around 700 MB to 1024 MB.

franz1981 commented 1 year ago

Which seems to suggest:

either the live set is increased and the available one to allocate young gen obj is reduced; which means that less memory will make more work to be moved into the slower major/mixed pauses (JVM mode eh), and more moving/compacting activity from the GC perspective (which, with less CPU resources, code compilation will kick in slower, and application code will pause more frequently)
we allocate more volatile/temporary objects and having less memory would make more obj to prematurely become tenured (because surviving more, If their life cycles is short both not short enough): the effects on the CPU resources are similar to the previous point, given that GC will behave the same

I will stop my wild guesses and waiting the @gsmet analysis

Thanks for sharing the data with him already

galderz commented 1 year ago

But 2.15.1.Final is doing faster start up times at 128 MB where as 3.2.4.Final is doing the same start up times at around 700 MB to 1024 MB.

What are these memory sizes you mention? -Xmx values? Container memory size limits?

galderz commented 1 year ago

Also, can you post here what you sent to @gsmet?

Karthikrao121 commented 1 year ago

These are memory allocations on AWS lambda.

gsmet commented 1 year ago

@Karthikrao121 sorry, I just found your email in my spam folder after my PTO. I think what you sent is dev mode profiling which makes it hard to pinpoint potential slow downs in your app itself. Also given you found the regression with 3.2.4.Final, could you stick to this version? We have a known regression in 3.3.0 so that will make it harder to track what you noticed in 3.2.4 already.

What we would need is this one: https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md#profiling-application-startup-with-async-profiler , profiling startup by running the application jar.

Send me a flamegraph with 2.x and the same app with 3.2.4.Final.

Thanks!

franz1981 commented 1 year ago

@gsmet still on PTO but to save troubles due to different architecture they can use https://github.com/jvm-profiling-tools/ap-loader which pack different binaries in. In addition I suggest to produce jfr output which allow to extract much more info via jfr2flame conversion eg --classify, --lines and others... @gsmet we can have a quick call in 2 days when I am back too

gsmet commented 1 year ago

Let's start simple first. And if we need additional instructions in TROUBLESHOOTING.md, we should work on it and make them as easy to follow as possible.

geoand commented 1 year ago

And if we need additional instructions in TROUBLESHOOTING.md, we should work on it and make them as easy to follow as possible.

+1

franz1981 commented 1 year ago

Yeah, my point is to save users to capture data over and over: If they capture the data in jfr format we can extract different types of flamegraphs ourselves, without requiring the users to collect more...

franz1981 commented 11 months ago

@Karthikrao121 any update?

@radcortez worked hard on improving few aspects re performance config which had a tremendous impact on both native and JVM mode startup time; I'm not aware which specific versions you can try and check if deliver the expected level of performance, maybe Roberto or @gsmet knows, and If you will give it a shot this can give us an additional (indipendent) feedback.

Additionally, in our https://quarkus.io/guides/performance-measure we explained that startup time is for us what makes the application available to users, which imply awaiting the first request completion. This, compared to what quarkus print as startup time, include few more class loading and initializations which can affect the final numbers: did you perform the same measurements in your startup time tests? I am saying that, because this is likely the measure you really are interested in.

radcortez commented 11 months ago

@radcortez worked hard on improving few aspects re performance config which had a tremendous impact on both native and JVM mode startup time; I'm not aware which specific versions you can try and check if deliver the expected level of performance, maybe Roberto or @gsmet knows, and If you will give it a shot this can give us an additional (indipendent) feedback.

Yes, we have made a bunch of improvements to both start time and RSS in both JVM and Native mode. A few of these improvements are already available in 3.4.x, but the big chunk should only be available in 3.5.x.

Karthikrao121 commented 11 months ago

Gsmet,

The startup time decreased with 3.4.1 version around 100 ms . Sorry for the late reply. We have seen some improvements with the newer version.

On Mon, Sep 25, 2023, 04:26 Roberto Cortez @.***> wrote:

@radcortez https://github.com/radcortez worked hard on improving few aspects re performance config which had a tremendous impact on both native and JVM mode startup time; I'm not aware which specific versions you can try and check if deliver the expected level of performance, maybe Roberto or @gsmet https://github.com/gsmet knows, and If you will give it a shot this can give us an additional (indipendent) feedback.

Yes, we have made a bunch of improvements to both start time and RSS in both JVM and Native mode. A few of these improvements are already available in 3.4.x, but the big chunk should only be available in 3.5.x.

— Reply to this email directly, view it on GitHub https://github.com/quarkusio/quarkus/issues/35392#issuecomment-1733274739, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACXWFT54OLZALHKER4Z3GPTX4FE5ZANCNFSM6AAAAAA3TT4OFI . You are receiving this because you were mentioned.Message ID: @.***>

radcortez commented 11 months ago

Gsmet, The startup time decreased with 3.4.1 version around 100 ms . Sorry for the late reply. We have seen some improvements with the newer version. …

Good to hear; thanks for lettings us know.

Some additional improvements are coming in https://github.com/quarkusio/quarkus/pull/35322

radcortez commented 6 months ago

I believe this can be closed due to the improvements done in:

quarkusio / quarkus

3.2.4.Final is slower compared to 2.15.1.Final #35392

Description

Implementation ideas