oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org

Native compilation is much slower than JIT in certificate-signing scenarios #5697

Open viniciusxyz opened 1 year ago

viniciusxyz commented 1 year ago

Describe the issue

During the conversion of an OAuth2 server to native compilation we saw a very significant reduction in throughput, which I find strange since other converted applications do not show this discrepancy. I believe this may be related to signing the token with an RSA certificate. Is there any way to improve the performance of this type of workload using the tools available in GraalVM Community?
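For context, the hot path here is essentially RS256 (RSA) signing of a JWT. A minimal sketch of that kind of workload, assuming the jjwt library mentioned later in this thread (0.11.x API) and an RSA private key already loaded from the PEM file passed via CERT_LOCATION, might look like this:

import io.jsonwebtoken.Jwts;
import io.jsonwebtoken.SignatureAlgorithm;

import java.security.PrivateKey;
import java.util.Date;

class TokenSigner {
    // rsaPrivateKey is assumed to be loaded elsewhere from the PEM file (CERT_LOCATION)
    static String sign(PrivateKey rsaPrivateKey) {
        return Jwts.builder()
                .setSubject("fake-user")                            // illustrative claims
                .setIssuedAt(new Date())
                .signWith(rsaPrivateKey, SignatureAlgorithm.RS256)  // the RSA signature dominates CPU time
                .compact();
    }
}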

Comparative results Micronaut

The application issues a JWT signed with an RS256 certificate in response to a GET on /oauth/token.

Results:

JIT Without limitation

docker run -p 8080:8080 -e CERT_LOCATION=/cert/key1.pem fake-oauth-jit

Run 1 9221.79/s 
Run 2 9362.92/s
Run 3 9353.73/s

Native Without limitation

docker run -p 8080:8080 -e CERT_LOCATION=/cert/key1.pem fake-oauth-native

Run 1 3590.95/s 
Run 2 3589.23/s
Run 3 3589.47/s

Production limit

JIT - Docker 300 millicores

docker run -p 8080:8080 -e CERT_LOCATION=/cert/key1.pem --cpus=0.3 fake-oauth-jit

Run 1 188.36/s # excluded from the average
Run 2 265.62/s
Run 3 314.46/s

Native - Docker 300 millicores

docker run -p 8080:8080 -e CERT_LOCATION=/cert/key1.pem --cpus=0.3 fake-oauth-native

Run 1 53.66/s 
Run 2 54.69/s
Run 3 54.48/s

Comparative results without Micronaut

JIT

Execution:

mvn clean package
java -jar target/fake-oauth-jwt-loop.jar

Result:

RUN 1 - 8496ms
RUN 2 - 8490ms
RUN 3 - 8500ms

Native

Execution:

mvn clean package -Pnative
./target/fake-oauth-jwt-loop

Result:

RUN 1 - 41606ms
RUN 2 - 41718ms
RUN 3 - 42839ms

Steps to reproduce the issue

Micronaut

Please include both build steps as well as run steps

  1. git clone https://github.com/viniciusxyz/fake-oauth2-server.git
  2. (optional) Install K6

Build and run performance test Jit:

  1. mvn clean package
  2. docker build -t fake-oauth-jit -f dockerfile-jit .
  3. k6 run --vus 100 --duration 60s performance-test/k6-get-token.js

Build and run performance test native:

  1. mvn clean package -Dpackaging=native-image
  2. docker build -t fake-oauth-native -f dockerfile-native .
  3. k6 run --vus 100 --duration 60s performance-test/k6-get-token.js


Without Micronaut

  1. git clone https://github.com/viniciusxyz/fake-oauth-jwt-loop

Build and run performance test Jit:

  1. mvn clean package
  2. java -jar target/fake-oauth-jwt-loop.jar

Build and run performance test native:

  1. mvn clean package -Pnative
  2. ./target/fake-oauth-jwt-loop


Considerations

The reason for opening the issue is that throughput drops by around 70% in scenarios where we have a CPU limit. I would like to know whether this is the expected behavior for this type of scenario and, if not, how I can improve the situation with the resources available in GraalVM Community.

If I have opened the issue incorrectly or if you need any extra information, please let me know and I will correct it as soon as possible.

fniephaus commented 1 year ago

Performance drops this significant certainly deserve to be filed as an issue, so you've already done the right thing. Thank you, @viniciusxyz! 😄

Can you reproduce this without using Micronaut, for example by running jwtService.getFakeToken() in a simple loop with 10_000 iterations?
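For reference, a minimal sketch of such a loop (the JwtService class name and no-arg constructor are assumptions; getFakeToken() is the method from the linked repository):

public class JwtLoopBenchmark {
    public static void main(String[] args) {
        JwtService jwtService = new JwtService(); // assumed construction; see the linked repository
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10_000; i++) {
            jwtService.getFakeToken();            // one RS256-signed token per iteration
        }
        System.out.println("Elapsed: " + (System.currentTimeMillis() - start) + "ms");
    }
}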

In terms of how to "improve the situation", I'd suggest you try GraalVM EE and enable PGO. It'd be interesting to see how much this can improve performance in your case.
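For reference, the usual PGO workflow with native-image is roughly the following (flag names as documented for GraalVM EE at the time; details may vary by version):

  1. Build an instrumented image by adding --pgo-instrument to the native-image build arguments.
  2. Run the instrumented binary against a representative workload; it writes a default.iprof profile on exit.
  3. Rebuild with --pgo=default.iprof to produce the optimized image.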

viniciusxyz commented 1 year ago

@fniephaus @farquet I added the test without using Micronaut to the issue description. In summary, with JIT a loop with 10_000 iterations takes about 8500 milliseconds, while with native compilation the same execution takes about 41000 milliseconds.

In the native version I used the Native Build Tools in the same version and with the same settings that we use in production. Some of those settings are useless for this test, such as enabling http/https, but I preferred to leave them unchanged so that you can analyze the setup completely.

About the use of PGO, is there any way to use GraalVM Enterprise for free for this test?

viniciusxyz commented 1 year ago

I performed two more tests:

Java 19: execution times decreased overall, but the disparity remained, with the JIT version running in around 7 seconds and the native version in around 37 seconds.

JWT library changed from jjwt to fusion: the times remained practically the same as the runs described in the issue. I ran these tests to confirm that there was no problem directly related to this library.

These two test scenarios were built on top of the non-Micronaut version and live on branches with the respective names. I hope this helps at least a little.

farquet commented 1 year ago

About the use of PGO, is there any way to use GraalVM Enterprise for free for this test?

@viniciusxyz It's totally fine to use EE for testing purposes like this. So I encourage you to try and report numbers as suggested by Fabio. That could help us pinpoint if the problem is located in the CE compiler or if it's a more general problem of turning the program into a native image.

adinn commented 1 year ago

@viniciusxyz You could also try compiling the native image with full debug info and then running it under perf to measure where the CPU time is going. Taken on its own, the perf report might provide some insight into which parts of the program are taking a long time to execute. A perf report derived from running the same program on OpenJDK will not give an apples-to-apples comparison; however, it might well help to identify the cause of the disparity you are seeing.

viniciusxyz commented 1 year ago

@farquet Tests done. Sorry, I didn't realize EE could be used for free for testing.

Results:

GraalVM EE JIT

RUN 1 - 7074ms
RUN 2 - 7127ms
RUN 3 - 7112ms

GraalVM EE Native without PGO

RUN 1 - 31489ms
RUN 2 - 31790ms
RUN 3 - 30703ms

GraalVM EE Native with PGO

RUN 1 - 26119ms
RUN 2 - 26496ms
RUN 3 - 26486ms

As we can see, using PGO brings improvements in execution time, but the times are still far from those achieved with JIT.

@adinn Could you tell me where I can find documentation that explains how to do this? I'm fully available to help identify the cause of the slowdown, but unfortunately I'm still a newbie when it comes to this kind of deeper analysis with GraalVM.

adinn commented 1 year ago

How to build a full debug image is described in the DebugInfo.md reference manual for native image:

https://github.com/oracle/graal/blob/master/docs/reference-manual/native-image/DebugInfo.md

If you build with the latest Graal and the recommended options, perf should be able to show you the time spent in top-level and inlined methods, identified by their full Java names, and also give you a source-code-level view of the time spent at each line.

A perf newbie? Try man perf or read about it on the kernel perf wiki

https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record
https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report
https://perf.wiki.kernel.org/index.php/Tutorial#Source_level_analysis_with_perf_annotate
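Putting the above together, a rough sketch of that workflow (the debug-related flags mirror the buildArgs posted later in this thread; exact options may vary by GraalVM version):

  1. Build the native image with debug info, e.g. by adding -g, -H:+SourceLevelDebug and -H:-DeleteLocalSymbols to the build arguments.
  2. perf record ./target/fake-oauth-jwt-loop
  3. perf report
  4. perf annotate
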
viniciusxyz commented 1 year ago

@adinn Here are the results; if you need anything else or if I've done something wrong, please let me know.

Perf record result:

image

Perf annotate result (summary):

image

Perf data file:

perf.zip

Build Args:

As I said earlier, I kept some settings that may not make sense for this scenario; I left them like this to match my application in production.

<buildArgs>
    --no-fallback
    --no-server
    --enable-http
    --enable-https
    --enable-all-security-services
    --install-exit-handlers
    -Dfile.encoding=UTF8
    -H:+ReportExceptionStackTraces
    -H:DeadlockWatchdogInterval=0
    --verbose
    -g
    -H:+SourceLevelDebug
    -H:-DeleteLocalSymbols
    -H:+AddAllCharsets
    -H:+PrintClassInitialization
    -H:+PrintAnalysisCallTree
    -H:GenerateDebugInfo=2
    -H:IncludeResources=.*.(pem)
</buildArgs>
adinn commented 1 year ago

Hi @viniciusxyz. Thanks for posting the perf output. It looks like your test code is spending most of its time formatting decimal numbers. That sounds odd to me. I don't know whether this is happening in the library code that does the signing, but on the face of it I would not expect a signature routine to have to generate decimal print strings.

I suggest you investigate the app on the JVM using a debugger to see where the app calls DecimalFormat.format. This may indicate where the problem lies.

I also suggest you use perf to profile your test app on the JVM to see where it spends most of its execution time. If the JVM does not appear to be spending most of its time doing decimal formatting, that might indicate that GraalVM is compiling the decimal-formatting routines very inefficiently. If that is the bottleneck for both GraalVM and the JVM, it may just be that the compiler is doing a bad job on some specific decimal-formatting methods that happen to be called a lot. You should be able to work that out by looking at how much more time is spent in a given method on Graal compared to the JVM.

viniciusxyz commented 1 year ago

@adinn I apologize; I made some errors in data collection. Here is the corrected data:

Perf record result:

image

Perf annotate result (summary):

image

Perf data file:

perf.zip

viniciusxyz commented 1 year ago

@adinn I thought it better to separate the results to make them easier to visualize; here is perf run over the JAR. To be able to run the application and collect the data, I added a stdin Scanner so I could press Enter to start the test, so if you find any calls related to that, please disregard them.

Perf record result (summary):

image

Perf annotate result (summary):

image

Perf data file:

perf.zip

gergo- commented 1 year ago

BigInteger.implMulAdd is an @IntrinsicCandidate, and the HotSpot VM has an optimized hand-written assembly intrinsic for it. When running the GraalVM JIT on HotSpot, we use this intrinsic. When building a native image, we compile the plain Java version, and this gives less optimized code than the intrinsic. We could probably be smarter about matching the "add with carry" pattern, but I think the real solution here is to port the intrinsic for use in native images as well. @mur47x111 do you want to tackle this?
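For illustration, the plain Java fallback is essentially a carry-propagating multiply-accumulate loop of the following shape (a rough sketch, not the actual JDK source):

static int mulAdd(int[] out, int[] in, int offset, int len, int k) {
    // Computes out += in * k over 32-bit "digits" stored big-endian,
    // tracking the carry in a long; HotSpot replaces this whole loop
    // with a hand-written add-with-carry assembly stub.
    final long kLong = k & 0xFFFFFFFFL;   // treat k as unsigned
    long carry = 0;
    int i = out.length - offset - 1;      // least significant digit of the target window
    for (int j = len - 1; j >= 0; j--) {
        long product = (in[j] & 0xFFFFFFFFL) * kLong
                     + (out[i] & 0xFFFFFFFFL) + carry;
        out[i--] = (int) product;         // low 32 bits stay in place
        carry = product >>> 32;           // high 32 bits carry into the next digit
    }
    return (int) carry;
}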

adinn commented 1 year ago

@gergo- You are correct that HotSpot provides intrinsics to improve the performance of the BigInteger routines. It would be very good to do the same in GraalVM. However, I think implementing them will only take you some of the way towards the HotSpot performance for this test. HotSpot also has intrinsics for the Montgomery multiply routines, and those routines seem to be getting called here.

mur47x111 commented 1 year ago

I will take care of porting all relevant intrinsics. You may use -H:+WarnMissingIntrinsic to list all methods that are marked with @IntrinsicCandidate but do not get intrinsified in SVM. Note that some of the reported methods are false positives, because we may not have them in HotSpot on the current architecture.
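An illustrative invocation (the flag can equally be added to the Maven buildArgs shown earlier in this thread):

native-image -H:+WarnMissingIntrinsic -jar target/fake-oauth-jwt-loop.jar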

viniciusxyz commented 1 year ago

@mur47x111 If a test version is released, I am available to perform the tests described in the issue again.

mur47x111 commented 1 year ago

@viniciusxyz could you please test whether the newest GraalVM release addresses this issue? I have ported the implMultiplyToLen, implMulAdd, and implSquareToLen stubs.

viniciusxyz commented 1 year ago

@viniciusxyz could you please test whether the newest GraalVM release addresses this issue? I have ported the implMultiplyToLen, implMulAdd, and implSquareToLen stubs.

I will perform the new tests tomorrow, sorry for the delay.

viniciusxyz commented 1 year ago

@mur47x111 The new tests follow.

There were considerable improvements, reducing the execution time to less than half, but there may still be room to improve, since even with PGO the time is about 2x that of the JIT.

GraalVM 17.0.7+8.1

JIT

Result:

RUN 1 - 6993ms
RUN 2 - 6945ms
RUN 3 - 6938ms

Native SerialGC

Result:

RUN 1 - 18214ms
RUN 2 - 18318ms
RUN 3 - 18288ms

Native PGO SerialGC

RUN 1 - 18068ms
RUN 2 - 17991ms
RUN 3 - 18098ms

Native PGO G1GC

RUN 1 - 18290ms
RUN 2 - 18119ms
RUN 3 - 18186ms

Perf record

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 74K of event 'cycles'
# Event count (approx.): 97891020906
#
# Overhead  Command          Shared Object        Symbol
# ........  ...............  ...................  .............................................
#
    10.81%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051edd5
     9.20%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051edce
     8.75%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051eda6
     8.40%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051ed9f
     7.67%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051edc7
     7.19%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051ed86
     4.57%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051f129
     3.05%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x00000000008d9663
     3.02%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051f130
     2.75%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051f116
     1.74%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x00000000008d9619
     1.63%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051ed8f
     1.44%  fake-oauth-jwt-  fake-oauth-jwt-loop  [.] 0x000000000051eddc

Perf annotate

 Percent |      Source code & Disassembly of fake-oauth-jwt-loop for cycles (78 samples, percent: local period)
---------------------------------------------------------------------------------------------------------------
         :
         :
         :
         : 3      Disassembly of section .text:
         :
         : 5      0000000000108000 <__svm_code_section>:
         : 6      __svm_code_section():
    0.00 :   108000: lea    -0x48(%rsp),%rbx
    0.00 :   108005: cmp    0x8(%r15),%rbx
    0.00 :   108009: jbe    418100 <graal_vm_locator_symbol+0x49ba0>
    0.00 :   10800f: mov    %rbx,%rsp
    0.00 :   108012: cmp    %r14,%rsi
    0.00 :   108015: je     108347 <__svm_code_section+0x347>
    0.00 :   10801b: mov    %rdi,0x38(%rsp)

Perf data:

perf.zip

mur47x111 commented 1 year ago

Thanks for the data! I will take a look.

sys-jdi commented 7 months ago

Hi @mur47x111

I was wondering whether further optimizations are being made, since this issue is still open.

We are currently looking into the potential benefits of using Virtual Threads in a native-image application, and we noticed that the JIT-based version of our application performs substantially better. The application also creates signatures with underlying BigInteger operations.

This is using GraalVM 23.1, JDK 21.0.2+13-jvmci-23.1-b30.

viniciusxyz commented 7 months ago

@mur47x111 To reinforce: I am still very interested in these optimizations too. Although performance has already improved, there is still a lot of headroom, since the performance of the example I reported is still far from the performance with JIT.