viniciusxyz opened this issue 1 year ago
A performance drop this significant certainly deserves to be filed as an issue, so you've already done the right thing. Thank you, @viniciusxyz! :smiley:
Can you reproduce this without using Micronaut, for example by running `jwtService.getFakeToken()` in a simple loop with 10_000 iterations?
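Something along these lines should be enough; a minimal sketch, assuming a `JwtService` class exposing the `getFakeToken()` call mentioned above:

```java
// Minimal standalone reproducer sketch (assumes a JwtService class wrapping the
// same jjwt RS256 signing call the application uses; names are illustrative).
public class TokenLoopBenchmark {
    public static void main(String[] args) {
        JwtService jwtService = new JwtService();
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            jwtService.getFakeToken(); // sign one token per iteration
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("10_000 tokens signed in " + elapsedMs + " ms");
    }
}
```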
In terms of how to "improve the situation", I'd suggest you try GraalVM EE and enable PGO. It'd be interesting to see how much this can improve performance in your case.
@fniephaus @farquet I added the test without using Micronaut to the issue description. In summary, we have a scenario where, using the JIT, a loop with 10_000 iterations takes about 8500 milliseconds, while with native compilation the same execution takes about 41000 milliseconds.
In the native version I used the Native Build Tools in the same version and with the same settings that we use in production. Some of the settings are pointless for this test (like enabling http/https), but I preferred to leave everything the same so that you can analyze it completely.
About the use of PGO: is there any way to use GraalVM Enterprise for free for this test?
I performed two more tests:
Java 19: execution times decreased overall, but the disparity remained, with the JIT version running in around 7 seconds and the native version in around 37 seconds.
JWT library changed from jjwt to fusion: the times remained practically the same as the executions described in the issue; I ran these tests to rule out a problem directly related to that library.
These two test scenarios were built on top of the non-Micronaut version and are on branches with corresponding names. I hope this helps at least a little.
> About the use of PGO: is there any way to use GraalVM Enterprise for free for this test?
@viniciusxyz It's totally fine to use EE for testing purposes like this. So I encourage you to try and report numbers as suggested by Fabio. That could help us pinpoint if the problem is located in the CE compiler or if it's a more general problem of turning the program into a native image.
@viniciusxyz You could also try compiling the native image with full debug info and then running it under perf to measure where the CPU time is going. On its own, the perf report might provide some insight into which parts of the program are taking a long time to execute. Comparing it with a perf report from running the same program on OpenJDK will not be an apples-to-apples comparison, but it might well help to identify the cause of the disparity you are seeing.
@farquet Tests done. Sorry, I didn't know it was possible to use EE for free for testing.
Results:
GraalVM EE JIT
RUN 1 - 7074 ms
RUN 2 - 7127 ms
RUN 3 - 7112 ms

GraalVM EE Native without PGO
RUN 1 - 31489 ms
RUN 2 - 31790 ms
RUN 3 - 30703 ms

GraalVM EE Native with PGO
RUN 1 - 26119 ms
RUN 2 - 26496 ms
RUN 3 - 26486 ms
As we can see, using PGO improves execution time, but the times are still far from those obtained with the JIT.
@adinn Could you tell me where I can find documentation that explains how to do this? I'm fully available to help identify the cause of the slowdown, but unfortunately I'm still a newbie when it comes to this kind of deeper analysis with GraalVM.
How to build a full debug image is described in the DebugInfo.md reference manual for native image:
https://github.com/oracle/graal/blob/master/docs/reference-manual/native-image/DebugInfo.md
If you build with the latest Graal and the recommended options, then perf should be able to show you time spent in top-level and inlined methods, identified by their full Java names, and also give you a source-code-level view of the time spent at each line.
A perf newbie? Try `man perf` or read about it on the kernel perf wiki:
https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record
https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report
https://perf.wiki.kernel.org/index.php/Tutorial#Source_level_analysis_with_perf_annotate
@adinn Here are the results, if you need anything else or if I've done something wrong please let me know.
Perf record result:
Perf annotate result (summary):
Perf data file:
Build Args:
As I said earlier, I kept some settings that may not make sense for this scenario; I left them like this to match my application in production.
<buildArgs>
--no-fallback
--no-server
--enable-http
--enable-https
--enable-all-security-services
--install-exit-handlers
-Dfile.encoding=UTF8
-H:+ReportExceptionStackTraces
-H:DeadlockWatchdogInterval=0
--verbose
-g
-H:+SourceLevelDebug
-H:-DeleteLocalSymbols
-H:+AddAllCharsets
-H:+PrintClassInitialization
-H:+PrintAnalysisCallTree
-H:GenerateDebugInfo=2
-H:IncludeResources=.*.(pem)
</buildArgs>
Hi @viniciusxyz. Thanks for posting the perf output. It looks like your test code is spending most of its time formatting decimal numbers. That sounds odd to me. I don't know if this is happening in the library code that does the signing, but on the face of it I would not expect a signature routine to have to generate decimal print strings.
I suggest you investigate the app on the JVM using a debugger to see where the app calls DecimalFormat.format. This may indicate where the problem lies.
I also suggest you use perf to profile your test app on the JVM to see where it spends most of its execution time. If the JVM does not appear to be spending most of its time doing decimal formatting, then that might indicate that GraalVM is compiling the decimal format routines very inefficiently. If that is the bottleneck for both GraalVM and the JVM, then it may just be that the compiler is doing a bad job on some specific DecimalFormat methods that happen to be called a lot. You should be able to work that out by looking at how much more time is spent in a given method on Graal compared to the JVM.
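If a debugger session is inconvenient, a crude sampling hack on the JVM can serve the same purpose; this is only an illustrative sketch, not part of the suggestion above:

```java
// A crude JVM-side alternative to a debugger breakpoint (illustrative sketch):
// call startDecimalFormatSampler(Thread.currentThread()) before the benchmark
// loop; it samples that thread's stack every 10 ms and prints any trace that
// passes through DecimalFormat, showing where format() is being called from.
static void startDecimalFormatSampler(Thread target) {
    Thread sampler = new Thread(() -> {
        while (true) {
            StackTraceElement[] frames = target.getStackTrace();
            for (StackTraceElement frame : frames) {
                if (frame.getClassName().endsWith("DecimalFormat")) {
                    for (StackTraceElement f : frames) {
                        System.out.println("  at " + f);
                    }
                    break;
                }
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                return;
            }
        }
    });
    sampler.setDaemon(true);
    sampler.start();
}
```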
@adinn I apologize, I made some errors in data collection; here is the correct data:
Perf record result:
Perf annotate result (summary):
Perf data file:
@adinn I thought it better to separate the results to make them easier to read. Here is perf run over the jar; to be able to start the application and collect the data, I added an input Scanner so I could press Enter to start the test, so if you find any calls related to that, please disregard them.
Perf record result (summary):
Perf annotate result (summary):
Perf data file:
`BigInteger.implMulAdd` is an `@IntrinsicCandidate`, and the HotSpot VM has an optimized hand-written assembly intrinsic for it. When running the GraalVM JIT on HotSpot, we use this intrinsic. When building a native image, we compile the plain Java version, and this gives less optimized code than the intrinsic.
We could probably be smarter about matching the "add with carry" pattern, but I think the real solution here is to port the intrinsic for use in native images as well. @mur47x111 do you want to tackle this?
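For illustration, here is a simplified sketch of the kind of carry-propagating multiply-accumulate loop that the plain Java `implMulAdd` performs (not the exact OpenJDK source); it is this word-by-word carry chain that the HotSpot stub implements with hand-written assembly:

```java
// Simplified sketch of a mulAdd-style loop (illustrative, not the OpenJDK code):
// multiply the magnitude array `in` by the single 32-bit word `k` and add the
// result into `out`, propagating a carry from word to word.
static int mulAddSketch(int[] out, int[] in, int offset, int len, int k) {
    final long LONG_MASK = 0xffffffffL;
    long kLong = k & LONG_MASK;
    long carry = 0;
    offset = out.length - offset - 1;
    for (int j = len - 1; j >= 0; j--) {
        long product = (in[j] & LONG_MASK) * kLong
                     + (out[offset] & LONG_MASK) + carry;
        out[offset--] = (int) product;
        carry = product >>> 32; // the "add with carry" pattern mentioned above
    }
    return (int) carry;
}
```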
@gergo- You are correct that HotSpot provides intrinsics to improve the performance of the BigInteger routines. It would be very good to do the same in GraalVM. However, I think implementing them will only take you some of the way towards HotSpot performance for this test. HotSpot also has intrinsics for the Montgomery multiply routines, and those routines seem to be getting called here.
I will take care of porting all relevant intrinsics. You may use `-H:+WarnMissingIntrinsic` to list all methods that are marked with `@IntrinsicCandidate` but do not get intrinsified in SVM. Note that some of the reported methods are false positives, because we may not have them in HotSpot on the current architecture.
@mur47x111 If a test version is released I am available to perform the tests described in the issue again.
@viniciusxyz could you please test whether the newest GraalVM release addresses this issue? I have ported the implMultiplyToLen, implMulAdd, and implSquareToLen stubs.
I will perform the new tests tomorrow, sorry for the delay.
@mur47x111 The new tests follow.
There were considerable improvements, reducing the execution time to less than half, but there may still be room to improve, since even with PGO the time is about 2x greater than with the JIT.
JIT
Result:
RUN 1 - 6993ms
RUN 2 - 6945ms
RUN 3 - 6938ms
Native SerialGC
Result:
RUN 1 - 18214ms
RUN 2 - 18318ms
RUN 3 - 18288ms
Native PGO SerialGC
RUN 1 - 18068ms
RUN 2 - 17991ms
RUN 3 - 18098ms
Native PGO G1GC
RUN 1 - 18290ms
RUN 2 - 18119ms
RUN 3 - 18186ms
Perf record
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 74K of event 'cycles'
# Event count (approx.): 97891020906
#
# Overhead Command Shared Object Symbol
# ........ ............... ................... .............................................
#
10.81% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051edd5
9.20% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051edce
8.75% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051eda6
8.40% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051ed9f
7.67% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051edc7
7.19% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051ed86
4.57% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051f129
3.05% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x00000000008d9663
3.02% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051f130
2.75% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051f116
1.74% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x00000000008d9619
1.63% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051ed8f
1.44% fake-oauth-jwt- fake-oauth-jwt-loop [.] 0x000000000051eddc
Perf annotate
Percent | Source code & Disassembly of fake-oauth-jwt-loop for cycles (78 samples, percent: local period)
---------------------------------------------------------------------------------------------------------------
:
:
:
: 3 Disassembly of section .text:
:
: 5 0000000000108000 <__svm_code_section>:
: 6 __svm_code_section():
0.00 : 108000: lea -0x48(%rsp),%rbx
0.00 : 108005: cmp 0x8(%r15),%rbx
0.00 : 108009: jbe 418100 <graal_vm_locator_symbol+0x49ba0>
0.00 : 10800f: mov %rbx,%rsp
0.00 : 108012: cmp %r14,%rsi
0.00 : 108015: je 108347 <__svm_code_section+0x347>
0.00 : 10801b: mov %rdi,0x38(%rsp)
Perf data:
Thanks for the data! I will take a look.
Hi @mur47x111
I was wondering if further optimizations are being worked on, since this issue is still open.
We are currently looking into the potential benefits of using virtual threads in a native image application, and we noticed that the JIT-based version of our application performs substantially better. The application is also creating signatures with underlying BigInteger operations.
This is using GraalVM 23.1, JDK 21.0.2+13-jvmci-23.1-b30.
@mur47x111 To reinforce this: I am still very interested in these optimizations as well. Although performance has already improved, there is still a lot of headroom, since the example I reported still performs far worse than with the JIT.
Describe the issue
During the conversion of an OAuth2 server to native compilation, we saw a very significant reduction in throughput, which I find strange since other converted applications do not show this discrepancy. I believe it may be related to signing the token with an RSA certificate. Is there any way to improve the performance of this type of workload using the tools available in GraalVM Community Edition?
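For reference, a minimal sketch of the kind of signing operation at the heart of this workload; the JWT library performs the equivalent internally when producing an RS256 token, and the method name here is illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.security.PrivateKey;
import java.security.Signature;

// Minimal sketch of the per-token work: an RSASSA-PKCS1-v1_5 signature with
// SHA-256 (RS256), which is dominated by BigInteger modular exponentiation.
static byte[] signRs256(PrivateKey rsaPrivateKey, String signingInput) throws Exception {
    Signature signature = Signature.getInstance("SHA256withRSA");
    signature.initSign(rsaPrivateKey);
    signature.update(signingInput.getBytes(StandardCharsets.UTF_8));
    return signature.sign();
}
```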
Comparative results Micronaut
The application issues a JWT signed with an RS256 certificate via a GET to /oauth/token.
Results:
JIT Without limitation
docker run -p 8080:8080 -e CERT_LOCATION=/cert/key1.pem fake-oauth-jit
Native Without limitation
docker run -p 8080:8080 -e CERT_LOCATION=/cert/key1.pem fake-oauth-native
Production limit
JIT - Docker 300 millicores
docker run -p 8080:8080 -e CERT_LOCATION=/cert/key1.pem --cpus=0.3 fake-oauth-jit
Native - Docker 300 millicores
docker run -p 8080:8080 -e CERT_LOCATION=/cert/key1.pem --cpus=0.3 fake-oauth-native
Comparative results without Micronaut
JIT
Execution:
Result:
Native
Execution:
Result:
Steps to reproduce the issue
Micronaut
Build and run performance test JIT:
Build and run performance test native:
Describe GraalVM and your environment:
Without Micronaut
Build and run performance test JIT:
Build and run performance test native:
Describe GraalVM and your environment:
Considerations
The reason for opening the issue is that performance drops by around 70% in scenarios where we have a CPU limitation. I would like to know if this is the expected behavior for this type of scenario and, if not, how I can improve the situation with the resources available in GraalVM Community Edition.
If I have opened the issue incorrectly or if you need any extra information, please let me know and I will correct it as soon as possible.