Closed plokhotnyuk closed 5 years ago
Screenshot of the reverse flame graph that was built using GraalVM EE:
The same graph with JDK 8:
@thomaswue @axel22 I see that the regression is still the same with rc2...
Should I attach ASM listings of the hottest code using JMH's perfasm profiler? Would that help to move things forward?
Thanks for testing again. @axel22 What is your current assessment?
I am currently looking at the inliner correctness for Stream programs like the following, and trying to ensure that we get loop-equivalent performance:
return Arrays.stream(values)
    .map(x -> x + 1)
    .map(x -> x * 2)
    .map(x -> x + 5)
    .reduce(0, Double::sum);
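For illustration, here is a minimal sketch of the loop that the pipeline above should ideally be reduced to (assumptions: `values` is a `double[]`; this is hand-written code, not Graal's actual output):

```java
import java.util.Arrays;

public class LoopEquivalent {
    // The stream pipeline from above, for reference.
    static double streamVersion(double[] values) {
        return Arrays.stream(values)
                .map(x -> x + 1)
                .map(x -> x * 2)
                .map(x -> x + 5)
                .reduce(0, Double::sum);
    }

    // The loop-equivalent form: all three maps fused into one expression,
    // with no pipeline overhead left.
    static double loopVersion(double[] values) {
        double acc = 0;
        for (double x : values) {
            acc += (x + 1) * 2 + 5;
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0};
        System.out.println(streamVersion(values)); // prints 33.0
        System.out.println(loopVersion(values));   // prints 33.0
    }
}
```

Getting these two forms to compile to equivalent machine code is what the inliner work mentioned above aims at.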
I am also trying to make progress on creating a new economy compilation tier for Truffle, which should ensure faster warmups.
Thus, I have so far been unable to make progress on this.
OK. It would be good to just check whether we can reproduce this and assess what type of issue it is.
I doubt that our performance problem lies in how we compile that switch statement, but you might revisit the design of that code if you think it matters to your overall performance. Hashing the whole string first and then switching on that value seems expensive to me, particularly since you still have to check the string for equality after the match. Nested switches based on common prefixes seem like a more performant design. Given the expense of creating the actual tree of JSON objects, it seems likely this switch doesn't really show up in the final performance.
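A minimal sketch of the two strategies being contrasted here (the field names `id` and `name` and the return indices are made up for illustration):

```java
public class FieldMatch {
    // Strategy 1: hash the whole key, switch on the hash value (a lookupswitch),
    // then still confirm the match with a full equality check.
    static int byHash(String key) {
        switch (key.hashCode()) {
            case 3355:      // "id".hashCode()
                if (key.equals("id")) return 0;
                break;
            case 3373707:   // "name".hashCode()
                if (key.equals("name")) return 1;
                break;
        }
        return -1;
    }

    // Strategy 2: dispatch on a common prefix (here just the first character),
    // so no full-string hash is computed before the comparison.
    static int byPrefix(String key) {
        if (key.isEmpty()) return -1;
        switch (key.charAt(0)) {
            case 'i': return key.equals("id") ? 0 : -1;
            case 'n': return key.equals("name") ? 1 : -1;
            default:  return -1;
        }
    }
}
```

The prefix-based version touches each character at most twice (dispatch plus comparison), while the hash-based one must scan the whole string once for the hash and again for the equality check.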
@thomaswue it is reproducible on the current master of the benchmark repo with the latest GraalVM releases. Just clone the repo and run the following commands for the different JVMs:
sbt -java-home /usr/lib/jvm/graalvm-ce-19.0.0 -no-colors 'jsoniter-scala-benchmark/jmh:run -prof perfnorm -f 3 TwitterAPIReading.jsoniterScala'
sbt -java-home /usr/lib/jvm/graalvm-ee-19.0.0 -no-colors 'jsoniter-scala-benchmark/jmh:run -prof perfnorm -f 3 TwitterAPIReading.jsoniterScala'
sbt -java-home /usr/lib/jvm/jdk1.8.0 -no-colors 'jsoniter-scala-benchmark/jmh:run -prof perfnorm -f 3 TwitterAPIReading.jsoniterScala'
Below are the results of running those commands, which show large differences in the numbers of instructions, branches, loads/stores, and cache misses:
[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 23636.236 ± 676.064 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.360 ± 0.025 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 726.532 ± 242.999 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 89199.324 ± 1153.728 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 51248.477 ± 1173.121 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 4973.498 ± 1634.187 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 2.527 ± 12.731 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 20.047 ± 19.666 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 5.301 ± 10.900 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 60.140 ± 79.107 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 209.127 ± 198.793 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 115353.011 ± 6838.866 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 157823.488 ± 10966.552 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 1.543 ± 7.014 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 88534.400 ± 4483.634 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.134 ± 0.356 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 50365.229 ± 2203.260 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.386 ± 4.072 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 3.346 ± 13.915 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 437816.905 ± 35249.052 #/op
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 30575.944 ± 638.000 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.331 ± 0.064 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 555.095 ± 481.524 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 64733.848 ± 3155.255 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 28164.718 ± 2249.074 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 5581.027 ± 1968.013 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 2.596 ± 17.303 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 13.961 ± 21.911 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 3.512 ± 5.745 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 48.743 ± 37.646 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 125.754 ± 145.185 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 97468.685 ± 20423.876 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 122485.340 ± 16421.726 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 0.997 ± 1.989 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 64809.808 ± 7910.100 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.119 ± 0.230 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 28048.143 ± 3689.013 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.325 ± 2.958 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 4.352 ± 31.293 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 370403.086 ± 65471.572 #/op
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 41052.493 ± 1454.473 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.270 ± 0.038 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 543.691 ± 244.394 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 48992.868 ± 4904.049 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 21076.899 ± 2867.058 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 3879.973 ± 754.438 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 1.045 ± 2.341 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 8.944 ± 35.559 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 4.180 ± 25.851 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 49.835 ± 20.710 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 37.334 ± 29.845 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 64094.012 ± 16302.539 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 90018.373 ± 31746.825 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 0.927 ± 7.876 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 49225.079 ± 12999.812 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.381 ± 8.090 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 20886.815 ± 5977.188 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.335 ± 1.676 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 2.378 ± 9.823 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 333389.098 ± 102994.852 #/op
@thomaswue thanks for your patience and support!
With this change in the code generator I have managed to partially recover the lost CPU cycles: up to ~40% with CE and up to ~10% with EE.
The main idea of the change is to improve code layout by reducing the number of long jumps.
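The layout idea can be sketched like this (hypothetical Java, not the actual generated parser code): keep the hot path as straight-line fall-through and push rarely-taken work behind a single forward branch.

```java
// Hypothetical sketch of the layout idea behind the code-generator change:
// the hot path falls through in a straight line, while the rarely-taken
// error path sits behind one forward branch in a separate method, keeping
// long jumps out of the common case.
public class DigitParser {
    static int parseDigit(byte b) {
        if (b >= '0' && b <= '9') return b - '0'; // hot path: straight-line fall-through
        return error(b);                          // cold path: one forward jump
    }

    private static int error(byte b) {
        throw new IllegalArgumentException("expected digit but got: " + (char) b);
    }

    public static void main(String[] args) {
        System.out.println(parseDigit((byte) '7')); // prints 7
    }
}
```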
Please see current results:
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 34804.836 ± 827.140 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.267 ± 0.034 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 607.383 ± 922.230 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 60188.255 ± 2906.116 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 22515.241 ± 1953.701 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 1602.135 ± 817.220 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 4.075 ± 18.475 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 11.939 ± 48.896 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 5.922 ± 16.304 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 57.829 ± 26.008 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 205.188 ± 348.356 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 115067.959 ± 27076.913 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 105613.650 ± 6907.248 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 1.492 ± 6.220 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 60039.994 ± 10648.639 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.944 ± 14.335 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 22311.283 ± 5922.519 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.163 ± 0.120 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 1.396 ± 1.933 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 395353.388 ± 72987.846 #/op
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 33747.274 ± 587.704 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.313 ± 0.077 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 547.524 ± 301.180 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 54986.785 ± 4654.468 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 19000.656 ± 1341.283 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 4603.189 ± 1387.981 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 4.216 ± 11.939 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 14.882 ± 13.446 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 3.847 ± 3.793 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 55.342 ± 112.263 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 132.921 ± 230.082 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 98405.702 ± 15550.655 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 111934.386 ± 20481.795 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 0.902 ± 5.479 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 55335.472 ± 8220.870 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.093 ± 0.770 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 19010.277 ± 4551.581 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.307 ± 2.794 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 1.788 ± 6.178 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 357542.971 ± 33573.818 #/op
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 40842.080 ± 1153.936 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.273 ± 0.007 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 554.586 ± 253.163 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 50458.899 ± 8483.214 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 22740.528 ± 6493.480 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 4425.418 ± 2185.880 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 3.449 ± 5.512 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 13.108 ± 65.152 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 3.426 ± 4.240 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 48.663 ± 36.453 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 35.798 ± 50.728 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 63215.940 ± 11618.436 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 92350.528 ± 10677.236 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 0.788 ± 1.530 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 50746.487 ± 5974.060 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.694 ± 17.888 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 22736.890 ± 6076.622 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.366 ± 0.724 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 2.005 ± 11.904 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 338235.195 ± 33053.933 #/op
On other benchmarks this change gives more than a 2.5x speed-up. Please see the results for reddit-scala.json,
where more than 90 fields are parsed in a hot loop.
Also, I found an inspiring post about code layout optimizations: https://easyperf.net/blog/2019/03/27/Machine-code-layout-optimizatoins
Does GraalVM use some of these tricks?
Another post, on the limits of Intel and AMD CPUs: https://travisdowns.github.io/blog/2019/06/11/speed-limits.html
Are they taken into account by GraalVM?
We do perform code layout optimizations within one method. Specifically, we use block frequency information to lay out the blocks and also align block starts. We don't do interprocedural optimizations for that.
We don't do lower-level optimizations for individual instruction scheduling. This is an interesting blog post. Thanks for sharing.
@thomaswue Please see how efficiently this small function was compiled by Rustc!
@plokhotnyuk were the above reports and graphs generated by some hand-made programs? Are they available in your git repo or elsewhere?
@neomatrix369 Hi, Mani!
The flame graphs were generated by sbt-jmh using a command like this, which uses the directories of the cloned async-profiler and FlameGraph repos.
The benchmark results were plotted by JMH Visualizer from *.json
files produced by a command like this.
@thomaswue Thank you for your patience and support!
I'm closing this issue because jsoniter-scala, GraalVM, and some other libraries used in the test have evolved a lot since the reporting date.
I have picked the historical results for the test (which wasn't changed) and re-run it in the same environment. The throughput difference with OpenJDK 8 has shrunk from ~70% to ~20%, and the absolute values have increased too.
Some follow-up issues that could further improve parsing in this test: https://github.com/oracle/graal/issues/1647 https://github.com/oracle/graal/issues/1617
Test name: readJsoniterScala
OpenJDK 8             37K ops/s
GraalVM CE 1.0.0-RC1  22K ops/s
Test name: jsoniterScala
OpenJDK 8             47K ops/s
GraalVM CE 19.2.0.1   39K ops/s
OK, thank you!
I have a benchmark which parses JSON data from UTF-8 bytes to a structure of nested classes with 20-40 fields.
The code for that is generated by Scala macros and compiled by scalac into long methods with big lookup switches (one for each field), like here: https://gist.github.com/plokhotnyuk/1e2ee19d3cc80c3644bc9e453c8aae77
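A simplified, hypothetical sketch of the shape of such generated code (the field names are invented, and the real methods are far longer): each parsed key is hashed and dispatched through one big lookupswitch inside a hot loop, with an equality check confirming the match.

```java
import java.util.Map;

// Simplified, hypothetical shape of the generated decoder: each parsed key
// is hashed and dispatched through one big lookupswitch in a hot loop,
// with an equality check confirming the match afterwards.
public class UserDecoder {
    int id;
    String name;
    boolean verified;

    void readFields(Map<String, Object> parsed) {
        for (Map.Entry<String, Object> e : parsed.entrySet()) {
            String key = e.getKey();
            switch (key.hashCode()) {           // compiles to a lookupswitch
                case 3355:                      // "id".hashCode()
                    if (key.equals("id")) { id = (Integer) e.getValue(); continue; }
                    break;
                case 3373707:                   // "name".hashCode()
                    if (key.equals("name")) { name = (String) e.getValue(); continue; }
                    break;
                case -1994383672:               // "verified".hashCode()
                    if (key.equals("verified")) { verified = (Boolean) e.getValue(); continue; }
                    break;
            }
            // unknown key: real generated code would skip or report it here
        }
    }
}
```

With 20-40 fields per class this switch grows large, which is what makes the compiled methods long and the JIT's code layout decisions performance-critical.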
With GraalVM CE/EE it shows a ~2x slowdown compared to running on Oracle JDK 8.
Steps to reproduce:
I have managed to run it with async-profiler through the sbt-jmh extras integration. Please see the flame graph reports for both JVMs, which were committed to the following directory: https://github.com/plokhotnyuk/jsoniter-scala/tree/graalvm-slowdown-on-big-method-2/profile-async