Closed plokhotnyuk closed 5 years ago
Screenshot of the reverse flame graph that was built using GraalVM EE:
The same graph with JDK 8:
@thomaswue @axel22 I see that the regression is still the same with rc2...
Should I attach ASM listings of the hottest code using JMH's perfasm profiler? Would that help to move things forward?
Thanks for testing again. @axel22 What is your current assessment?
I am currently looking at the inliner correctness for Stream programs like the following, and trying to ensure that we get loop-equivalent performance:
return Arrays.stream(values)
    .map(x -> x + 1)
    .map(x -> x * 2)
    .map(x -> x + 5)
    .reduce(0, Double::sum);
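For illustration, here is a minimal sketch of the loop that the pipeline above should ideally be reduced to (assumptions: `values` is a `double[]`; this is hand-written code, not Graal's actual output):

```java
import java.util.Arrays;

public class LoopEquivalent {
    // The stream pipeline from above, for reference.
    static double streamVersion(double[] values) {
        return Arrays.stream(values)
                .map(x -> x + 1)
                .map(x -> x * 2)
                .map(x -> x + 5)
                .reduce(0, Double::sum);
    }

    // The loop-equivalent form: all three maps fused into one expression,
    // with no pipeline overhead left.
    static double loopVersion(double[] values) {
        double acc = 0;
        for (double x : values) {
            acc += (x + 1) * 2 + 5;
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0};
        System.out.println(streamVersion(values)); // prints 33.0
        System.out.println(loopVersion(values));   // prints 33.0
    }
}
```

Getting these two forms to compile to equivalent machine code is what the inliner work mentioned above aims at.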
I am also trying to make progress on creating a new economy compilation tier for Truffle, which should ensure faster warmups.
Thus, I have so far been unable to make progress on this.
OK. It would be good to just check whether we can reproduce this and assess what type of issue it is.
I doubt that our performance problem lies in how we compile that switch statement, but you might revisit the design of that code if you think it matters to your overall performance. Hashing the whole string first and then switching on that value seems expensive to me, particularly since you still have to check the string for equality after the match. Nested switches based on common prefixes seem like a more performant design. Given the expense of creating the actual tree of JSON objects, it seems likely this switch doesn't really show up in the final performance.
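A minimal sketch of the two strategies being contrasted here (the field names `id` and `name` and the return indices are made up for illustration):

```java
public class FieldMatch {
    // Strategy 1: hash the whole key, switch on the hash value (a lookupswitch),
    // then still confirm the match with a full equality check.
    static int byHash(String key) {
        switch (key.hashCode()) {
            case 3355:      // "id".hashCode()
                if (key.equals("id")) return 0;
                break;
            case 3373707:   // "name".hashCode()
                if (key.equals("name")) return 1;
                break;
        }
        return -1;
    }

    // Strategy 2: dispatch on a common prefix (here just the first character),
    // so no full-string hash is computed before the comparison.
    static int byPrefix(String key) {
        if (key.isEmpty()) return -1;
        switch (key.charAt(0)) {
            case 'i': return key.equals("id") ? 0 : -1;
            case 'n': return key.equals("name") ? 1 : -1;
            default:  return -1;
        }
    }
}
```

The prefix-based version touches each character at most twice (dispatch plus comparison), while the hash-based one must scan the whole string once for the hash and again for the equality check.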
@thomaswue it is reproducible on the current master of the benchmark repo with the latest GraalVM releases. Just clone the repo and run the following commands for the different JVMs:
sbt -java-home /usr/lib/jvm/graalvm-ce-19.0.0 -no-colors 'jsoniter-scala-benchmark/jmh:run -prof perfnorm -f 3 TwitterAPIReading.jsoniterScala'
sbt -java-home /usr/lib/jvm/graalvm-ee-19.0.0 -no-colors 'jsoniter-scala-benchmark/jmh:run -prof perfnorm -f 3 TwitterAPIReading.jsoniterScala'
sbt -java-home /usr/lib/jvm/jdk1.8.0 -no-colors 'jsoniter-scala-benchmark/jmh:run -prof perfnorm -f 3 TwitterAPIReading.jsoniterScala'
Below are the results of running those commands, which show large differences in the numbers of instructions, branches, loads/stores, and cache misses:
[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 23636.236 ± 676.064 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.360 ± 0.025 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 726.532 ± 242.999 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 89199.324 ± 1153.728 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 51248.477 ± 1173.121 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 4973.498 ± 1634.187 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 2.527 ± 12.731 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 20.047 ± 19.666 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 5.301 ± 10.900 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 60.140 ± 79.107 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 209.127 ± 198.793 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 115353.011 ± 6838.866 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 157823.488 ± 10966.552 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 1.543 ± 7.014 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 88534.400 ± 4483.634 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.134 ± 0.356 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 50365.229 ± 2203.260 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.386 ± 4.072 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 3.346 ± 13.915 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 437816.905 ± 35249.052 #/op
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 30575.944 ± 638.000 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.331 ± 0.064 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 555.095 ± 481.524 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 64733.848 ± 3155.255 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 28164.718 ± 2249.074 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 5581.027 ± 1968.013 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 2.596 ± 17.303 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 13.961 ± 21.911 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 3.512 ± 5.745 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 48.743 ± 37.646 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 125.754 ± 145.185 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 97468.685 ± 20423.876 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 122485.340 ± 16421.726 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 0.997 ± 1.989 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 64809.808 ± 7910.100 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.119 ± 0.230 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 28048.143 ± 3689.013 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.325 ± 2.958 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 4.352 ± 31.293 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 370403.086 ± 65471.572 #/op
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 41052.493 ± 1454.473 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.270 ± 0.038 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 543.691 ± 244.394 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 48992.868 ± 4904.049 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 21076.899 ± 2867.058 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 3879.973 ± 754.438 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 1.045 ± 2.341 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 8.944 ± 35.559 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 4.180 ± 25.851 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 49.835 ± 20.710 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 37.334 ± 29.845 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 64094.012 ± 16302.539 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 90018.373 ± 31746.825 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 0.927 ± 7.876 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 49225.079 ± 12999.812 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.381 ± 8.090 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 20886.815 ± 5977.188 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.335 ± 1.676 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 2.378 ± 9.823 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 333389.098 ± 102994.852 #/op
@thomaswue thanks for your patience and support!
With this change in the code generator I have managed to partially recover the lost CPU cycles: up to ~40% with CE and up to ~10% with EE.
The main idea of the change is to improve code layout by reducing the number of long jumps.
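The layout idea can be sketched like this (hypothetical Java, not the actual generated parser code): keep the hot path as straight-line fall-through and push rarely-taken work behind a single forward branch.

```java
// Hypothetical sketch of the layout idea behind the code-generator change:
// the hot path falls through in a straight line, while the rarely-taken
// error path sits behind one forward branch in a separate method, keeping
// long jumps out of the common case.
public class DigitParser {
    static int parseDigit(byte b) {
        if (b >= '0' && b <= '9') return b - '0'; // hot path: straight-line fall-through
        return error(b);                          // cold path: one forward jump
    }

    private static int error(byte b) {
        throw new IllegalArgumentException("expected digit but got: " + (char) b);
    }

    public static void main(String[] args) {
        System.out.println(parseDigit((byte) '7')); // prints 7
    }
}
```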
Please see current results:
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 34804.836 ± 827.140 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.267 ± 0.034 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 607.383 ± 922.230 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 60188.255 ± 2906.116 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 22515.241 ± 1953.701 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 1602.135 ± 817.220 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 4.075 ± 18.475 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 11.939 ± 48.896 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 5.922 ± 16.304 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 57.829 ± 26.008 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 205.188 ± 348.356 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 115067.959 ± 27076.913 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 105613.650 ± 6907.248 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 1.492 ± 6.220 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 60039.994 ± 10648.639 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.944 ± 14.335 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 22311.283 ± 5922.519 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.163 ± 0.120 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 1.396 ± 1.933 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 395353.388 ± 72987.846 #/op
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 33747.274 ± 587.704 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.313 ± 0.077 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 547.524 ± 301.180 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 54986.785 ± 4654.468 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 19000.656 ± 1341.283 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 4603.189 ± 1387.981 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 4.216 ± 11.939 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 14.882 ± 13.446 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 3.847 ± 3.793 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 55.342 ± 112.263 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 132.921 ± 230.082 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 98405.702 ± 15550.655 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 111934.386 ± 20481.795 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 0.902 ± 5.479 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 55335.472 ± 8220.870 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.093 ± 0.770 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 19010.277 ± 4551.581 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.307 ± 2.794 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 1.788 ± 6.178 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 357542.971 ± 33573.818 #/op
[info] Benchmark Mode Cnt Score Error Units
[info] TwitterAPIReading.jsoniterScala thrpt 15 40842.080 ± 1153.936 ops/s
[info] TwitterAPIReading.jsoniterScala:CPI thrpt 3 0.273 ± 0.007 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-load-misses thrpt 3 554.586 ± 253.163 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-loads thrpt 3 50458.899 ± 8483.214 #/op
[info] TwitterAPIReading.jsoniterScala:L1-dcache-stores thrpt 3 22740.528 ± 6493.480 #/op
[info] TwitterAPIReading.jsoniterScala:L1-icache-load-misses thrpt 3 4425.418 ± 2185.880 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-load-misses thrpt 3 3.449 ± 5.512 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-loads thrpt 3 13.108 ± 65.152 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-store-misses thrpt 3 3.426 ± 4.240 #/op
[info] TwitterAPIReading.jsoniterScala:LLC-stores thrpt 3 48.663 ± 36.453 #/op
[info] TwitterAPIReading.jsoniterScala:branch-misses thrpt 3 35.798 ± 50.728 #/op
[info] TwitterAPIReading.jsoniterScala:branches thrpt 3 63215.940 ± 11618.436 #/op
[info] TwitterAPIReading.jsoniterScala:cycles thrpt 3 92350.528 ± 10677.236 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-load-misses thrpt 3 0.788 ± 1.530 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-loads thrpt 3 50746.487 ± 5974.060 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-store-misses thrpt 3 0.694 ± 17.888 #/op
[info] TwitterAPIReading.jsoniterScala:dTLB-stores thrpt 3 22736.890 ± 6076.622 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-load-misses thrpt 3 0.366 ± 0.724 #/op
[info] TwitterAPIReading.jsoniterScala:iTLB-loads thrpt 3 2.005 ± 11.904 #/op
[info] TwitterAPIReading.jsoniterScala:instructions thrpt 3 338235.195 ± 33053.933 #/op
On other benchmarks this change gives more than a 2.5x speed-up. Please see the results for reddit-scala.json,
where more than 90 fields are parsed in a hot loop.
Also, I found an inspiring post about code layout optimizations: https://easyperf.net/blog/2019/03/27/Machine-code-layout-optimizatoins
Does GraalVM use some of these tricks?
Another post, on the limits of Intel and AMD CPUs: https://travisdowns.github.io/blog/2019/06/11/speed-limits.html
Are they taken into account by GraalVM?
We do perform code layout optimizations within one method. Specifically, we use block frequency information to lay out the blocks and also align block starts. We don't do interprocedural optimizations for that.
We don't do lower-level optimizations for individual instruction scheduling. This is an interesting blog post. Thanks for sharing.
@thomaswue Please see how efficiently this small function was compiled by Rustc!
@plokhotnyuk were the above reports and graphs generated by some hand-made programs? Are they available in your git repo or elsewhere?
@neomatrix369 Hi, Mani!
The flame graphs were generated by sbt-jmh using a command like this, which uses the directories of the cloned async-profiler and FlameGraph repos.
The benchmark results were plotted by JMH Visualizer from *.json
files produced by a command like this.
@thomaswue Thank you for your patience and support!
I'm closing this issue because jsoniter-scala, GraalVM, and some other libraries used in the test have evolved a lot since the reporting date.
I have picked the historical results for the test (which wasn't changed) and re-run it in the same environment. The throughput difference with OpenJDK 8 has shrunk from ~70% to ~20%, and the absolute values have increased too.
Some follow-up issues that could further improve parsing in this test: https://github.com/oracle/graal/issues/1647 https://github.com/oracle/graal/issues/1617
Test name: readJsoniterScala
OpenJDK 8             37K ops/s
GraalVM CE 1.0.0-RC1  22K ops/s
Test name: jsoniterScala
OpenJDK 8             47K ops/s
GraalVM CE 19.2.0.1   39K ops/s
OK, thank you!
I have a benchmark which parses JSON data from UTF-8 bytes to a structure of nested classes with 20-40 fields.
The code for that is generated by Scala macros and compiled by scalac into long methods with big lookup switches (one for each field), like here: https://gist.github.com/plokhotnyuk/1e2ee19d3cc80c3644bc9e453c8aae77
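A simplified, hypothetical sketch of the shape of such generated code (the field names are invented, and the real methods are far longer): each parsed key is hashed and dispatched through one big lookupswitch inside a hot loop, with an equality check confirming the match.

```java
import java.util.Map;

// Simplified, hypothetical shape of the generated decoder: each parsed key
// is hashed and dispatched through one big lookupswitch in a hot loop,
// with an equality check confirming the match afterwards.
public class UserDecoder {
    int id;
    String name;
    boolean verified;

    void readFields(Map<String, Object> parsed) {
        for (Map.Entry<String, Object> e : parsed.entrySet()) {
            String key = e.getKey();
            switch (key.hashCode()) {           // compiles to a lookupswitch
                case 3355:                      // "id".hashCode()
                    if (key.equals("id")) { id = (Integer) e.getValue(); continue; }
                    break;
                case 3373707:                   // "name".hashCode()
                    if (key.equals("name")) { name = (String) e.getValue(); continue; }
                    break;
                case -1994383672:               // "verified".hashCode()
                    if (key.equals("verified")) { verified = (Boolean) e.getValue(); continue; }
                    break;
            }
            // unknown key: real generated code would skip or report it here
        }
    }
}
```

With 20-40 fields per class this switch grows large, which is what makes the compiled methods long and the JIT's code layout decisions performance-critical.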
With GraalVM CE/EE it shows a ~2x slowdown compared to running on Oracle JDK 8.
Steps to reproduce:
I have managed to run it with async-profiler through the sbt-jmh extras integration. Please see the flame graph reports for both JVMs, which were committed to the following directory: https://github.com/plokhotnyuk/jsoniter-scala/tree/graalvm-slowdown-on-big-method-2/profile-async