oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org

Worrying performance degradation in comparison to HotSpot #2542

Open informarte opened 4 years ago

informarte commented 4 years ago

Introduction

I am working on Yuck, a solver for combinatorial optimization problems specified using the MiniZinc language. The implementation language is Scala, but I don't use any FP library; my style is mainly OOP, using plain while loops when performance matters, but I also enjoy the amenities of the Scala library when possible. I normally use OpenJDK 11 but recently decided to create a native image and test its performance.

Description of issue

To test performance, I used a big test suite with 160 problems where the solver's runtime is limited to five minutes per problem, with a total runtime for the entire suite of approx. 8:30 hours. I ran this suite with HotSpot, GraalVM, and a native image of the solver.

The following chart shows the speedup distributions (speedup with respect to HotSpot) in various ways. (The solver uses local search and, to measure performance, I measure moves per second.)

[Chart: hotspot-vs-graalvm-vs-native-image]

It turned out that GraalVM is a bit faster overall than HotSpot, but there are some outliers to the left, i.e. problems where we can observe a severe performance degradation. The native image, however, is 50% slower on average (in terms of moves per second) than HotSpot. That's not exactly good news, but at this point I am more worried about the many outliers to the left, with an extreme case where performance dropped by three orders of magnitude! (Interestingly, the problems on which GraalVM performed poorly are a subset of those which are troubling the native image.)
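To be explicit about how I read these speedup numbers (a sketch of the idea only, not code from the Yuck repository): for each problem, a configuration's speedup is its moves-per-second rate divided by the HotSpot rate, so 1.0 means parity and 0.5 means half of HotSpot's throughput.

```scala
// Sketch only: speedup of a configuration relative to HotSpot on one problem.
def speedup(movesPerSecond: Double, movesPerSecondHotSpot: Double): Double =
  movesPerSecond / movesPerSecondHotSpot
```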

Steps to reproduce this issue

To set up for reproducing the issue, execute the following steps:

wget --continue https://github.com/MiniZinc/MiniZincIDE/releases/download/2.4.3/MiniZincIDE-2.4.3-bundle-linux-x86_64.tgz
tar xzf MiniZincIDE-2.4.3-bundle-linux-x86_64.tgz
export PATH=$PWD/MiniZincIDE-2.4.3-bundle-linux-x86_64/bin:$PATH
minizinc --version
git clone --recursive https://github.com/informarte/yuck
cd yuck
git checkout native-image-performance-issue

Now you can use the Makefile to apply the solver to benchmark problems, e.g.

make NATIVE=0 yuck.flatzinc.test.MiniZincChallenge2013#black_hole_4

(NATIVE=1 will build and run the native image.)

A run will leave log files in the project's tmp folder. In particular, there will be a yuck.json file with performance stats, among them moves-per-second.

With these tools, you can try different VMs and compare their performance to each other and to the native image. (Notice that, to rerun the solver on a problem, you have to move the log files away, otherwise it will not run!)
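If you just want to pull the metric out of a yuck.json file without further tooling, something like the following works. (A throwaway sketch: it assumes the stat appears under a key literally named moves-per-second; check the actual file, since the real structure may nest it differently.)

```scala
// Throwaway sketch for extracting the moves-per-second stat from yuck.json.
// Assumption: the value appears under a key literally named "moves-per-second".
import scala.io.Source

object ReadMovesPerSecond {
  def main(args: Array[String]): Unit = {
    val json = Source.fromFile(args(0)).mkString
    val pattern = """"moves-per-second"\s*:\s*([-0-9.eE+]+)""".r
    pattern.findFirstMatchIn(json) match {
      case Some(m) => println(s"moves per second: ${m.group(1)}")
      case None    => println("moves-per-second not found")
    }
  }
}
```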

These are five outlier problems (with speedups in comparison to HotSpot) you can play with:

| Problem | GraalVM speedup | Native-image speedup |
| --- | --- | --- |
| yuck.flatzinc.test.MiniZincChallenge2013#black_hole_4 | 0.66 | 0.24 |
| yuck.flatzinc.test.MiniZincChallenge2017#groupsplitter_u5g1pref0 | 0.64 | 0.22 |
| yuck.flatzinc.test.MiniZincChallenge2017#opt_cryptanalysis_r5 | 0.75 | 0.25 |
| yuck.flatzinc.test.MiniZincChallenge2013#proteindesign12_2TRX_11p_8aa_usingEref_self | 0.41 | 0.14 |
| yuck.flatzinc.test.MiniZincChallenge2017#ma_path_finding_ins_g16_p10_a10 | 0.68 | 0.001 |

Environment

gergo- commented 4 years ago

Tracking this internally as GR-24262.

gergo- commented 4 years ago

Hi @informarte, thank you for these very interesting benchmarks. I'm sorry it took me so long to start taking a detailed look. I can confirm that GraalVM CE has some slowdowns compared to OpenJDK on the outliers you highlight. On my machine these are less severe than on yours, but I will keep investigating.

What I believe I can already say is that the slowdown on MiniZincChallenge2013#proteindesign12_2TRX_11p_8aa_usingEref_self is mostly due to an optimization that is not present in, and not planned for, GraalVM CE. The next version of GraalVM EE should be considerably faster than OpenJDK on this benchmark.

I have also tried to replicate your native-image numbers, but when running the make command with NATIVE=1, the build fails with:

Exception in thread "main" com.oracle.svm.core.jdk.UnsupportedFeatureError: Invoke with MethodHandle argument could not be reduced to at most a single call or single field access. The method handle must be a compile time constant, e.g., be loaded from a `static final` field. Method that contains the method handle invocation: java.lang.invoke.LambdaForm$MH/774882201.invoke_MT(Object, Object)
    at com.oracle.svm.core.util.VMError.unsupportedFeature(VMError.java:86)
    at scala.runtime.Statics.releaseFence(Statics.java:148)
    at scala.collection.immutable.Vector.<init>(Vector.scala:135)
        ...

Before I dig further, can you confirm that you can cleanly build native images at commit c62d0cd in your native-image-performance-issue branch? Also, could you clarify exactly which version of GraalVM you tested? Your report first mentions 20.0.0, then 20.1.0.

gergo- commented 4 years ago

Ah, I see that commit 5fc58fd seems to allow a correct native image build.

gergo- commented 4 years ago

With respect to the apparent 0.001x speedup of native-image vs. OpenJDK on yuck.flatzinc.test.MiniZincChallenge2017#ma_path_finding_ins_g16_p10_a10, the problem seems to lie in the benchmarking methodology. On this problem the "Generating solver" step takes rather long -- on my machine, about 1 minute on GraalVM EE 20.1.0 vs. about 4 minutes with the native image. The default runtime limit is 300 seconds (5 minutes), so the native image only has 1 minute to do actual solving work, which is the relevant workload. Increasing the runtime limit to 10 minutes, the native image's "moves per second" increase dramatically to roughly 0.5x JIT speed.

Apparently the "moves per second" measure is affected by solver generation time, and/or the actual solver might go through various phases, with early phases doing fewer moves per second than later ones. It might make sense to have a benchmarking mode that excludes the solver generation time from the runtime limit, since in this case the result was misleading: nothing was actually 1000x slower.
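To put the arithmetic in one place (the steady-state rate below is made up, and I am assuming the reported figure divides the total moves by the whole wall-clock budget, which is one possible reading; the early-phase effect mentioned above would push the ratio down further):

```scala
// Made-up steady-state speed, real timings from the measurement above: the
// generation phase eats into a fixed 300 s budget, so even at identical raw
// solving speed the reported rates differ by the ratio of remaining solving time.
object LimitArithmetic {
  def main(args: Array[String]): Unit = {
    val limit          = 300.0          // seconds, default per-problem limit
    val rawMovesPerSec = 10000.0        // hypothetical, same for both configs
    val solvingJit     = limit - 60.0   // ~1 min "Generating solver" under the JIT
    val solvingNative  = limit - 240.0  // ~4 min with the native image

    def reported(solvingTime: Double): Double = rawMovesPerSec * solvingTime / limit

    println(f"reported JIT rate:    ${reported(solvingJit)}%.0f moves/s")
    println(f"reported native rate: ${reported(solvingNative)}%.0f moves/s")
    println(f"apparent speedup:     ${reported(solvingNative) / reported(solvingJit)}%.2f")
  }
}
```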

As for the 4x slowdown on the "Generating solver" phase, the hot part of this is a tight loop in Scala's HashSetIterator. The loop contains some null and array bounds checks that native-image currently doesn't hoist out of the loop like the GraalVM JIT does. We are aware of such limitations in native-image, and this is a focus of ongoing work.
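For illustration, this is the shape of loop in question (a made-up example, not the actual HashSetIterator code): every array access implies a null check on the array and a bounds check on the index, and a compiler that can prove them redundant once may hoist them out of the loop instead of paying for them on every iteration.

```scala
// Made-up example of a tight loop over a hash table's backing array.
// Each `table(i)` access implies a null check on `table` and a bounds check
// on `i`; hoisting these out of the loop removes per-iteration overhead.
def countNonNull(table: Array[AnyRef]): Int = {
  var count = 0
  var i = 0
  while (i < table.length) {
    if (table(i) != null) count += 1
    i += 1
  }
  count
}
```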

informarte commented 4 years ago

Regarding the ma_path_finding_ins_g16_p10_a10 instance, I investigated what happens in the early phase of solving. It turned out that the poor performance (in terms of moves per second) results from a basic property of the solver: whenever the solver finds a better proposal, it clones the current assignment (of values to variables, a hash map) and stores it (as the best proposal found so far). Usually, this approach does not hurt, but in this case it does because the problem has a lot of variables (> 313000) and the solver finds a lot of improvements quickly (> 11000 in the first minute on my test hardware).
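For reference, the pattern looks roughly like this (a minimal sketch with made-up names; the actual Yuck data structures are different):

```scala
// Rough sketch of the clone-on-improvement pattern (names made up). With
// > 313000 variables and > 11000 improvements in the first minute, the full
// copy on every improvement dominates the early search phase.
import scala.collection.mutable

object BestProposalTracking {
  type Assignment = mutable.HashMap[Int, Int] // variable id -> value

  private var bestProposal: Option[Assignment] = None

  def onImprovement(current: Assignment): Unit = {
    // O(number of variables) work each time a better proposal is found
    bestProposal = Some(current.clone())
  }
}
```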

gergo- commented 4 years ago

Hello again @informarte. Thank you for analyzing the path_finding benchmark and confirming that the hash map accesses are a hot part of the computation. (Though it looks like I had misunderstood the phase in which this was relevant in the benchmark.) In our ongoing work on native image performance I will definitely keep an eye on this problem. I will report back once we have results on this.

Regarding JIT performance, @mur47x111 recently committed 63cb636, fixing a performance bug that affected groupsplitter and proteindesign. These now have somewhat higher and especially more consistent performance. The speedups I measured on these benchmarks (average of 5 runs) were previously 0.67 and 0.45, respectively. These are now 0.86 and 0.52.

As noted above, the proteindesign case would need an optimization that is currently only available in GraalVM Enterprise Edition. In fact, my GraalVM EE numbers on these benchmarks are 1.48 and 1.21, respectively. If at some point you decide to re-run the whole benchmark set, I encourage you to also give GraalVM EE a try. In any case, if you try again (maybe after the 20.3 release in November), please do let us know about the results.

informarte commented 3 years ago

@gergo-, I wanted to repeat my performance test with native-image 20.2.0 and the current dev branch, but ran into this issue: https://github.com/oracle/graal/issues/2928

informarte commented 3 years ago

@gergo-, I repeated my experiment using the latest versions of everything (HotSpot 11.0.10, GraalVM 21.1.0, MiniZinc 2.5.5, Yuck 20210501). Executive summary: results have not changed a lot; GraalVM was a bit slower than HotSpot and the native image was about 50% slower than the VMs.

[Chart: graalvm]

I have not yet tried GraalVM Enterprise because it is not clear to me how to obtain a license for work outside of a company context, nor whether the resulting binary could be distributed without restrictions.

gergo- commented 3 years ago

Hi @informarte, thank you for the update. I have been meaning to run these benchmarks again to see how we are doing. I guess the most problematic cases for native-image are still the five that you highlighted above? I will benchmark this again to see what we are missing. The focus will certainly be on GraalVM EE, though it appears that CE is doing worse relative to HotSpot than before, so there might be some regressions to catch up on. Could you let me know especially which benchmark is now the slowest outlier at < 0.1x in JIT mode? (Also, is there a script in the repository that produces these excellent graphs?)

As for EE, you can get it for free "for testing, evaluation, or for developing non-production applications under the Oracle Technology Network License Agreement for GraalVM Enterprise Edition" (see https://www.graalvm.org/support/) via https://www.graalvm.org/downloads/. I can't give you legal advice beyond quoting that, unfortunately.

informarte commented 3 years ago

Hi @gergo-, a closer look revealed that the runs graalvm-21.1.0-ce and native-image-21.1.0-ce share only one outlier problem.

These are the graalvm-21.1.0-ce outliers:

| Problem | Speedup |
| --- | --- |
| yuck.flatzinc.test.MiniZincChallenge2017#ma_path_finding_ins_g16_p10_a10 | 0.06 |
| yuck.flatzinc.test.MiniZincChallenge2018#proteindesign12_1PGB_11p_9aa_usingEref_self_x | 0.59 |

These are the native-image-21.1.0-ce outliers:

| Problem | Speedup |
| --- | --- |
| yuck.flatzinc.test.MiniZincChallenge2015#nmseq_83 | 0.038 |
| yuck.flatzinc.test.MiniZincChallenge2017#opd_flener_et_al_15_350_100 | 0.19 |
| yuck.flatzinc.test.MiniZincChallenge2018#proteindesign12_1PGB_11p_9aa_usingEref_self_x | 0.21 |
| yuck.flatzinc.test.MiniZincChallenge2019#nside_EASY_200_50 | 0.33 |
| yuck.flatzinc.test.MiniZincChallenge2020#skill_allocation_1m_1 | 0.34 |
| yuck.flatzinc.test.MiniZincChallenge2019#rcpsp_j30_27_5_wet_diverse_3 | 0.36 |
| yuck.flatzinc.test.MiniZincChallenge2018#opt_cryptanalysis_r10 | 0.39 |
| yuck.flatzinc.test.MiniZincChallenge2013#black_hole_4 | 0.41 |
| yuck.flatzinc.test.MiniZincChallenge2012#filter_fir_1_1 | 0.44 |
| yuck.flatzinc.test.MiniZincChallenge2013#pentominoes_int_02 | 0.68 |

Regarding the ma_path_finding_ins_g16_p10_a10 instance: In the native-image-21.1.0-ce run, the FlatZinc compiler took too long and the search was not even started, so there is no speedup value. In the graalvm-21.1.0-ce run, the compiler finished but there were only ten seconds left for the solver, so the speedup value is misleading for the same reasons I explained in https://github.com/oracle/graal/issues/2542#issuecomment-683442700. On HotSpot, however, compilation took only three minutes, so two minutes were left for solving.

For all other outliers, the solver was running long enough to produce meaningful results.

To set up for running experiments yourself, proceed as follows:

wget --continue https://github.com/MiniZinc/MiniZincIDE/releases/download/2.5.5/MiniZincIDE-2.5.5-bundle-linux-x86_64.tgz
tar xzf MiniZincIDE-2.5.5-bundle-linux-x86_64.tgz
export PATH=$PWD/MiniZincIDE-2.5.5-bundle-linux-x86_64/bin:$PATH
minizinc --version
git clone --recursive https://github.com/informarte/yuck
cd yuck
git checkout spikes/native-image

To run a single test case in JIT mode, use the following command:

make yuck.flatzinc.test.MiniZincChallenge2012#filter_fir_1_1

To run a single test case in native mode, you need Makefile.native.test:

make -f Makefile.native.test yuck.flatzinc.test.MiniZincChallenge2012#filter_fir_1_1

A run will leave log files in the project's tmp folder. In particular, there will be a yuck.json file with performance stats, among them moves-per-second. (Notice that, to rerun the solver on a problem, you have to move the log files away, otherwise it will not run!)

To run the entire test suite, use:

make yuck.flatzinc.test.MiniZincChallenges

This will take about 12 hours.

Now say you have made two runs, one with HotSpot and one with GraalVM, and the results are in the folders hotspot-results and graalvm-results. To compare these runs and to generate a speedup graph, proceed as follows:

  1. Create a results database using import-results.py:
./scripts/import-results.py --db cmp.db hotspot $(find hotspot-results -name yuck.json)
./scripts/import-results.py --db cmp.db graalvm $(find graalvm-results -name yuck.json)
  2. Compute speedup stats using compute-speedup.py:
./scripts/compute-speedup.py --min-runtime 10 -p --db cmp.db hotspot graalvm