oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org

strong performance regression between graalVM EE 8 and EE 11 #3232

Closed LifeIsStrange closed 5 months ago

LifeIsStrange commented 3 years ago

The regression can be seen in the benchmarks of the fastest serialization library on the JVM: https://plokhotnyuk.github.io/jsoniter-scala/

munishchouhan commented 3 years ago

@LifeIsStrange thanks for reporting the issue. @dougxc please have a look and advise.

dougxc commented 3 years ago

We will look into this. Thanks for pointing it out. Note that the default GC changed between 8 and 11 which may explain some of the difference. There are also of course many other changes between 8 and 11 (e.g. runtime, library changes) which may explain the differences so I think characterizing this as a strong performance regression is not quite right.

dougxc commented 3 years ago

@plokhotnyuk would it be possible to take one or two of the more serious regressions and see if they reproduce while using -XX:+UseSerialGC on 11? That would at least help rule out or pinpoint one obvious candidate for the difference.

tkrodriguez commented 3 years ago

That graph presentation is very busy and offers few controls for the display. If you clone those pages with wget, you can tweak the underlying data to get pairwise comparisons, which look much better. I used

        wget --mirror --convert-links --page-requisites --no-parent -P . https://plokhotnyuk.github.io/jsoniter-scala/

and then, if you edit plokhotnyuk.github.io/jsoniter-scala/provided.js, you can change the value of providedBenchmarks to reorder or restrict which JDKs you see. Specifying only 2 results gives you a nicer bar chart which shows relative performance. Like this, for example:

[image: 11vsEE1]

You can also get these nice pairwise summaries. Within the same release, EE looks very good relative to Corretto.

[images: 8, 11]

There are some cases of major regressions that might deserve some investigation. Comparing Corretto 8 and 11 shows a fair number of regressions suggesting that general JDK changes are having negative impacts on the benchmarks.

[image: Corretto8v11]

Doing the same comparison within EE suggests that a lot of cases where we used to win in 8 are no longer so good with 11.

[image: EE8v11]

It seems like looking at major regressions between EE 8 and 11 that aren't also Corretto regressions might be the place to start. A little scripting could probably extract a more readable table that shows the potentially interesting ones.
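A minimal sketch of the kind of scripting suggested above, assuming the scores have already been extracted into memory. The class name, the `Scores` holder, the 0.8 threshold, and the sample numbers are all hypothetical placeholders, not values taken from the jsoniter-scala results. It keeps only benchmarks where EE 11 falls well below EE 8 while Corretto 11 stays close to Corretto 8.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegressionFilter {
    /** Throughput scores (higher is better) for one benchmark on four JVMs. Hypothetical holder type. */
    public static final class Scores {
        final double ee8, ee11, corretto8, corretto11;

        public Scores(double ee8, double ee11, double corretto8, double corretto11) {
            this.ee8 = ee8;
            this.ee11 = ee11;
            this.corretto8 = corretto8;
            this.corretto11 = corretto11;
        }
    }

    /**
     * Returns the benchmarks whose EE 11 / EE 8 ratio is below the threshold
     * while the Corretto 11 / Corretto 8 ratio is not.
     */
    public static List<String> eeOnlyRegressions(Map<String, Scores> results, double threshold) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Scores> e : results.entrySet()) {
            Scores s = e.getValue();
            double eeRatio = s.ee11 / s.ee8;
            double correttoRatio = s.corretto11 / s.corretto8;
            if (eeRatio < threshold && correttoRatio >= threshold) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up numbers for illustration only.
        Map<String, Scores> results = new LinkedHashMap<>();
        results.put("benchA", new Scores(100, 60, 90, 88));   // regresses on EE only
        results.put("benchB", new Scores(100, 60, 90, 50));   // regresses on both
        results.put("benchC", new Scores(100, 105, 90, 92));  // no regression
        System.out.println(eeOnlyRegressions(results, 0.8));  // prints [benchA]
    }
}
```

Feeding it the real data would still require parsing the JSON result files from the mirrored site, which is left out here because their exact layout is not described in this thread.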

mur47x111 commented 3 years ago

The performance difference between Java 8 and Java 11 is likely due to the String.getBytes call at com.github.plokhotnyuk.jsoniter_scala.core.JsonWriter.writeNonEscapedAsciiKey (JsonWriter.scala:144) [bci:107].

On Java 8, String.getBytes copies the bytes using a while loop:

        while (i < n) {
            dst[j++] = (byte)val[i++];
        }

which will be fully unrolled, given that this call in the application code always asks for 4 bytes. On Java 11, String.getBytes redirects to StringLatin1.getBytes, which uses System.arraycopy instead, calls into a runtime stub, and is consequently much slower.
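To make the two copy shapes concrete, here is a small self-contained sketch. It is not the JDK source: the class and method names are hypothetical, and the 4-character key "true" is just an example input. It only shows that an element-wise narrowing loop over a char[] and a bulk System.arraycopy over a byte[] produce the same bytes; it makes no claim about which is faster on a given compiler.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CopyStrategies {
    /** Element-wise narrowing copy, in the style of the Java 8 loop quoted above. */
    public static void copyWithLoop(char[] src, int srcBegin, int srcEnd, byte[] dst, int dstBegin) {
        int i = srcBegin;
        int j = dstBegin;
        while (i < srcEnd) {
            dst[j++] = (byte) src[i++];
        }
    }

    /** Bulk copy over a byte[]-backed (Latin-1) value via System.arraycopy. */
    public static void copyWithArraycopy(byte[] src, int srcBegin, int srcEnd, byte[] dst, int dstBegin) {
        System.arraycopy(src, srcBegin, dst, dstBegin, srcEnd - srcBegin);
    }

    public static void main(String[] args) {
        String key = "true"; // example 4-byte ASCII key
        byte[] viaLoop = new byte[4];
        byte[] viaArraycopy = new byte[4];

        copyWithLoop(key.toCharArray(), 0, 4, viaLoop, 0);
        copyWithArraycopy(key.getBytes(StandardCharsets.ISO_8859_1), 0, 4, viaArraycopy, 0);

        System.out.println(Arrays.equals(viaLoop, viaArraycopy)); // prints true
    }
}
```

With a constant length of 4, the loop version gives a compiler the chance to unroll it into straight-line stores, whereas the second version is a call that depends on how the compiler treats System.arraycopy.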

LifeIsStrange commented 3 years ago

Great find! Is there a way to fix that in openjdk/graalvm? Maybe @cl4es would be interested in taking a look, as he was recently working on optimizing openjdk string/charset performance (cf. https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html )

plokhotnyuk commented 3 years ago

> @plokhotnyuk would it be possible to take one or two of the more serious regressions and see if they reproduce while using -XX:+UseSerialGC on 11? That would at least help rule out or pinpoint one obvious candidate for the difference.

On all JVMs the -XX:+UseParallelGC option was turned on. The full list of JVM options that are set for all JVMs is here.
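For anyone who wants to double-check which collector a given run actually used, a small sketch using the standard java.lang.management API follows. The class name is hypothetical, and the exact collector names reported depend on the JVM and its flags.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ActiveGc {
    /** Returns the names of the garbage collectors registered in the running JVM. */
    public static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Run with e.g. -XX:+UseParallelGC or -XX:+UseSerialGC and compare the output.
        System.out.println(collectorNames());
    }
}
```

Running this once per JVM under the same options as the benchmarks would confirm that the GC setting is identical across the 8 and 11 runs.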

LifeIsStrange commented 2 years ago

@mur47x111 Hi, I was wondering if you could take another look at this. Your String.getBytes hypothesis is a great one, but it has been intrinsified to AVX in JDK 18 (https://cl4es.github.io/2021/10/17/Faster-Charset-Encoding.html) and was already somewhat optimized in JDK 17? If so, it's strange: yes, JDK 17 is almost as fast as JDK 8, but had the JDK 8 to 11 performance regression been totally fixed, JDK performance should probably not be so stagnant, considering the many optimizations it received across all those releases. So I theorize that either there was another regression after JDK 8 that still impacts JDK 17, or the .getBytes regression is only sufficiently fixed in JDK 18 (we don't have updated jsoniter-scala benchmarks for that yet).

Note, BTW, that even arraycopy might be intrinsified nowadays (https://github.com/openjdk/jdk/pull/61), but it's unclear to me whether there is an AVX 256 version or whether it's only for AVX 512 CPUs.

mur47x111 commented 2 years ago

Thanks for the info. I will take a look and update later.

mur47x111 commented 2 years ago

@LifeIsStrange I have merged a bunch of intrinsics, including some charset-related ones: https://github.com/oracle/graal/commit/47d1fb1556360c68907324e8b9f06b1651f5d1f1 . Could you please test whether it helps address this regression?

LifeIsStrange commented 2 years ago

I cannot, but maybe @plokhotnyuk can.

mur47x111 commented 5 months ago

Please re-open if the performance regression is still observable.