oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org

GraalVM optimizer fails to remove dead code. #898

Open LeickR opened 5 years ago

LeickR commented 5 years ago

I ran a benchmark that calculates a Fibonacci-style sequence which includes source code of the form:

    c = a + b
    d = a + b << 1

(This represents an attempt to hand-optimize the algorithm for superscalar parallelism.)

To avoid loop optimization issues, the loop in the benchmark method has been hand unrolled, and it calculates the next 60 integers in the sequence.
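For readers without the attachment, here is a minimal sketch of the two-terms-per-step pattern described above. It is not the attached benchmark: the class and method names are hypothetical, it keeps a loop rather than the hand-unrolled form, and it assumes the intended grouping is `a + (b << 1)`. In Java, `a + b << 1` written without parentheses parses as `(a + b) << 1`, so the parentheses matter.

```java
// Hypothetical illustration only; not the code from the attached Fibonacci.txt.
final class FibonacciPairSketch {

    // Returns F(n) by advancing two terms per iteration.
    // Assumption: d is meant to be a + 2*b, i.e. a + (b << 1).
    static long fibPair(int n) {
        long a = 0; // F(0)
        long b = 1; // F(1)
        for (int i = 0; i < n / 2; i++) {
            long c = a + b;        // F(k+2)
            long d = a + (b << 1); // F(k+3) = F(k) + 2*F(k+1); does not depend on c
            a = c;
            b = d;
        }
        // After n/2 steps, a = F(2*(n/2)) and b = F(2*(n/2) + 1).
        return (n % 2 == 0) ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(fibPair(10)); // prints 55
    }
}
```

Because `c` and `d` both depend only on `a` and `b`, the two additions can issue in the same cycle on a superscalar core, which is the point of the hand optimization.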

However, the compiled assembly generated by the GraalVM optimizer includes “dead” mov instructions that unnecessarily move each intermediate result to memory. These are overwritten without ever being read.

In contrast, the HotSpot optimizer compiles to assembly that is virtually identical, except that these dead mov instructions have been removed.

Note that, in my tests, these spurious mov instructions appear to have been “hidden” by superscalar parallelism and so did not affect performance. However, they could exhibit a noticeable performance effect on other cores. In either case, these instructions result in spurious writes to the cache that burn power unnecessarily.

GraalVM EE version: JDK 1.8.0_192, GraalVM 1.0.0-rc9, 25.192-b12-jvmci-0.49. HotSpot version: JDK 11.0.1, Java HotSpot(TM) 64-Bit Server VM, 11.0.1+13-LTS.

All tests were run on an X5-2 server equipped with two 12-core Intel Xeon E5-2690 v3 CPUs @ 2.60GHz.

In the attached JMH benchmark, the relevant benchmark method is fibonacci_para2.

LeickR commented 5 years ago

Fibonacci.txt

LeickR commented 5 years ago

I re-ran the benchmark on a number of other processors (all in the Xeon family) to see if the uneliminated “dead” code might result in a performance loss on a different processor.

I found that it does, in fact, result in a 9% loss of performance (compared to HotSpot) when running on a Xeon E5-2660, confirming my suspicion that these spurious mov instructions could result in a significant performance loss on some cores.

tkrodriguez commented 5 years ago

Yes, I noticed that in https://github.com/oracle/graal/issues/897. Removing those stores is required to get good performance from the lea pattern. We aren't that aggressive about redundant store removal in generic code, since PEA (partial escape analysis) tends to remove the most common sources of this.