compare some benchmarks 2.11 vs 2.12 vs Scala.js

lrytz commented 8 years ago

While looking at some benchmarks, I found that Sudoku (taken from https://github.com/jonas/scala-js-benchmarks) runs slower on 2.12 than 2.11 (with no optimizers enabled): 1500 vs 1300.

get some reliable numbers (jmh?)
find out differences in bytecode (maybe it's the library that's slower, not the application bytecode?)

Also look at other benchmarks. For example, try to find out why the Scala.js optimizer makes "Richards" 3x faster (https://youtu.be/IvB1APFZK5Q?t=4m2s), while the 2.11 and 2.12 optimizers don't change anything. Is the Scala.js optimizer doing things that help only on the JS VMs, or would they also improve perf on the JVM?

TODO

take a look at GenBCode, see if we can easily improve the generated branching instructions
run some other benchmarks than "Richards"
try to understand improvements in Scala.js

lrytz commented 8 years ago

Here's the "Richards" benchmark written using JMH: https://github.com/lrytz/benchmarks/blob/master/src/main/scala/misc/Richards.scala

On my machine:

2.11.7
[info] Richards.run  avgt   10  0.113 ± 0.001  ms/op

2.11.7, -optimise
[info] Richards.run  avgt   10  0.113 ± 0.002  ms/op

2.11.7, -Ybackend:GenBCode
[info] Richards.run  avgt   10  0.126 ± 0.003  ms/op

2.12.0-newopt
[info] Richards.run  avgt   10  0.126 ± 0.004  ms/op

2.12.0-newopt, -Yopt:l:classpath
[info] Richards.run  avgt   10  0.113 ± 0.003  ms/op

On my slow linux box (Celeron N3050):

2.11.7
[info] Richards.run  avgt   10  0.437 ± 0.003  ms/op

2.11.7, -optimise
[info] Richards.run  avgt   10  0.439 ± 0.008  ms/op

2.11.7, -Ybackend:GenBCode
[info] Richards.run  avgt   10  0.427 ± 0.003  ms/op

2.12.0-newopt
[info] Richards.run  avgt   10  0.431 ± 0.006  ms/op

2.12.0-newopt, -Yopt:l:classpath
[info] Richards.run  avgt   10  0.436 ± 0.005  ms/op

Observations

The code produced by GenBCode seems to run slower on my machine, but not on the linux box. Details below.
Neither the old nor the new optimizer improves performance over the 2.11.7 baseline
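
To make the comparison behind these observations explicit, here is a small sketch that takes three of the "Richards" numbers from my machine and computes the relative change plus a crude check of whether the reported error ranges overlap. The `Result` class and the helper names are made up for illustration, and interval overlap is only a rough sanity check, not a proper significance test.

```scala
object BenchCompare {
  // One JMH-style result line: mean score and reported ± error, both in ms/op.
  // (Hypothetical helper type, not part of JMH.)
  final case class Result(label: String, mean: Double, error: Double) {
    def low: Double = mean - error
    def high: Double = mean + error
  }

  // Relative change of `other` with respect to `base` (0.10 means 10% slower).
  def relativeChange(base: Result, other: Result): Double =
    (other.mean - base.mean) / base.mean

  // True if the two [mean - error, mean + error] ranges share any point.
  def intervalsOverlap(a: Result, b: Result): Boolean =
    a.low <= b.high && b.low <= a.high

  // Numbers copied from the "Richards" results on my machine above.
  val genAsm   = Result("2.11.7", 0.113, 0.001)
  val genBCode = Result("2.11.7, -Ybackend:GenBCode", 0.126, 0.003)
  val newOpt   = Result("2.12.0-newopt, -Yopt:l:classpath", 0.113, 0.003)

  def main(args: Array[String]): Unit = {
    val pct = relativeChange(genAsm, genBCode) * 100
    println(f"GenBCode vs GenASM: $pct%.1f%% change")
    println(s"GenASM / GenBCode ranges overlap: ${intervalsOverlap(genAsm, genBCode)}")
    println(s"GenASM / newopt+classpath ranges overlap: ${intervalsOverlap(genAsm, newOpt)}")
  }
}
```

The same helpers can be pointed at the linux box numbers to see that the picture there is different.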

I looked at the bytecode produced by 2.11.7 with GenASM and GenBCode. It's in this repo: https://github.com/lrytz/richardsBenchBytecode/commits/master.

The differences are

different branching patterns (IFNONNULL vs IFNULL), leading to a few more jumping instructions in GenBCode
some unnecessary ACONST_NULL; POP sequences in GenBCode
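
For anyone who wants to look at this kind of difference without going through the whole benchmark, a tiny null-check method is probably enough to compare the branching code of the two backends. The object and method below are hypothetical and not taken from Richards. The idea is to compile the file once with plain 2.11.7 `scalac` and once with `-Ybackend:GenBCode`, then disassemble the result with `javap -c 'NullCheckDemo$'` and diff the two outputs. Which of IFNULL or IFNONNULL shows up in each case is something to read off the output, not something this sketch asserts.

```scala
object NullCheckDemo {
  // Hypothetical helper: a single reference null check, so the compiled
  // method should contain one null-test branch to inspect with javap.
  def lengthOrZero(s: String): Int =
    if (s ne null) s.length else 0

  def main(args: Array[String]): Unit = {
    println(lengthOrZero("abc"))
    println(lengthOrZero(null))
  }
}
```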

Enabling the new optimizer cleans up the jumps and removes the additional ones. This seems to bring the performance back to GenASM level on my machine: 2.12.0-newopt with -Yopt:l:classpath has the same speed as 2.11.7 with GenASM. Again, on the linux box, we don't see any of that.

lrytz commented 8 years ago

Sudoku

My machine:

2.11.7
[info] Sudoku.run  avgt   10  1.103 ± 0.014  ms/op

2.11.7, -Ybackend:GenBCode
[info] Sudoku.run  avgt   10  1.113 ± 0.014  ms/op

2.11.7, -optimise
[info] Sudoku.run  avgt   10  1.101 ± 0.019  ms/op

2.12.0-newopt
[info] Sudoku.run  avgt   10  1.120 ± 0.010  ms/op

2.12.0-newopt, -Yopt:l:classpath
[info] Sudoku.run  avgt   10  1.120 ± 0.013  ms/op

Linux box:

2.11.7
[info] Sudoku.run  avgt   10  3.808 ± 0.295  ms/op

2.11.7, -Ybackend:GenBCode
[info] Sudoku.run  avgt   10  3.999 ± 0.298  ms/op

2.11.7, -optimise
[info] Sudoku.run  avgt   10  3.902 ± 0.211  ms/op

2.12.0-newopt
[info] Sudoku.run  avgt   10  3.919 ± 0.207  ms/op

2.12.0-newopt, -Yopt:l:classpath
[info] Sudoku.run  avgt   10  3.929 ± 0.266  ms/op

The situation looks very similar to "Richards".

lrytz commented 8 years ago

Some more findings

The hot spot in "Sudoku" is the innermost loop, which is a for-comprehension with an if, resulting in a WithFilter.map call, which is a bit slow. See here: https://github.com/lrytz/benchmarks/blob/4d43af04c060285e2e717a7b00262ac4d921a3f5/src/main/scala/misc/Sudoku.scala#L125.
The Scala.js optimizer has a 2.1x speedup on "Richards". Most of the speedup comes from inlining the for-comprehension over a range (https://github.com/lrytz/benchmarks/blob/4d43af04c060285e2e717a7b00262ac4d921a3f5/src/main/scala/misc/Richards.scala#L393). Doing that manually already gives most of the speedup. On the JVM, replacing the for-comprehension by a while loop doesn't change performance (it's the only range-foreach in the program, so there's no megamorphism).
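
To illustrate the two patterns mentioned in these findings, here is a minimal sketch. It is not the benchmark code: the method names and loop bodies are made up. The first pair shows a for-comprehension with a guard next to roughly what the compiler turns it into (a `withFilter` followed by `map`). The second pair shows a for loop over a range (which becomes a `Range.foreach` call with a closure) next to the hand-written while loop that replaces it.

```scala
object ForDesugarDemo {
  // For-comprehension with an `if` guard, the shape of the Sudoku hot spot.
  def viaFor(xs: Seq[Int]): Seq[Int] =
    for (x <- xs if x % 2 == 0) yield x * 10

  // Roughly the desugared form: withFilter, then map.
  def viaWithFilter(xs: Seq[Int]): Seq[Int] =
    xs.withFilter(x => x % 2 == 0).map(x => x * 10)

  // For loop over a range: compiles to (0 until n).foreach(closure).
  def sumFor(n: Int): Int = {
    var acc = 0
    for (i <- 0 until n) acc += i
    acc
  }

  // Manual replacement with a while loop, no closure and no foreach call.
  def sumWhile(n: Int): Int = {
    var acc = 0
    var i = 0
    while (i < n) {
      acc += i
      i += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val xs = Seq(1, 2, 3, 4)
    println(viaFor(xs) == viaWithFilter(xs))
    println(sumFor(100) == sumWhile(100))
  }
}
```

Both pairs compute the same results; the only point is the shape of the code that ends up in the bytecode.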

SethTisue commented 5 years ago

@lrytz this seems stale, should it stay open?

scala / scala-dev

compare some benchmarks 2.11 vs 2.12 vs Scala.js #81