mutable.HashMap performance regression in Scala 2.12

fwbrasil commented 6 years ago

We're currently migrating from Scala 2.11.11 to 2.12.6 at Twitter and observed a considerable performance regression. One of the factors seems to be more costly mutable hash map lookups. I've copied HashMapBenchmark and compared the results:

Scala 2.11.11

[info] Benchmark                         (size)  (stringsOnly)  (useMissingValues)  Mode  Cnt      Score       Error  Units
[info] HashMapBenchmark.contains             10          false                true  avgt   20    260.206 ±    10.272  ns/op
[info] HashMapBenchmark.contains            100          false                true  avgt   20   2344.312 ±    89.308  ns/op
[info] HashMapBenchmark.contains           1000          false                true  avgt   20  38821.879 ±  2514.815  ns/op
[info] HashMapBenchmark.get                  10          false                true  avgt   20    281.659 ±     4.807  ns/op
[info] HashMapBenchmark.get                 100          false                true  avgt   20   2541.033 ±    28.172  ns/op
[info] HashMapBenchmark.get                1000          false                true  avgt   20  50316.082 ± 22211.830  ns/op
[info] HashMapBenchmark.getOrElse            10          false                true  avgt   20    346.650 ±     5.772  ns/op
[info] HashMapBenchmark.getOrElse           100          false                true  avgt   20   3307.532 ±   214.753  ns/op
[info] HashMapBenchmark.getOrElse          1000          false                true  avgt   20  47350.125 ±  1592.833  ns/op
[info] HashMapBenchmark.getOrElseUpdate      10          false                true  avgt   20    305.341 ±     6.406  ns/op
[info] HashMapBenchmark.getOrElseUpdate     100          false                true  avgt   20   3138.793 ±   160.772  ns/op
[info] HashMapBenchmark.getOrElseUpdate    1000          false                true  avgt   20  54917.471 ±  6087.586  ns/op

Scala 2.12.6

[info] Benchmark                         (size)  (stringsOnly)  (useMissingValues)  Mode  Cnt      Score      Error  Units
[info] HashMapBenchmark.contains             10          false                true  avgt   20    321.063 ±    9.857  ns/op
[info] HashMapBenchmark.contains            100          false                true  avgt   20   3505.131 ±  245.030  ns/op
[info] HashMapBenchmark.contains           1000          false                true  avgt   20  49901.758 ± 2116.674  ns/op
[info] HashMapBenchmark.get                  10          false                true  avgt   20    341.963 ±   10.851  ns/op
[info] HashMapBenchmark.get                 100          false                true  avgt   20   4004.163 ±  192.809  ns/op
[info] HashMapBenchmark.get                1000          false                true  avgt   20  56306.171 ± 2128.746  ns/op
[info] HashMapBenchmark.getOrElse            10          false                true  avgt   20    442.490 ±    9.025  ns/op
[info] HashMapBenchmark.getOrElse           100          false                true  avgt   20   5111.343 ±   28.240  ns/op
[info] HashMapBenchmark.getOrElse          1000          false                true  avgt   20  67521.642 ± 1711.990  ns/op
[info] HashMapBenchmark.getOrElseUpdate      10          false                true  avgt   20    381.666 ±   26.239  ns/op
[info] HashMapBenchmark.getOrElseUpdate     100          false                true  avgt   20   4395.782 ±   36.841  ns/op
[info] HashMapBenchmark.getOrElseUpdate    1000          false                true  avgt   20  60440.328 ± 1636.287  ns/op

To reproduce:

git clone git@github.com:fwbrasil/scala-graal.git
cd scala-graal
sbt "+jmh:run -w 1 -r 1 MutableHashMapBenchmark"

Note: the project is called scala-graal because I've been using it to test graal changes, but the jmh benchmark runs on C2 by default.

ShaneDelmore commented 6 years ago

@fwbrasil We should test with this patch: https://github.com/scala/scala/pull/7091

ShaneDelmore commented 6 years ago

@SethTisue Thinks these may be unrelated Mutable.HashMap performance regressions. But either way, worth a test with 2.12.7-bin-727964b

retronym commented 6 years ago

$ git diff 2.11.x..v2.12.6 src/library/scala/collection/mutable/HashTable.scala src/library/scala/runtime/BoxesRunTime.java | pbcopy

diff --git a/src/library/scala/collection/mutable/HashTable.scala b/src/library/scala/collection/mutable/HashTable.scala
index 4873aa3c3e..7ee1987e46 100644
--- a/src/library/scala/collection/mutable/HashTable.scala
+++ b/src/library/scala/collection/mutable/HashTable.scala
@@ -367,7 +367,11 @@ trait HashTable[A, Entry >: Null <: HashEntry[A, Entry]] extends HashTable.HashU
     * Note: we take the most significant bits of the hashcode, not the lower ones
     * this is of crucial importance when populating the table in parallel
     */
-  protected final def index(hcode: Int): Int = if (table.length == 1) 0 else improve(hcode, seedvalue) >>> numberOfLeadingZeros(table.length - 1)
+  protected final def index(hcode: Int): Int = {
+    val ones = table.length - 1
+    val exponent = Integer.numberOfLeadingZeros(ones)
+    (improve(hcode, seedvalue) >>> exponent) & ones
+  }

   protected def initWithContents(c: HashTable.Contents[A, Entry]) = {
     if (c != null) {
@@ -395,13 +399,13 @@ private[collection] object HashTable {
   /** The load factor for the hash table (in 0.001 step).
    */
   private[collection] final def defaultLoadFactor: Int = 750 // corresponds to 75%
-  private[collection] final def loadFactorDenum = 1000
+  private[collection] final def loadFactorDenum = 1000 // should be loadFactorDenom, but changing that isn't binary compatible

   private[collection] final def newThreshold(_loadFactor: Int, size: Int) = ((size.toLong * _loadFactor) / loadFactorDenum).toInt

   private[collection] final def sizeForThreshold(_loadFactor: Int, thr: Int) = ((thr.toLong * loadFactorDenum) / _loadFactor).toInt

-  private[collection] final def capacity(expectedSize: Int) = if (expectedSize == 0) 1 else powerOfTwo(expectedSize)
+  private[collection] final def capacity(expectedSize: Int) = nextPositivePowerOfTwo(expectedSize)

   trait HashUtils[KeyType] {
     protected final def sizeMapBucketBitSize = 5
@@ -421,7 +425,7 @@ private[collection] object HashTable {
       * h = h + (h << 4)
       * h ^ (h >>> 10)
       * }}}
-      * the rest of the computation is due to SI-5293
+      * the rest of the computation is due to scala/bug#5293
       */
     protected final def improve(hcode: Int, seed: Int): Int = rotateRight(byteswap32(hcode), seed)
   }
@@ -429,16 +433,7 @@ private[collection] object HashTable {
   /**
    * Returns a power of two >= `target`.
    */
-  private[collection] def powerOfTwo(target: Int): Int = {
-    /* See http://bits.stephan-brumme.com/roundUpToNextPowerOfTwo.html */
-    var c = target - 1
-    c |= c >>>  1
-    c |= c >>>  2
-    c |= c >>>  4
-    c |= c >>>  8
-    c |= c >>> 16
-    c + 1
-  }
+  private[collection] def nextPositivePowerOfTwo(target: Int): Int = 1 << -numberOfLeadingZeros(target - 1)

   class Contents[A, Entry >: Null <: HashEntry[A, Entry]](
     val loadFactor: Int,
diff --git a/src/library/scala/runtime/BoxesRunTime.java b/src/library/scala/runtime/BoxesRunTime.java
index 9cb1dee41c..6b3874fc1f 100644
--- a/src/library/scala/runtime/BoxesRunTime.java
+++ b/src/library/scala/runtime/BoxesRunTime.java
@@ -179,7 +179,7 @@ public final class BoxesRunTime
         return xc.equals(y);
     }

-    private static boolean equalsNumChar(java.lang.Number xn, java.lang.Character yc) {
+    public static boolean equalsNumChar(java.lang.Number xn, java.lang.Character yc) {
         if (yc == null)
             return xn == null;

@@ -198,70 +198,6 @@ public final class BoxesRunTime
         }
     }

-    /** Hashcode algorithm is driven by the requirements imposed
-     *  by primitive equality semantics, namely that equal objects
-     *  have equal hashCodes.  The first priority are the integral/char
-     *  types, which already have the same hashCodes for the same
-     *  values except for Long.  So Long's hashCode is altered to
-     *  conform to Int's for all values in Int's range.
-     *
-     *  Float is problematic because it's far too small to hold
-     *  all the Ints, so for instance Int.MaxValue.toFloat claims
-     *  to be == to each of the largest 64 Ints.  There is no way
-     *  to preserve equals/hashCode alignment without compromising
-     *  the hashCode distribution, so Floats are only guaranteed
-     *  to have the same hashCode for whole Floats in the range
-     *  Short.MinValue to Short.MaxValue (2^16 total.)
-     *
-     *  Double has its hashCode altered to match the entire Int range,
-     *  but is not guaranteed beyond that.  (But could/should it be?
-     *  The hashCode is only 32 bits so this is a more tractable
-     *  issue than Float's, but it might be better simply to exclude it.)
-     *
-     *  Note: BigInt and BigDecimal, being arbitrary precision, could
-     *  be made consistent with all other types for the Int range, but
-     *  as yet have not.
-     *
-     *  Note: Among primitives, Float.NaN != Float.NaN, but the boxed
-     *  versions are equal.  This still needs reconciliation.
-     */
-    public static int hashFromLong(java.lang.Long n) {
-        int iv = n.intValue();
-        if (iv == n.longValue()) return iv;
-        else return n.hashCode();
-    }
-    public static int hashFromDouble(java.lang.Double n) {
-        int iv = n.intValue();
-        double dv = n.doubleValue();
-        if (iv == dv) return iv;
-
-        long lv = n.longValue();
-        if (lv == dv) return java.lang.Long.valueOf(lv).hashCode();
-
-        float fv = n.floatValue();
-        if (fv == dv) return java.lang.Float.valueOf(fv).hashCode();
-        else return n.hashCode();
-    }
-    public static int hashFromFloat(java.lang.Float n) {
-        int iv = n.intValue();
-        float fv = n.floatValue();
-        if (iv == fv) return iv;
-
-        long lv = n.longValue();
-        if (lv == fv) return java.lang.Long.valueOf(lv).hashCode();
-        else return n.hashCode();
-    }
-    public static int hashFromNumber(java.lang.Number n) {
-      if (n instanceof java.lang.Long) return hashFromLong((java.lang.Long)n);
-      else if (n instanceof java.lang.Double) return hashFromDouble((java.lang.Double)n);
-      else if (n instanceof java.lang.Float) return hashFromFloat((java.lang.Float)n);
-      else return n.hashCode();
-    }
-    public static int hashFromObject(Object a) {
-      if (a instanceof Number) return hashFromNumber((Number)a);
-      else return a.hashCode();
-    }
-
     private static int unboxCharOrInt(Object arg1, int code) {
       if (code == CHAR)
         return ((java.lang.Character) arg1).charValue();

Comparative profiles: https://www.dropbox.com/sh/adpdurb9t3qm1mu/AAD0V9Wyscj05eMWbNvSGyfSa?dl=0

Gathered with:

$ sbt "+jmh:run -w 1 -r 1 MutableHashMapBenchmark.contains -psize=1000 -jvmArgs -XX:+UnlockDiagnosticVMOptions -jvmArgs -XX:+DebugNonSafepoints -jvmArgs  -prof jmh.extras.Async -f1 -jvmArgs -Dsun.reflect.inflationThreshold=0"

retronym commented 6 years ago

With the default level of -XX:InlineLevel=9.

2.12.6 doesn't inline all of BoxesRuntime into contains:

Whereas 2.11.11 does:

Simply setting -XX:+MaxInlineLevel=15 or higher sounds it would help, but doesn't seem to. We'll need to analyze the JIT logs some more to see why.

BTW, I'm getting the JIT logs with:

$  (export V=2.11.11; sbt "++$V" "jmh:run -w 1 -r 1 MutableHashMapBenchmark.contains -psize=1000 -jvmArgs -XX:+UnlockDiagnosticVMOptions -jvmArgs -XX:+DebugNonSafepoints -f1 -jvmArgs -Dsun.reflect.inflationThreshold=0 -jvmArgs -jvmArgs -XX:+TraceClassLoading -jvmArgs -XX:+LogCompilation -jvmArgs -XX:LogFile=/Users/jz/Dropbox/Public/hashmap-benchmark/$V/hotspot.log")

retronym commented 6 years ago

Another thing to try is a more surgical CompilerControl to effectively add @ForceInline to the right places (e.g. BoxesRuntime.equals*, maybe even all trait forwarder methods.

retronym commented 6 years ago

I should also advertise mutable.AnyRefMap, which avoids all the BoxesRuntime indirection by design, and is suitable in cases when you concretely know that your key type is <: AnyRef. Similarly, a java.util.HashMap, perhaps wrapped in .toScala, is worth considering for performance sensitive code that doesn't require cooperative equality semantics.

retronym commented 6 years ago

Adding -XX:MaxInlineLevel=18` to 2.12.6 shows up the same JIT inlining logs as 2.11.

In addition to the extra bridging in 2.12.6 due to the new trait encoding, I've noticed another difference.

In 2.11.11, the static trait implementation method of HashTable.elemEquals was type casing the element and dispatching to specific comparison methods in BoxesRuntime.

  // access flags 0x9
  public static elemEquals(Lscala/collection/mutable/HashTable;Ljava/lang/Object;Ljava/lang/Object;)Z
    ALOAD 1
    ALOAD 2
    IF_ACMPNE L0
    ICONST_1
    GOTO L1
   L0
    ALOAD 1
    IFNONNULL L2
    ICONST_0
    GOTO L1
   L2
    ALOAD 1
    INSTANCEOF java/lang/Number
    IFEQ L3
    ALOAD 1
    CHECKCAST java/lang/Number
    ALOAD 2
    INVOKESTATIC scala/runtime/BoxesRunTime.equalsNumObject (Ljava/lang/Number;Ljava/lang/Object;)Z
    GOTO L1
   L3
    ALOAD 1
    INSTANCEOF java/lang/Character
    IFEQ L4
    ALOAD 1
    CHECKCAST java/lang/Character
    ALOAD 2
    INVOKESTATIC scala/runtime/BoxesRunTime.equalsCharObject (Ljava/lang/Character;Ljava/lang/Object;)Z
    GOTO L1
   L4
    ALOAD 1
    ALOAD 2
    INVOKEVIRTUAL java/lang/Object.equals (Ljava/lang/Object;)Z
   L1
    IFEQ L5
    ICONST_1
    GOTO L6
   L5
    ICONST_0
   L6
    IRETURN
    MAXSTACK = 2
    MAXLOCALS = 3

Looks like the old optimizer inlined AOT

$ scalac-launch 2.11.11 -optimise $(f "trait C { def elemEquals(a: Any, b: Any) = a == b }") && jardiff 'C$class.class'

+++ C$class.class.asm
// class version 50.0 (50)
// access flags 0x421
public abstract class C$class {

  // access flags 0x9
  public static $init$(LC;)V
    RETURN
    MAXSTACK = 0
    MAXLOCALS = 1

  // access flags 0x9
  public static elemEquals(LC;Ljava/lang/Object;Ljava/lang/Object;)Z
    ALOAD 1
    ALOAD 2
    IF_ACMPEQ L0
    ALOAD 1
    IFNULL L1
    ALOAD 1
    INSTANCEOF java/lang/Number
    IFNE L2
    ALOAD 1
    INSTANCEOF java/lang/Character
    IFNE L3
    ALOAD 1
    ALOAD 2
    INVOKEVIRTUAL java/lang/Object.equals (Ljava/lang/Object;)Z
    GOTO L4
   L3
    ALOAD 1
    CHECKCAST java/lang/Character
    ALOAD 2
    INVOKESTATIC scala/runtime/BoxesRunTime.equalsCharObject (Ljava/lang/Character;Ljava/lang/Object;)Z
    GOTO L4
   L2
    ALOAD 1
    CHECKCAST java/lang/Number
    ALOAD 2
    INVOKESTATIC scala/runtime/BoxesRunTime.equalsNumObject (Ljava/lang/Number;Ljava/lang/Object;)Z
    GOTO L4
   L1
    ICONST_0
    GOTO L4
   L0
    ICONST_1
   L4
    IFEQ L5
    ICONST_1
    GOTO L6
   L5
    ICONST_0
   L6
    IRETURN
    MAXSTACK = 2
    MAXLOCALS = 3
}
~/code/jdk-skara on master*
$ scalac-launch 2.11.11 $(f "trait C { def elemEquals(a: Any, b: Any) = a == b }") && jardiff 'C$class.class'

+++ C$class.class.asm
// class version 50.0 (50)
// access flags 0x421
public abstract class C$class {

  // access flags 0x9
  public static $init$(LC;)V
    RETURN
    MAXSTACK = 0
    MAXLOCALS = 1

  // access flags 0x9
  public static elemEquals(LC;Ljava/lang/Object;Ljava/lang/Object;)Z
    ALOAD 1
    ALOAD 2
    INVOKESTATIC scala/runtime/BoxesRunTime.equals (Ljava/lang/Object;Ljava/lang/Object;)Z
    IFEQ L0
    ICONST_1
    GOTO L1
   L0
    ICONST_0
   L1
    IRETURN
    MAXSTACK = 2
    MAXLOCALS = 3
}

retronym commented 6 years ago

Okay, I think I've spotted the primary problem.

In https://github.com/scala/scala/pull/5098, the code gen for Any.## was changed to call Statics.anyHash rather than ScalaRuntime$.hash. But anyHash does three instanceof-s ( Float / Int / Short in a row, rather first checking for Number.

Adding that type test gets performance much closer to 2.11. I can squeeze a little more by hand-crafting elemEquals to optimize for the common case of keys of identical classes.

https://github.com/scala/scala/compare/2.12.x...retronym:topic/hashityhash?expand=1

$ (for V in 2.11.11 2.12.6 2.12.7-bin-2d7eaf7-SNAPSHOT 2.12.7-bin-642e034-SNAPSHOT; do echo "SCALA=$V"; sbt "++$V" "jmh:run -w 1 -r 1 MutableHashMapBenchmark.contains -psize=1000  -f2 "; done) | egrep 'SCALA=|MutableHashMapBenchmark.*ns/op'
SCALA=2.11.11
[info] MutableHashMapBenchmark.contains    1000          false                true  avgt   20  29863.646 ± 199.904  ns/op
SCALA=2.12.6
[info] MutableHashMapBenchmark.contains    1000          false                true  avgt   20  40008.217 ± 849.133  ns/op
SCALA=2.12.7-bin-2d7eaf7-SNAPSHOT
[info] MutableHashMapBenchmark.contains    1000          false                true  avgt   20  33488.598 ± 174.012  ns/op
SCALA=2.12.7-bin-642e034-SNAPSHOT
[info] MutableHashMapBenchmark.contains    1000          false                true  avgt   20  31533.114 ± 320.620  ns/op
/code/scala-graal on master*
$ (for V in 2.12.7-bin-642e034-SNAPSHOT; do echo "SCALA=$V"; sbt "++$V" "jmh:run -w 1 -r 1 MutableHashMapBenchmark.contains -psize=1000  -f2 -jvmArgs -XX:MaxInlineLevel=15"; done) | egrep 'SCALA=|MutableHashMapBenchmark.*ns/op'
SCALA=2.12.7-bin-642e034-SNAPSHOT
[info] MutableHashMapBenchmark.contains    1000          false                true  avgt   20  31064.080 ± 270.190  ns/op

retronym commented 6 years ago

BTW, @fwbrasil There are probably a few things in scala.runtime that you could improve with intrinsics in Graal.

e.g. in

public static boolean equals2(Object x, Object y) {
        if (x instanceof java.lang.Number)
            return equalsNumObject((java.lang.Number)x, y);
        if (x instanceof java.lang.Character)
            return equalsCharObject((java.lang.Character)x, y);
        if (x == null)
            return y == null;

        return x.equals(y);
    }

Both type tests are comparing against the same slot in the x-s superclass array, so an intrinsic could just grab that once and compare with the two different classes.

retronym commented 6 years ago

Thanks, @fwbrasil @ShaneDelmore for the report.

fwbrasil commented 6 years ago

@retronym thank you for the quick fix!

scala / bug

mutable.HashMap performance regression in Scala 2.12 #11171