Closed fwbrasil closed 6 years ago
@fwbrasil We should test with this patch: https://github.com/scala/scala/pull/7091
@SethTisue Thinks these may be unrelated Mutable.HashMap performance regressions. But either way, worth a test with 2.12.7-bin-727964b
$ git diff 2.11.x..v2.12.6 src/library/scala/collection/mutable/HashTable.scala src/library/scala/runtime/BoxesRunTime.java | pbcopy
diff --git a/src/library/scala/collection/mutable/HashTable.scala b/src/library/scala/collection/mutable/HashTable.scala
index 4873aa3c3e..7ee1987e46 100644
--- a/src/library/scala/collection/mutable/HashTable.scala
+++ b/src/library/scala/collection/mutable/HashTable.scala
@@ -367,7 +367,11 @@ trait HashTable[A, Entry >: Null <: HashEntry[A, Entry]] extends HashTable.HashU
* Note: we take the most significant bits of the hashcode, not the lower ones
* this is of crucial importance when populating the table in parallel
*/
- protected final def index(hcode: Int): Int = if (table.length == 1) 0 else improve(hcode, seedvalue) >>> numberOfLeadingZeros(table.length - 1)
+ protected final def index(hcode: Int): Int = {
+ val ones = table.length - 1
+ val exponent = Integer.numberOfLeadingZeros(ones)
+ (improve(hcode, seedvalue) >>> exponent) & ones
+ }
protected def initWithContents(c: HashTable.Contents[A, Entry]) = {
if (c != null) {
@@ -395,13 +399,13 @@ private[collection] object HashTable {
/** The load factor for the hash table (in 0.001 step).
*/
private[collection] final def defaultLoadFactor: Int = 750 // corresponds to 75%
- private[collection] final def loadFactorDenum = 1000
+ private[collection] final def loadFactorDenum = 1000 // should be loadFactorDenom, but changing that isn't binary compatible
private[collection] final def newThreshold(_loadFactor: Int, size: Int) = ((size.toLong * _loadFactor) / loadFactorDenum).toInt
private[collection] final def sizeForThreshold(_loadFactor: Int, thr: Int) = ((thr.toLong * loadFactorDenum) / _loadFactor).toInt
- private[collection] final def capacity(expectedSize: Int) = if (expectedSize == 0) 1 else powerOfTwo(expectedSize)
+ private[collection] final def capacity(expectedSize: Int) = nextPositivePowerOfTwo(expectedSize)
trait HashUtils[KeyType] {
protected final def sizeMapBucketBitSize = 5
@@ -421,7 +425,7 @@ private[collection] object HashTable {
* h = h + (h << 4)
* h ^ (h >>> 10)
* }}}
- * the rest of the computation is due to SI-5293
+ * the rest of the computation is due to scala/bug#5293
*/
protected final def improve(hcode: Int, seed: Int): Int = rotateRight(byteswap32(hcode), seed)
}
@@ -429,16 +433,7 @@ private[collection] object HashTable {
/**
* Returns a power of two >= `target`.
*/
- private[collection] def powerOfTwo(target: Int): Int = {
- /* See http://bits.stephan-brumme.com/roundUpToNextPowerOfTwo.html */
- var c = target - 1
- c |= c >>> 1
- c |= c >>> 2
- c |= c >>> 4
- c |= c >>> 8
- c |= c >>> 16
- c + 1
- }
+ private[collection] def nextPositivePowerOfTwo(target: Int): Int = 1 << -numberOfLeadingZeros(target - 1)
class Contents[A, Entry >: Null <: HashEntry[A, Entry]](
val loadFactor: Int,
diff --git a/src/library/scala/runtime/BoxesRunTime.java b/src/library/scala/runtime/BoxesRunTime.java
index 9cb1dee41c..6b3874fc1f 100644
--- a/src/library/scala/runtime/BoxesRunTime.java
+++ b/src/library/scala/runtime/BoxesRunTime.java
@@ -179,7 +179,7 @@ public final class BoxesRunTime
return xc.equals(y);
}
- private static boolean equalsNumChar(java.lang.Number xn, java.lang.Character yc) {
+ public static boolean equalsNumChar(java.lang.Number xn, java.lang.Character yc) {
if (yc == null)
return xn == null;
@@ -198,70 +198,6 @@ public final class BoxesRunTime
}
}
- /** Hashcode algorithm is driven by the requirements imposed
- * by primitive equality semantics, namely that equal objects
- * have equal hashCodes. The first priority are the integral/char
- * types, which already have the same hashCodes for the same
- * values except for Long. So Long's hashCode is altered to
- * conform to Int's for all values in Int's range.
- *
- * Float is problematic because it's far too small to hold
- * all the Ints, so for instance Int.MaxValue.toFloat claims
- * to be == to each of the largest 64 Ints. There is no way
- * to preserve equals/hashCode alignment without compromising
- * the hashCode distribution, so Floats are only guaranteed
- * to have the same hashCode for whole Floats in the range
- * Short.MinValue to Short.MaxValue (2^16 total.)
- *
- * Double has its hashCode altered to match the entire Int range,
- * but is not guaranteed beyond that. (But could/should it be?
- * The hashCode is only 32 bits so this is a more tractable
- * issue than Float's, but it might be better simply to exclude it.)
- *
- * Note: BigInt and BigDecimal, being arbitrary precision, could
- * be made consistent with all other types for the Int range, but
- * as yet have not.
- *
- * Note: Among primitives, Float.NaN != Float.NaN, but the boxed
- * versions are equal. This still needs reconciliation.
- */
- public static int hashFromLong(java.lang.Long n) {
- int iv = n.intValue();
- if (iv == n.longValue()) return iv;
- else return n.hashCode();
- }
- public static int hashFromDouble(java.lang.Double n) {
- int iv = n.intValue();
- double dv = n.doubleValue();
- if (iv == dv) return iv;
-
- long lv = n.longValue();
- if (lv == dv) return java.lang.Long.valueOf(lv).hashCode();
-
- float fv = n.floatValue();
- if (fv == dv) return java.lang.Float.valueOf(fv).hashCode();
- else return n.hashCode();
- }
- public static int hashFromFloat(java.lang.Float n) {
- int iv = n.intValue();
- float fv = n.floatValue();
- if (iv == fv) return iv;
-
- long lv = n.longValue();
- if (lv == fv) return java.lang.Long.valueOf(lv).hashCode();
- else return n.hashCode();
- }
- public static int hashFromNumber(java.lang.Number n) {
- if (n instanceof java.lang.Long) return hashFromLong((java.lang.Long)n);
- else if (n instanceof java.lang.Double) return hashFromDouble((java.lang.Double)n);
- else if (n instanceof java.lang.Float) return hashFromFloat((java.lang.Float)n);
- else return n.hashCode();
- }
- public static int hashFromObject(Object a) {
- if (a instanceof Number) return hashFromNumber((Number)a);
- else return a.hashCode();
- }
-
private static int unboxCharOrInt(Object arg1, int code) {
if (code == CHAR)
return ((java.lang.Character) arg1).charValue();
Comparative profiles: https://www.dropbox.com/sh/adpdurb9t3qm1mu/AAD0V9Wyscj05eMWbNvSGyfSa?dl=0
Gathered with:
$ sbt "+jmh:run -w 1 -r 1 MutableHashMapBenchmark.contains -psize=1000 -jvmArgs -XX:+UnlockDiagnosticVMOptions -jvmArgs -XX:+DebugNonSafepoints -jvmArgs -prof jmh.extras.Async -f1 -jvmArgs -Dsun.reflect.inflationThreshold=0"
With the default level of -XX:InlineLevel=9
.
2.12.6 doesn't inline all of BoxesRuntime
into contains
:
Whereas 2.11.11 does:
Simply setting -XX:+MaxInlineLevel=15
or higher sounds it would help, but doesn't seem to. We'll need to analyze the JIT logs some more to see why.
BTW, I'm getting the JIT logs with:
$ (export V=2.11.11; sbt "++$V" "jmh:run -w 1 -r 1 MutableHashMapBenchmark.contains -psize=1000 -jvmArgs -XX:+UnlockDiagnosticVMOptions -jvmArgs -XX:+DebugNonSafepoints -f1 -jvmArgs -Dsun.reflect.inflationThreshold=0 -jvmArgs -jvmArgs -XX:+TraceClassLoading -jvmArgs -XX:+LogCompilation -jvmArgs -XX:LogFile=/Users/jz/Dropbox/Public/hashmap-benchmark/$V/hotspot.log")
Another thing to try is a more surgical CompilerControl
to effectively add @ForceInline
to the right places (e.g. BoxesRuntime.equals*
, maybe even all trait forwarder methods.
I should also advertise mutable.AnyRefMap
, which avoids all the BoxesRuntime
indirection by design, and is suitable in cases when you concretely know that your key type is <: AnyRef
. Similarly, a java.util.HashMap
, perhaps wrapped in .toScala
, is worth considering for performance sensitive code that doesn't require cooperative equality semantics.
Adding -XX:MaxInlineLevel=18` to 2.12.6 shows up the same JIT inlining logs as 2.11.
In addition to the extra bridging in 2.12.6 due to the new trait encoding, I've noticed another difference.
In 2.11.11, the static trait implementation method of HashTable.elemEquals
was type casing the element and dispatching to specific comparison methods in BoxesRuntime
.
// access flags 0x9
public static elemEquals(Lscala/collection/mutable/HashTable;Ljava/lang/Object;Ljava/lang/Object;)Z
ALOAD 1
ALOAD 2
IF_ACMPNE L0
ICONST_1
GOTO L1
L0
ALOAD 1
IFNONNULL L2
ICONST_0
GOTO L1
L2
ALOAD 1
INSTANCEOF java/lang/Number
IFEQ L3
ALOAD 1
CHECKCAST java/lang/Number
ALOAD 2
INVOKESTATIC scala/runtime/BoxesRunTime.equalsNumObject (Ljava/lang/Number;Ljava/lang/Object;)Z
GOTO L1
L3
ALOAD 1
INSTANCEOF java/lang/Character
IFEQ L4
ALOAD 1
CHECKCAST java/lang/Character
ALOAD 2
INVOKESTATIC scala/runtime/BoxesRunTime.equalsCharObject (Ljava/lang/Character;Ljava/lang/Object;)Z
GOTO L1
L4
ALOAD 1
ALOAD 2
INVOKEVIRTUAL java/lang/Object.equals (Ljava/lang/Object;)Z
L1
IFEQ L5
ICONST_1
GOTO L6
L5
ICONST_0
L6
IRETURN
MAXSTACK = 2
MAXLOCALS = 3
Looks like the old optimizer inlined AOT
$ scalac-launch 2.11.11 -optimise $(f "trait C { def elemEquals(a: Any, b: Any) = a == b }") && jardiff 'C$class.class'
+++ C$class.class.asm
// class version 50.0 (50)
// access flags 0x421
public abstract class C$class {
// access flags 0x9
public static $init$(LC;)V
RETURN
MAXSTACK = 0
MAXLOCALS = 1
// access flags 0x9
public static elemEquals(LC;Ljava/lang/Object;Ljava/lang/Object;)Z
ALOAD 1
ALOAD 2
IF_ACMPEQ L0
ALOAD 1
IFNULL L1
ALOAD 1
INSTANCEOF java/lang/Number
IFNE L2
ALOAD 1
INSTANCEOF java/lang/Character
IFNE L3
ALOAD 1
ALOAD 2
INVOKEVIRTUAL java/lang/Object.equals (Ljava/lang/Object;)Z
GOTO L4
L3
ALOAD 1
CHECKCAST java/lang/Character
ALOAD 2
INVOKESTATIC scala/runtime/BoxesRunTime.equalsCharObject (Ljava/lang/Character;Ljava/lang/Object;)Z
GOTO L4
L2
ALOAD 1
CHECKCAST java/lang/Number
ALOAD 2
INVOKESTATIC scala/runtime/BoxesRunTime.equalsNumObject (Ljava/lang/Number;Ljava/lang/Object;)Z
GOTO L4
L1
ICONST_0
GOTO L4
L0
ICONST_1
L4
IFEQ L5
ICONST_1
GOTO L6
L5
ICONST_0
L6
IRETURN
MAXSTACK = 2
MAXLOCALS = 3
}
~/code/jdk-skara on master*
$ scalac-launch 2.11.11 $(f "trait C { def elemEquals(a: Any, b: Any) = a == b }") && jardiff 'C$class.class'
+++ C$class.class.asm
// class version 50.0 (50)
// access flags 0x421
public abstract class C$class {
// access flags 0x9
public static $init$(LC;)V
RETURN
MAXSTACK = 0
MAXLOCALS = 1
// access flags 0x9
public static elemEquals(LC;Ljava/lang/Object;Ljava/lang/Object;)Z
ALOAD 1
ALOAD 2
INVOKESTATIC scala/runtime/BoxesRunTime.equals (Ljava/lang/Object;Ljava/lang/Object;)Z
IFEQ L0
ICONST_1
GOTO L1
L0
ICONST_0
L1
IRETURN
MAXSTACK = 2
MAXLOCALS = 3
}
Okay, I think I've spotted the primary problem.
In https://github.com/scala/scala/pull/5098, the code gen for Any.##
was changed to call Statics.anyHash
rather than ScalaRuntime$.hash
. But anyHash
does three instanceof
-s ( Float
/ Int
/ Short
in a row, rather first checking for Number
.
Adding that type test gets performance much closer to 2.11. I can squeeze a little more by hand-crafting elemEquals
to optimize for the common case of keys of identical classes.
https://github.com/scala/scala/compare/2.12.x...retronym:topic/hashityhash?expand=1
$ (for V in 2.11.11 2.12.6 2.12.7-bin-2d7eaf7-SNAPSHOT 2.12.7-bin-642e034-SNAPSHOT; do echo "SCALA=$V"; sbt "++$V" "jmh:run -w 1 -r 1 MutableHashMapBenchmark.contains -psize=1000 -f2 "; done) | egrep 'SCALA=|MutableHashMapBenchmark.*ns/op'
SCALA=2.11.11
[info] MutableHashMapBenchmark.contains 1000 false true avgt 20 29863.646 ± 199.904 ns/op
SCALA=2.12.6
[info] MutableHashMapBenchmark.contains 1000 false true avgt 20 40008.217 ± 849.133 ns/op
SCALA=2.12.7-bin-2d7eaf7-SNAPSHOT
[info] MutableHashMapBenchmark.contains 1000 false true avgt 20 33488.598 ± 174.012 ns/op
SCALA=2.12.7-bin-642e034-SNAPSHOT
[info] MutableHashMapBenchmark.contains 1000 false true avgt 20 31533.114 ± 320.620 ns/op
/code/scala-graal on master*
$ (for V in 2.12.7-bin-642e034-SNAPSHOT; do echo "SCALA=$V"; sbt "++$V" "jmh:run -w 1 -r 1 MutableHashMapBenchmark.contains -psize=1000 -f2 -jvmArgs -XX:MaxInlineLevel=15"; done) | egrep 'SCALA=|MutableHashMapBenchmark.*ns/op'
SCALA=2.12.7-bin-642e034-SNAPSHOT
[info] MutableHashMapBenchmark.contains 1000 false true avgt 20 31064.080 ± 270.190 ns/op
BTW, @fwbrasil There are probably a few things in scala.runtime
that you could improve with intrinsics in Graal.
e.g. in
public static boolean equals2(Object x, Object y) {
if (x instanceof java.lang.Number)
return equalsNumObject((java.lang.Number)x, y);
if (x instanceof java.lang.Character)
return equalsCharObject((java.lang.Character)x, y);
if (x == null)
return y == null;
return x.equals(y);
}
Both type tests are comparing against the same slot in the x
-s superclass array, so an intrinsic could just grab that once and compare with the two different classes.
Thanks, @fwbrasil @ShaneDelmore for the report.
@retronym thank you for the quick fix!
We're currently migrating from Scala 2.11.11 to 2.12.6 at Twitter and observed a considerable performance regression. One of the factors seems to be more costly mutable hash map lookups. I've copied
HashMapBenchmark
and compared the results:Scala 2.11.11
Scala 2.12.6
To reproduce:
Note: the project is called
scala-graal
because I've been using it to test graal changes, but the jmh benchmark runs onC2
by default.