tnielens opened this issue 2 years ago
See https://github.com/pityka/saddle/commit/bc8f2dedcf82424e6e00960837f1032fb0ad83b5 on how to inspect the machine instructions.
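For reference, a minimal sketch of how such an inspection can be wired up through JMH. This is an assumption on my part rather than the exact setup of the linked commit: the class name DisassemblyBench and the dot-product body are purely illustrative, the flags are the standard HotSpot diagnostic ones, and the hsdis disassembler library must be installed for the assembly to actually be printed.

import org.openjdk.jmh.annotations._

// Hypothetical benchmark used only to dump the JIT-compiled assembly of one
// hot method; requires the hsdis disassembler library on the JVM's path.
@Fork(
  value = 1,
  jvmArgsAppend = Array(
    "-XX:+UnlockDiagnosticVMOptions",
    // print only the method of interest instead of the very verbose -XX:+PrintAssembly
    "-XX:CompileCommand=print,*DisassemblyBench.dot"
  )
)
@State(Scope.Benchmark)
class DisassemblyBench {
  val a: Array[Double] = Array.fill(10000)(scala.util.Random.nextDouble())
  val b: Array[Double] = Array.fill(10000)(scala.util.Random.nextDouble())

  @Benchmark
  def dot(): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      sum += a(i) * b(i)
      i += 1
    }
    sum
  }
}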
Initial impressions:
Pushed a commit with updated jmh runs.
@tnielens You asked a few weeks ago about those macro-based inlined ops. It turns out there are some jmh benchmarks in the tree for them. At least in some of these benchmarks the macros are on par with the raw arrays, while the regular ops are much slower. In other runs all three implementations were similar.
Also, I am not sure how to interpret the dot product results. I expected the manual unroll to be the slowest, since it supposedly does not trigger auto-vectorization (following the linked presentation).
[info] Benchmark (size) Mode Cnt Score Error Units
[info] DiagOuterMBench.blas 10000 thrpt 6 344.898 ± 7.183 ops/s
[info] DiagOuterMBench.java 10000 thrpt 6 199.842 ± 2.361 ops/s
[info] DiagOuterMBench.java_unroll 10000 thrpt 6 342.964 ± 6.161 ops/s
[info] DotBench.blas 500 thrpt 6 4513031.226 ± 717923.742 ops/s
[info] DotBench.blas 1000 thrpt 6 3131796.506 ± 176731.660 ops/s
[info] DotBench.blas 10000 thrpt 6 272513.086 ± 42022.945 ops/s
[info] DotBench.raw 500 thrpt 6 2062884.020 ± 48700.200 ops/s
[info] DotBench.raw 1000 thrpt 6 1028810.536 ± 33906.870 ops/s
[info] DotBench.raw 10000 thrpt 6 101922.090 ± 3294.607 ops/s
[info] DotBench.raw_unrolled 500 thrpt 6 4120157.830 ± 67755.724 ops/s
[info] DotBench.raw_unrolled 1000 thrpt 6 2122200.069 ± 49000.282 ops/s
[info] DotBench.raw_unrolled 10000 thrpt 6 205568.475 ± 4060.935 ops/s
[info] FourInPlaceScalarLongOpBench.array 10 thrpt 6 828314.792 ± 10742.271 ops/s
[info] FourInPlaceScalarLongOpBench.array 100 thrpt 6 8377.487 ± 495.774 ops/s
[info] FourInPlaceScalarLongOpBench.array 10000 thrpt 6 0.772 ± 0.014 ops/s
[info] FourInPlaceScalarLongOpBench.saddleInlined 10 thrpt 6 839221.606 ± 13943.735 ops/s
[info] FourInPlaceScalarLongOpBench.saddleInlined 100 thrpt 6 8104.838 ± 249.580 ops/s
[info] FourInPlaceScalarLongOpBench.saddleInlined 10000 thrpt 6 0.807 ± 0.025 ops/s
[info] FourInPlaceScalarLongOpBench.saddleVirtual 10 thrpt 6 407474.259 ± 8826.223 ops/s
[info] FourInPlaceScalarLongOpBench.saddleVirtual 100 thrpt 6 3569.076 ± 1871.019 ops/s
[info] FourInPlaceScalarLongOpBench.saddleVirtual 10000 thrpt 6 0.288 ± 0.009 ops/s
[info] FourInPlaceScalarOpBench.array 10 thrpt 6 3365232.709 ± 148713.142 ops/s
[info] FourInPlaceScalarOpBench.array 100 thrpt 6 31907.610 ± 759.861 ops/s
[info] FourInPlaceScalarOpBench.array 10000 thrpt 6 2.041 ± 0.045 ops/s
[info] FourInPlaceScalarOpBench.saddleInlined 10 thrpt 6 3222275.123 ± 87499.198 ops/s
[info] FourInPlaceScalarOpBench.saddleInlined 100 thrpt 6 31790.690 ± 471.759 ops/s
[info] FourInPlaceScalarOpBench.saddleInlined 10000 thrpt 6 2.044 ± 0.043 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual 10 thrpt 6 735414.566 ± 13669.387 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual 100 thrpt 6 6774.385 ± 3574.689 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual 10000 thrpt 6 0.437 ± 0.015 ops/s
[info] SingleInPlaceScalarOpBench.array 10 thrpt 6 35472703.133 ± 1859036.673 ops/s
[info] SingleInPlaceScalarOpBench.array 100 thrpt 6 344887.796 ± 6034.971 ops/s
[info] SingleInPlaceScalarOpBench.array 10000 thrpt 6 11.812 ± 0.998 ops/s
[info] SingleInPlaceScalarOpBench.saddleInlined 10 thrpt 6 34920249.527 ± 1743154.820 ops/s
[info] SingleInPlaceScalarOpBench.saddleInlined 100 thrpt 6 350826.518 ± 4471.170 ops/s
[info] SingleInPlaceScalarOpBench.saddleInlined 10000 thrpt 6 11.523 ± 0.163 ops/s
[info] SingleInPlaceScalarOpBench.saddleVirtual 10 thrpt 6 34390962.439 ± 2473616.086 ops/s
[info] SingleInPlaceScalarOpBench.saddleVirtual 100 thrpt 6 351251.137 ± 2994.687 ops/s
[info] SingleInPlaceScalarOpBench.saddleVirtual 10000 thrpt 6 11.508 ± 0.383 ops/s
[info] TwoInPlaceScalarOpBench.array 10 thrpt 10 21418287.695 ± 710209.148 ops/s
[info] TwoInPlaceScalarOpBench.array 100 thrpt 10 173638.571 ± 3125.913 ops/s
[info] TwoInPlaceScalarOpBench.array 10000 thrpt 10 5.823 ± 0.115 ops/s
[info] TwoInPlaceScalarOpBench.saddleInlined 10 thrpt 10 19015864.107 ± 296979.926 ops/s
[info] TwoInPlaceScalarOpBench.saddleInlined 100 thrpt 10 175873.725 ± 5163.732 ops/s
[info] TwoInPlaceScalarOpBench.saddleInlined 10000 thrpt 10 6.166 ± 0.129 ops/s
[info] TwoInPlaceScalarOpBench.saddleVirtual 10 thrpt 10 20191119.494 ± 364440.585 ops/s
[info] TwoInPlaceScalarOpBench.saddleVirtual 100 thrpt 10 179552.482 ± 3896.277 ops/s
[info] TwoInPlaceScalarOpBench.saddleVirtual 10000 thrpt 10 6.208 ± 0.042 ops/s
JDK 17.0.2, older Intel CPU with AVX1.
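One possible explanation, which is my own assumption rather than something verified against the DotBench source: the plain loop accumulates into a single double, so every iteration depends on the previous add, and HotSpot will not reorder or vectorize that floating-point reduction because it would change rounding; an unrolled version with several independent accumulators breaks that dependency chain. Roughly the two loop shapes being compared (illustrative, not the actual benchmark code):

// Illustrative only: one accumulator, so every iteration depends on the
// previous add; HotSpot keeps the adds in order.
def dotSimple(a: Array[Double], b: Array[Double]): Double = {
  var sum = 0.0
  var i = 0
  while (i < a.length) {
    sum += a(i) * b(i)
    i += 1
  }
  sum
}

// Illustrative only: four independent accumulators let the CPU overlap the
// multiply-adds; this is likely what the raw_unrolled benchmark exploits.
def dotUnrolled(a: Array[Double], b: Array[Double]): Double = {
  var s0 = 0.0; var s1 = 0.0; var s2 = 0.0; var s3 = 0.0
  val n = a.length
  var i = 0
  while (i < n - 3) {
    s0 += a(i) * b(i)
    s1 += a(i + 1) * b(i + 1)
    s2 += a(i + 2) * b(i + 2)
    s3 += a(i + 3) * b(i + 3)
    i += 4
  }
  // handle the remaining 0-3 elements
  while (i < n) {
    s0 += a(i) * b(i)
    i += 1
  }
  s0 + s1 + s2 + s3
}

Under that reading, raw_unrolled beating raw by roughly 2x would be consistent with instruction-level parallelism rather than SIMD.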
Awesome!
Great post you linked: https://blogs.oracle.com/javamagazine/post/java-hotspot-hsdis-disassembler
Good to know about the existing benchmarks. I wanna have a look at them.
Some remarks about FourInPlaceScalarOpBench and why saddleVirtual performs worse. Note that TwoInPlaceScalarOpBench performs equally well; performance starts to deteriorate once a third distinct operator is applied in sequence. This is probably due to type-optimistic optimizations that are deoptimized once the third distinct operator is observed by the VM.
In MatMatElemOp.apply, the wrapped Scala binary operator op is invoked through a virtual call. If only one (or two) subtypes are encountered, the VM generates code optimized for them and can inline and vectorize. This is described in the OpenJDK wiki: "Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types)."
This can be observed with this benchmark:
import org.openjdk.jmh.annotations._
import org.saddle._

@State(Scope.Benchmark)
// important: @Fork(0) keeps all benchmark methods in one JVM, so the shared
// virtual call site sees every operator type and its profile gets polluted
@Fork(0)
@Threads(1)
class FourInPlaceScalarOpBench {
  @Param(Array("100"))
  var size: Int = _
  var m1: Mat[Double] = _
  var b: Double = _

  @Setup(Level.Iteration)
  def setup() = {
    m1 = mat.randn(size, size)
    b = scala.util.Random.nextDouble()
  }

  @Benchmark
  def saddleVirtual1Add(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 += b
    m1
  }

  @Benchmark
  def saddleVirtual1Mul(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 *= b
    m1
  }

  @Benchmark
  def saddleVirtual1Sub(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 -= b
    m1
  }

  @Benchmark
  def saddleVirtual2Add(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 += b
    m1
  }

  @Benchmark
  def saddleVirtual2Mul(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 *= b
    m1
  }

  @Benchmark
  def saddleVirtual2Sub(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 -= b
    m1
  }
}
Results:
[info] Benchmark (size) Mode Cnt Score Error Units
[info] FourInPlaceScalarOpBench.saddleVirtual1Add 100 thrpt 2 913836.177 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Mul 100 thrpt 2 949849.358 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Sub 100 thrpt 2 27578.577 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Add 100 thrpt 2 27544.876 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Mul 100 thrpt 2 17066.358 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Sub 100 thrpt 2 27697.968 ops/s
This behavior is also described in one of the jmh samples.
Very interesting, thank you! Unfortunately the Op is very likely to be strongly polymorphic in real-world client code, i.e. that virtual call will be deoptimized unless it is inlined by the macro.
I will open a pr to document this.
Another possibility is to rework the saddle types to eliminate the Op interface. However this would likely lead to the same inlined code as in the macro module.
I would not 100% oppose simplifying this part of saddle, that is removing the whole BinOp abstraction machinery.
Adding extra @inline annotations seems to prevent the megamorphic deoptimizations. The loop should be fully inlined at the call site and optimized separately. Extra annotations on the following methods bring regular BinOps on par with hand-written while loops and macros.
// BinOpMat
@inline
implicit def MatSclrElmOpIpDDD[Op <: ScalarOp](implicit
    op: BinOp[Op, Double, Double, Double]
)

final class MatSclrElemOp[
    OP <: ScalarOp,
    @spec(Boolean, Int, Long, Double) A,
    @spec(Boolean, Int, Long, Double) B,
    @spec(Boolean, Int, Long, Double) C: ST
](val op: BinOp[OP, A, B, C])
    extends BinOp[OP, Mat[A], B, Mat[C]] {
  @inline def apply(v1: Mat[A], v2: B) = {
    ...

// NumericOps
@inline
final def +=[B](
    other: B
)(implicit op: BinOpInPlace[Add, This, B]): Unit =
  op(this, other)

@inline
final def -=[B](
    other: B
)(implicit op: BinOpInPlace[Subtract, This, B]): Unit =
  op(this, other)
// ...
Results
[info] Benchmark (size) Mode Cnt Score Error Units
[info] FourInPlaceScalarOpBench.saddleVirtual1Add 100 thrpt 1004988.957 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Div 100 thrpt 212128.273 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Mul 100 thrpt 998022.449 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Sub 100 thrpt 991270.099 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Add 100 thrpt 994354.111 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Div 100 thrpt 211754.855 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Mul 100 thrpt 998168.775 ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Sub 100 thrpt 973682.103 ops/s
That said, the Scala optimizer's inlining must be enabled explicitly in user code for these annotations to take effect. Details are here, and there are some caveats with binary compatibility. Libraries based on saddle should not enable inlining.
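For illustration, an end-user application that depends on saddle would opt in with something along these lines (a hypothetical build.sbt fragment; the flag spellings are the Scala 2.13 ones discussed in this thread):

// build.sbt of the final application, not of a library:
// enable the Scala optimizer so saddle's @inline-annotated methods are
// actually inlined into the application's call sites.
scalacOptions ++= Seq(
  "-opt:local",               // intra-method (local) optimizations
  "-opt:inline:org.saddle.**" // allow inlining code coming from saddle packages
)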
Thank you for the investigation. If I understand correctly, that means that if we sprinkle enough annotations and the end user of the final application adds -opt:inline:** (and potentially -opt:local), then the megamorphic call sites will be inlined. I would vote for this solution over the macros. The only thing is that we need a test suite to ensure this actually happens.
What is your suggestion on how we should proceed? Should we first make a benchmarking suite where all operations and methods are checked against a hand-written array-based implementation, and use that as a guide? (I don't propose including that in the CI.)
I've been trying out other options. Here is an attempt at using @inline annotations only internally, shielding the user from that complexity. It goes a bit in the direction of the macro-based implementation.
// Binary element-wise operation on one Mat and one scalar
@inline
private def matSclrElemOpIpBody[
    OP <: ScalarOp,
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](v1: Mat[A], v2: B, op: BinOp[OP, A, B, A]): Unit = {
  val sz = v1.length
  val v1a = v1.toArray
  var i = 0
  while (i < sz) {
    v1a(i) = op(v1a(i), v2)
    i += 1
  }
}

implicit def matSclrElemOpIpAdd[
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](implicit op: BinOp[Add, A, B, A]): BinOpInPlace[Add, Mat, A, B] =
  matSclrElemOpIpBody(_, _, op)

implicit def matSclrElemOpIpSub[
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](implicit op: BinOp[Subtract, A, B, A]): BinOpInPlace[Subtract, Mat, A, B] =
  matSclrElemOpIpBody(_, _, op)

implicit def matSclrElemOpIpMul[
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](implicit op: BinOp[Multiply, A, B, A]): BinOpInPlace[Multiply, Mat, A, B] =
  matSclrElemOpIpBody(_, _, op)

implicit def matSclrElemOpIpDiv[
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](implicit op: BinOp[Divide, A, B, A]): BinOpInPlace[Divide, Mat, A, B] =
  matSclrElemOpIpBody(_, _, op)

// ... etc
The idea is to inline the body of the while loop once per operation to avoid megamorphic call sites. Specialization takes care of duplicating the loop further for each specialization combination. This works fine, with one annoyance: it requires raising -XX:MaxInlineLevel to ~13 (the default is 9).
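For the benchmarks themselves, that flag can be passed to a forked JVM through JMH; a small sketch below (hypothetical class name, and note it forks, unlike the @Fork(0) benchmark above, since a forked JVM is needed to apply fresh JVM flags):

import org.openjdk.jmh.annotations._
import org.saddle._

// Raise the JIT's maximum inlining depth in the forked benchmark JVM so the
// extra call layers introduced by the internal @inline helpers still get inlined.
@Fork(value = 1, jvmArgsAppend = Array("-XX:MaxInlineLevel=13"))
@State(Scope.Benchmark)
class DeepInlineScalarOpBench {
  @Param(Array("100"))
  var size: Int = _
  var m1: Mat[Double] = _
  var b: Double = _

  @Setup(Level.Iteration)
  def setup() = {
    m1 = mat.randn(size, size)
    b = scala.util.Random.nextDouble()
  }

  @Benchmark
  def saddleAdd(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 += b
    m1
  }
}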
In this latest attempt, do you compile saddle itself with -opt:inline:org.saddle.**?
Yes. That's already part of the build config. I didn't need to add anything.
Out of the three alternatives (macro, inlining later, inlining within saddle), which one do you prefer? If we can solve this problem with inlining then I would remove the macro. I don't know whether there is any benefit to inlining later, when the app is compiled, e.g. smaller bytecode size?
I'm hesitant about this. Both @inline and Scala 2 macros are dropped in Scala 3. I don't think we should set out on a large refactoring and rework important bits on top of deprecated features. I'm not so keen on my @inline-based proposals above, because they rely on JVM optimizations or JIT options that saddle can't test properly. I plan to spend some time understanding what breeze did. They seem to have used macros in the past and now have a scalameta-based solution for Scala 2/3 cross-building.
@pityka breeze has quite an elegant solution with their macro (turned into a scalameta codegen plugin for cross-compiling Scala 2 and 3). The expand annotation is documented here; see a usage example of the annotation in DenseVectorOps.scala and some general info in Develop.md as well.
Here is an extract of the code generation output of DenseVectorOps.scala:
trait DenseVector_Vector_ExpandOps extends VectorOps with DenseVector_TraversalOps with DenseVector_SlicingOps {

  implicit val impl_OpMulInner_DV_V_eq_S_Int: breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Int], Vector[Int], Int] = {
    new breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Int], Vector[Int], Int] {
      def apply(a: DenseVector[Int], b: Vector[Int]) = {
        require(b.length == a.length, "Vectors must be the same length!")
        val ad = a.data
        var aoff = a.offset
        var result: Int = 0
        cforRange(0 until a.length) { i =>
          result += ad(aoff) * b(i)
          aoff += a.stride
        }
        result
      }
      implicitly[BinaryRegistry[Vector[Int], Vector[Int], OpMulInner.type, Int]].register(this)
    }
  }

  implicit val impl_OpMulInner_DV_V_eq_S_Double: breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Double], Vector[Double], Double] = {
    new breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Double], Vector[Double], Double] {
      def apply(a: DenseVector[Double], b: Vector[Double]) = {
        require(b.length == a.length, "Vectors must be the same length!")
        val ad = a.data
        var aoff = a.offset
        var result: Double = 0.0d
        cforRange(0 until a.length) { i =>
          result += ad(aoff) * b(i)
          aoff += a.stride
        }
        result
      }
      implicitly[BinaryRegistry[Vector[Double], Vector[Double], OpMulInner.type, Double]].register(this)
    }
  }

  // ...
  // goes on much longer
If saddle wants to refactor its BinOps implicit instances, I'd consider using this expand annotation from breeze, which is available as a separate dependency. That would provide a unified solution for Scala 2 and 3, with performance on par with the current macro-based one. Saddle could introduce these new "expanded implicits" without breaking binary compatibility.
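To make that concrete, the usage side in breeze looks roughly like the sketch below. This is paraphrased from memory of DenseVectorOps.scala and uses breeze's own types (DenseVector, OpAdd, OpSub, OpMulScalar, OpDiv, and the @expand / @expand.args / @expand.sequence / @expand.valify annotations from the breeze-macros module); the exact signatures in breeze may differ, and a saddle adaptation would substitute its own Mat/Vec and BinOp/ScalarOp hierarchy.

import breeze.linalg._
import breeze.linalg.operators._
import breeze.macros.expand

trait ExpandedScalarOps {
  // One annotated method; the codegen plugin expands it into a concrete,
  // monomorphic implicit per (element type, operator) combination.
  @expand
  @expand.valify
  implicit def impl_Op_DV_S_eq_DV[
      @expand.args(Int, Long, Double) T,
      @expand.args(OpAdd, OpSub, OpMulScalar, OpDiv) Op <: OpType
  ](implicit
      @expand.sequence[Op]({ _ + _ }, { _ - _ }, { _ * _ }, { _ / _ })
      op: Op.Impl2[T, T, T]
  ): Op.Impl2[DenseVector[T], T, DenseVector[T]] =
    new Op.Impl2[DenseVector[T], T, DenseVector[T]] {
      def apply(a: DenseVector[T], b: T): DenseVector[T] = {
        val result = DenseVector.zeros[T](a.length)
        val ad = a.data
        var aoff = a.offset
        var i = 0
        while (i < a.length) {
          result(i) = op(ad(aoff), b)
          aoff += a.stride
          i += 1
        }
        result
      }
    }
}

The appeal is that a single source definition stays readable while every generated instance is monomorphic, so there is no shared virtual call site whose type profile can get polluted.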
Thanks!
I agree this is a good way forward.
I can probably find some time for this during the summer, or let me know if you will contribute this refactor to saddle. No rush from my side; I don't plan to use Scala 3 this year.
Another possibility is to do nothing and hope that Scala 3 will one day have a TASTy-based whole-program optimizer.
Hard to believe that ad hoc template polymorphism (which is what the scalameta route is) would be the recommended way. If we are in a pinch, then out of the routes considered it is the best, but globally it is a step backward.
I found some rumors that the JVM will have automatic specialization in the medium term. This should be weighed when considering the code generation route.
https://github.com/lampepfl/dotty/issues/11114#issuecomment-1019956981
HotSpot has support for auto-vectorization: the JVM will identify certain looping patterns on arrays and generate vectorized (SIMD) instructions.
Investigate whether saddle triggers these optimizations by inspecting the produced x86_64 assembly. Good candidates for the investigation are the binary operations implemented in VecSclrElemOp, VecVecElemOp, etc.
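As a reference point for that investigation, the loop shape HotSpot's SuperWord pass typically recognizes is a plain counted loop over arrays whose iterations are independent; a minimal illustrative example (not taken from saddle):

// Element-wise scalar multiply over a raw array: a counted loop with
// independent iterations, the pattern the auto-vectorizer turns into SIMD.
def scaleInPlace(a: Array[Double], b: Double): Unit = {
  var i = 0
  while (i < a.length) {
    a(i) = a(i) * b
    i += 1
  }
}

If the compiled loop of VecSclrElemOp shows packed instructions in the disassembly (e.g. vmulpd on AVX), the optimization triggered; a scalar-only loop body suggests something, typically a virtual call inside the loop, prevented it.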