pityka / saddle

SADDLE: Scala Data Library
https://pityka.github.io/saddle/
Apache License 2.0

Loop optimizations investigation #428

Open tnielens opened 2 years ago

tnielens commented 2 years ago

Hotspot has support for auto-vectorization. That is, the jvm will identify certain looping patterns on arrays and generate vectorized instructions (SIMD).

Investigate whether saddle triggers these optimizations by inspecting the produced x86_64 assembly. Good candidates for the investigation are the binary operations implemented in VecSclrElemOp, VecVecElemOp, etc.
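For reference, the shape HotSpot's superword pass looks for is a simple counted loop over primitive arrays with no cross-iteration dependency, roughly like this (a sketch for illustration, not taken from the saddle sources):

def addScalarInPlace(a: Array[Double], b: Double): Unit = {
  // straight-line element-wise update over a primitive array; HotSpot can
  // typically compile the body to SIMD instructions when the bound and
  // stride are this simple
  var i = 0
  while (i < a.length) {
    a(i) += b
    i += 1
  }
}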

pityka commented 2 years ago

See https://github.com/pityka/saddle/commit/bc8f2dedcf82424e6e00960837f1032fb0ad83b5 on how to inspect the machine instructions.
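One way to do this from JMH (a sketch; -XX:+PrintAssembly only produces output when the hsdis disassembler library is installed next to the JVM) is to pass the standard HotSpot diagnostic flags to the forked benchmark JVM:

import org.openjdk.jmh.annotations._

// Ask the forked JVM to print the JIT-compiled assembly. Both flags are
// standard HotSpot diagnostic options; without hsdis they print nothing useful.
@Fork(
  value = 1,
  jvmArgsAppend = Array("-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintAssembly")
)
class DotBenchWithAssembly // hypothetical wrapper class around the existing benchmark methods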

Initial impressions:

pityka commented 2 years ago

Pushed a commit with updated jmh runs.

@tnielens You asked a few weeks ago about those macro-based inlined ops. It turns out there are some jmh benchmarks in the tree for them. In some of these benchmarks the macros are on par with the raw arrays while the regular ops are much slower; in other runs all three implementations were similar.

Also, I am not sure how to interpret the dot product results. I expected the manual unroll to be the slowest, as it supposedly does not trigger the auto-vectorization (following the linked presentation).

[info] Benchmark                                   (size)   Mode  Cnt         Score         Error  Units
[info] DiagOuterMBench.blas                         10000  thrpt    6       344.898 ±       7.183  ops/s
[info] DiagOuterMBench.java                         10000  thrpt    6       199.842 ±       2.361  ops/s
[info] DiagOuterMBench.java_unroll                  10000  thrpt    6       342.964 ±       6.161  ops/s
[info] DotBench.blas                                  500  thrpt    6   4513031.226 ±  717923.742  ops/s
[info] DotBench.blas                                 1000  thrpt    6   3131796.506 ±  176731.660  ops/s
[info] DotBench.blas                                10000  thrpt    6    272513.086 ±   42022.945  ops/s
[info] DotBench.raw                                   500  thrpt    6   2062884.020 ±   48700.200  ops/s
[info] DotBench.raw                                  1000  thrpt    6   1028810.536 ±   33906.870  ops/s
[info] DotBench.raw                                 10000  thrpt    6    101922.090 ±    3294.607  ops/s
[info] DotBench.raw_unrolled                          500  thrpt    6   4120157.830 ±   67755.724  ops/s
[info] DotBench.raw_unrolled                         1000  thrpt    6   2122200.069 ±   49000.282  ops/s
[info] DotBench.raw_unrolled                        10000  thrpt    6    205568.475 ±    4060.935  ops/s
[info] FourInPlaceScalarLongOpBench.array              10  thrpt    6    828314.792 ±   10742.271  ops/s
[info] FourInPlaceScalarLongOpBench.array             100  thrpt    6      8377.487 ±     495.774  ops/s
[info] FourInPlaceScalarLongOpBench.array           10000  thrpt    6         0.772 ±       0.014  ops/s
[info] FourInPlaceScalarLongOpBench.saddleInlined      10  thrpt    6    839221.606 ±   13943.735  ops/s
[info] FourInPlaceScalarLongOpBench.saddleInlined     100  thrpt    6      8104.838 ±     249.580  ops/s
[info] FourInPlaceScalarLongOpBench.saddleInlined   10000  thrpt    6         0.807 ±       0.025  ops/s
[info] FourInPlaceScalarLongOpBench.saddleVirtual      10  thrpt    6    407474.259 ±    8826.223  ops/s
[info] FourInPlaceScalarLongOpBench.saddleVirtual     100  thrpt    6      3569.076 ±    1871.019  ops/s
[info] FourInPlaceScalarLongOpBench.saddleVirtual   10000  thrpt    6         0.288 ±       0.009  ops/s
[info] FourInPlaceScalarOpBench.array                  10  thrpt    6   3365232.709 ±  148713.142  ops/s
[info] FourInPlaceScalarOpBench.array                 100  thrpt    6     31907.610 ±     759.861  ops/s
[info] FourInPlaceScalarOpBench.array               10000  thrpt    6         2.041 ±       0.045  ops/s
[info] FourInPlaceScalarOpBench.saddleInlined          10  thrpt    6   3222275.123 ±   87499.198  ops/s
[info] FourInPlaceScalarOpBench.saddleInlined         100  thrpt    6     31790.690 ±     471.759  ops/s
[info] FourInPlaceScalarOpBench.saddleInlined       10000  thrpt    6         2.044 ±       0.043  ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual          10  thrpt    6    735414.566 ±   13669.387  ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual         100  thrpt    6      6774.385 ±    3574.689  ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual       10000  thrpt    6         0.437 ±       0.015  ops/s
[info] SingleInPlaceScalarOpBench.array                10  thrpt    6  35472703.133 ± 1859036.673  ops/s
[info] SingleInPlaceScalarOpBench.array               100  thrpt    6    344887.796 ±    6034.971  ops/s
[info] SingleInPlaceScalarOpBench.array             10000  thrpt    6        11.812 ±       0.998  ops/s
[info] SingleInPlaceScalarOpBench.saddleInlined        10  thrpt    6  34920249.527 ± 1743154.820  ops/s
[info] SingleInPlaceScalarOpBench.saddleInlined       100  thrpt    6    350826.518 ±    4471.170  ops/s
[info] SingleInPlaceScalarOpBench.saddleInlined     10000  thrpt    6        11.523 ±       0.163  ops/s
[info] SingleInPlaceScalarOpBench.saddleVirtual        10  thrpt    6  34390962.439 ± 2473616.086  ops/s
[info] SingleInPlaceScalarOpBench.saddleVirtual       100  thrpt    6    351251.137 ±    2994.687  ops/s
[info] SingleInPlaceScalarOpBench.saddleVirtual     10000  thrpt    6        11.508 ±       0.383  ops/s
[info] TwoInPlaceScalarOpBench.array                   10  thrpt   10  21418287.695 ±  710209.148  ops/s
[info] TwoInPlaceScalarOpBench.array                  100  thrpt   10    173638.571 ±    3125.913  ops/s
[info] TwoInPlaceScalarOpBench.array                10000  thrpt   10         5.823 ±       0.115  ops/s
[info] TwoInPlaceScalarOpBench.saddleInlined           10  thrpt   10  19015864.107 ±  296979.926  ops/s
[info] TwoInPlaceScalarOpBench.saddleInlined          100  thrpt   10    175873.725 ±    5163.732  ops/s
[info] TwoInPlaceScalarOpBench.saddleInlined        10000  thrpt   10         6.166 ±       0.129  ops/s
[info] TwoInPlaceScalarOpBench.saddleVirtual           10  thrpt   10  20191119.494 ±  364440.585  ops/s
[info] TwoInPlaceScalarOpBench.saddleVirtual          100  thrpt   10    179552.482 ±    3896.277  ops/s
[info] TwoInPlaceScalarOpBench.saddleVirtual        10000  thrpt   10         6.208 ±       0.042  ops/s

JDK 17.0.2, older Intel CPU with AVX1
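For context on the manual unroll: the raw and raw_unrolled dot products above differ roughly as follows (a sketch of the pattern; the actual DotBench sources may differ). The unrolled variant keeps four independent accumulators, which shortens the floating-point dependency chain and can pay off even when no SIMD code is generated:

def dotPlain(a: Array[Double], b: Array[Double]): Double = {
  // single accumulator: every addition waits for the previous one
  var s = 0.0
  var i = 0
  while (i < a.length) {
    s += a(i) * b(i)
    i += 1
  }
  s
}

def dotUnrolled4(a: Array[Double], b: Array[Double]): Double = {
  // four independent accumulators, combined once at the end
  var s0, s1, s2, s3 = 0.0
  var i = 0
  val n = a.length - a.length % 4
  while (i < n) {
    s0 += a(i) * b(i)
    s1 += a(i + 1) * b(i + 1)
    s2 += a(i + 2) * b(i + 2)
    s3 += a(i + 3) * b(i + 3)
    i += 4
  }
  while (i < a.length) { // leftover tail elements
    s0 += a(i) * b(i)
    i += 1
  }
  s0 + s1 + s2 + s3
}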

tnielens commented 2 years ago

Awesome!

Great post you linked: https://blogs.oracle.com/javamagazine/post/java-hotspot-hsdis-disassembler

Good to know about the existing benchmarks. I want to have a look at them.

tnielens commented 2 years ago

Some remarks about FourInPlaceScalarOpBench and why saddleVirtual performs worse. Note that TwoInPlaceScalarOpBench performs equally well; the deterioration starts once three different operators are applied in sequence. This is probably due to optimistic type-profile optimizations that get deoptimized once the VM observes the third different operator. In MatMatElemOp.apply, the wrapped scala binary operator op is invoked through a virtual call. If the VM encounters only a single subtype (or two) at that call site, it generates code optimized for it, enabling inlining and vectorization. This is described in the openjdk wiki: "Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types)."
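Schematically, the hot loop dispatches through the wrapped op on every element (mirroring the in-place op bodies shown further down in this thread):

// v1a: Array[Double], v2: Double, op: BinOp[...] -- schematic only
var i = 0
while (i < sz) {
  // with one or two observed BinOp subtypes this call is inlined and the loop
  // can be vectorized; once a third subtype is seen, the optimized code is
  // discarded and the site stays a true virtual call
  v1a(i) = op(v1a(i), v2)
  i += 1
}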

This can be observed with this benchmark:

import org.openjdk.jmh.annotations._
import org.saddle._

@State(Scope.Benchmark)
// important: no forking, so every benchmark method below runs in the same JVM
// and feeds the same type profile at the op call site
@Fork(0)
@Threads(1)
class FourInPlaceScalarOpBench {
  @Param(Array("100"))
  var size: Int = _

  var m1: Mat[Double] = _
  var b: Double = _

  @Setup(Level.Iteration)
  def setup() = {
    m1 = mat.randn(size, size)
    b = scala.util.Random.nextDouble()
  }
  @Benchmark
  def saddleVirtual1Add(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 += b
    m1
  }
  @Benchmark
  def saddleVirtual1Mul(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 *= b
    m1
  }
  @Benchmark
  def saddleVirtual1Sub(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 -= b
    m1
  }
  @Benchmark
  def saddleVirtual2Add(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 += b
    m1
  }
  @Benchmark
  def saddleVirtual2Mul(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 *= b
    m1
  }

  @Benchmark
  def saddleVirtual2Sub(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 -= b
    m1
  }
}

Results:

[info] Benchmark                                   (size)   Mode  Cnt       Score   Error  Units
[info] FourInPlaceScalarOpBench.saddleVirtual1Add     100  thrpt    2  913836.177          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Mul     100  thrpt    2  949849.358          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Sub     100  thrpt    2   27578.577          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Add     100  thrpt    2   27544.876          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Mul     100  thrpt    2   17066.358          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Sub     100  thrpt    2   27697.968          ops/s

This behavior is also described in one of the jmh samples.

pityka commented 2 years ago

Very interesting. Thank you! Unfortunately the Op is very likely to be strongly polymorphic in real-world client code, i.e. that virtual call will be deoptimized unless it is inlined by the macro.

I will open a PR to document this.

Another possibility is to rework the saddle types to eliminate the Op interface. However this would likely lead to the same inlined code as in the macro module.

I would not 100% oppose simplifying this part of saddle, that is removing the whole BinOp abstraction machinery.

tnielens commented 2 years ago

Adding extra @inline annotations seems to avoid the megamorphic deoptimizations. The loop is then fully inlined at each call site and optimized separately. Adding the annotations to the following methods brings regular BinOps on par with hand-written while loops and with the macros.

// BinOpMat
  @inline
  implicit def MatSclrElmOpIpDDD[Op <: ScalarOp](implicit
      op: BinOp[Op, Double, Double, Double]
  )
  final class MatSclrElemOp[
      OP <: ScalarOp,
      @spec(Boolean, Int, Long, Double) A,
      @spec(Boolean, Int, Long, Double) B,
      @spec(Boolean, Int, Long, Double) C: ST
  ](val op: BinOp[OP, A, B, C])
      extends BinOp[OP, Mat[A], B, Mat[C]] {

    @inline def apply(v1: Mat[A], v2: B) = {
    ...

 // NumericOps
  @inline    
  final def +=[B](
      other: B
  )(implicit op: BinOpInPlace[Add, This, B]): Unit =
    op(this, other)
  @inline 
  final def -=[B](
      other: B
  )(implicit op: BinOpInPlace[Subtract, This, B]): Unit =
    op(this, other)
// 

Results:

[info] Benchmark                                   (size)   Mode  Cnt        Score   Error  Units
[info] FourInPlaceScalarOpBench.saddleVirtual1Add     100  thrpt       1004988.957          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Div     100  thrpt        212128.273          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Mul     100  thrpt        998022.449          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual1Sub     100  thrpt        991270.099          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Add     100  thrpt        994354.111          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Div     100  thrpt        211754.855          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Mul     100  thrpt        998168.775          ops/s
[info] FourInPlaceScalarOpBench.saddleVirtual2Sub     100  thrpt        973682.103          ops/s

That said, the Scala optimizer's inlining must be enabled explicitly in user code for these annotations to take effect. Details are here, and there are some caveats with binary compatibility. Libraries based on saddle should not enable inlining.
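For illustration, this is roughly what the final application would add to its build (a sketch, assuming sbt and a Scala 2.13 compiler that accepts the unified -opt syntax used in this thread):

// build.sbt of the end-user application (not of libraries published on top of saddle)
Compile / scalacOptions ++= Seq(
  "-opt:local",     // enable method-local optimizations
  "-opt:inline:**"  // allow the inliner to inline from any package, including org.saddle
)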

pityka commented 2 years ago

Thank you for the investigation. If I understand correctly, that means that if we sprinkle enough annotations and the end user of the final application adds -opt:inline:** (and potentially -opt:local), then the megamorphic call sites will be inlined. I would vote for this solution over the macros. The only thing is that we need a suite to ensure this actually happens.

What is your suggestion on how we should proceed? Should we first make a benchmarking suite where all operations and methods are checked against a hand-written array-based implementation, and use that as a guide? (I don't propose to include that in the CI.)
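To make the proposal concrete, one entry in such a suite could look like the sketch below (the class name ScalarAddParityBench is hypothetical; it follows the pattern of the existing benchmarks in the tree):

import org.openjdk.jmh.annotations._
import org.saddle._

@State(Scope.Benchmark)
class ScalarAddParityBench {
  @Param(Array("100", "10000"))
  var size: Int = _

  var m1: Mat[Double] = _
  var b: Double = _

  @Setup(Level.Iteration)
  def setup() = {
    m1 = mat.randn(size, size)
    b = scala.util.Random.nextDouble()
  }

  // the saddle operation under test
  @Benchmark
  def saddle(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 += b
    m1
  }

  // hand-written array baseline the saddle score is compared against
  @Benchmark
  def rawArray(): Array[Double] = {
    val a = m1.toArray
    var i = 0
    while (i < a.length) {
      a(i) += b
      i += 1
    }
    a
  }
}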

tnielens commented 2 years ago

I've been trying out other options. Here is an attempt at using @inline annotations only internally, shielding the user from that complexity. It goes a bit in the direction of the macro-based implementation.

  // Binary element-wise operation on one Mat and one scalar
  @inline
  private def matSclrElemOpIpBody[
      OP <: ScalarOp,
      @spec(Int, Long, Double) A,
      @spec(Int, Long, Double) B
  ](v1: Mat[A], v2: B, op: BinOp[OP, A, B, A]): Unit = {
    val sz = v1.length
    val v1a = v1.toArray
    var i = 0
    while (i < sz) {
      v1a(i) = op(v1a(i), v2)
      i += 1
    }
  }

  implicit def matSclrElemOpIpAdd[
      @spec(Int, Long, Double) A,
      @spec(Int, Long, Double) B
  ](implicit op: BinOp[Add, A, B, A]): BinOpInPlace[Add, Mat, A, B] =
    matSclrElemOpIpBody(_, _, op)
  implicit def matSclrElemOpIpSub[
      @spec(Int, Long, Double) A,
      @spec(Int, Long, Double) B
  ](implicit op: BinOp[Subtract, A, B, A]): BinOpInPlace[Subtract, Mat, A, B] =
    matSclrElemOpIpBody(_, _, op)
  implicit def matSclrElemOpIpMul[
      @spec(Int, Long, Double) A,
      @spec(Int, Long, Double) B
  ](implicit op: BinOp[Multiply, A, B, A]): BinOpInPlace[Multiply, Mat, A, B] =
    matSclrElemOpIpBody(_, _, op)
  implicit def matSclrElemOpIpDiv[
      @spec(Int, Long, Double) A,
      @spec(Int, Long, Double) B
  ](implicit op: BinOp[Divide, A, B, A]): BinOpInPlace[Divide, Mat, A, B] =
    matSclrElemOpIpBody(_, _, op)
    // ... etc

The idea is to inline the body of the while loop once per operation to avoid megamorphic call sites. Specialization takes care of duplicating the loop further for each specialization combination. This works fine, with one annoyance: it requires raising -XX:MaxInlineLevel to ~13 (the default is 9).
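For the benchmarks, the flag can be applied per fork, e.g. (a sketch; -XX:MaxInlineLevel is a standard HotSpot option, and jvmArgsAppend only takes effect when the benchmark actually forks):

import org.openjdk.jmh.annotations._

// raise the JIT inlining depth for the forked benchmark JVM only
@Fork(value = 1, jvmArgsAppend = Array("-XX:MaxInlineLevel=13"))
class FourInPlaceScalarOpBench {
  // ... benchmark methods as above
}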

pityka commented 2 years ago

In this latest attempt, do you compile saddle with -opt:inline:org.saddle.**?

tnielens commented 2 years ago

In this latest attempt, do you compile saddle with -opt:inline:org.saddle.**?

Yes. That's already part of the build config. I didn't need to add anything.

pityka commented 2 years ago

Out of the three alternatives (macro, inline later, inline within saddle), which one do you prefer? If we can solve this problem with inlining then I would remove the macro. I don't know whether there is any benefit to inlining later, when the app is compiled, e.g. smaller bytecode size?

tnielens commented 2 years ago

I'm hesitant on this. Both @inline and macros are dropped in Scala 3. I don't think we should set out on a large refactoring and rework important bits on top of deprecated features. I'm not so keen on my @inline-based proposals above, because they rely on JVM optimizations or JIT options that saddle can't test properly. I plan to spend some time understanding what breeze did. They seem to have used macros in the past and now have a scalameta-based solution for Scala 2/3 cross-building.

tnielens commented 2 years ago

@pityka breeze has quite an elegant solution with their macro (turned into a scalameta codegen plugin for cross-compiling Scala 2 and 3). The expand annotation is documented here; see a usage example of the annotation in DenseVectorOps.scala and some general info in Develop.md as well.

Here is an extract of the code generation output of DenseVectorOps.scala.

trait DenseVector_Vector_ExpandOps extends VectorOps with DenseVector_TraversalOps with DenseVector_SlicingOps {
  implicit val impl_OpMulInner_DV_V_eq_S_Int: breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Int], Vector[Int], Int] = {
    new breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Int], Vector[Int], Int] {
      def apply(a: DenseVector[Int], b: Vector[Int]) = {
        require(b.length == a.length, "Vectors must be the same length!")
        val ad = a.data
        var aoff = a.offset
        var result: Int = 0
        cforRange(0 until a.length) { i => 
          result += ad(aoff) * b(i)
          aoff += a.stride
        }
        result
      }
      implicitly[BinaryRegistry[Vector[Int], Vector[Int], OpMulInner.type, Int]].register(this)
    }
  }
  implicit val impl_OpMulInner_DV_V_eq_S_Double: breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Double], Vector[Double], Double] = {
    new breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Double], Vector[Double], Double] {
      def apply(a: DenseVector[Double], b: Vector[Double]) = {
        require(b.length == a.length, "Vectors must be the same length!")
        val ad = a.data
        var aoff = a.offset
        var result: Double = 0.0d
        cforRange(0 until a.length) { i => 
          result += ad(aoff) * b(i)
          aoff += a.stride
        }
        result
      }
      implicitly[BinaryRegistry[Vector[Double], Vector[Double], OpMulInner.type, Double]].register(this)
    }
  }
  // ...
  // goes on much longer

If saddle wants to refactor its BinOps implicit instances, I'd consider using this expand annotation from breeze, which is available as a separate dependency. That would provide a unified solution for Scala 2 and 3, with performance on par with the current macro-based one. Saddle could introduce these new "expanded implicits" without breaking binary compatibility.
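For orientation, the input side of that generated extract looks roughly like this in breeze (paraphrased from DenseVectorOps.scala; the exact annotation API and imports should be checked against the breeze sources before adopting it in saddle):

import breeze.macros._

// @expand duplicates the definition once per element of the @expand.args list,
// substituting the matching literal from @expand.sequence for `zero`;
// @expand.valify turns the resulting defs into vals.
@expand
@expand.valify
implicit def impl_OpMulInner_DV_V_eq_S[@expand.args(Int, Double, Float, Long) T](implicit
    @expand.sequence[T](0, 0.0, 0.0f, 0L) zero: T
): OpMulInner.Impl2[DenseVector[T], Vector[T], T] =
  new OpMulInner.Impl2[DenseVector[T], Vector[T], T] {
    def apply(a: DenseVector[T], b: Vector[T]): T = {
      require(b.length == a.length, "Vectors must be the same length!")
      val ad = a.data
      var aoff = a.offset
      var result: T = zero
      cforRange(0 until a.length) { i =>
        result += ad(aoff) * b(i)
        aoff += a.stride
      }
      result
    }
  }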

pityka commented 2 years ago

Thanks!

I agree this is a good way forward.

I can probably find some time for this during the summer, or let me know if you want to contribute this refactor to saddle. No rush from my side; I don't plan to use Scala 3 this year.

pityka commented 2 years ago

Another possibility is to do nothing and hope that Scala 3 will have a TASTy-based whole-program optimizer one day.

It is hard to believe that ad hoc template polymorphism (which is what the scalameta route amounts to) would be the recommended way. If we are in a pinch, then out of the routes considered it is the best, but globally it is a step backward.

pityka commented 2 years ago

I found some rumors that the JVM will get automatic specialization in the medium term. This should be weighed when considering the code generation route.

https://github.com/lampepfl/dotty/issues/11114#issuecomment-1019956981