lessthanoptimal / ejml

A fast and easy to use linear algebra library written in Java for dense, sparse, real, and complex matrices.
https://ejml.org

Significant performance degradation for matrix multiplication compared to ND4J for bigger matrices #197

Open anatoliy-balakirev opened 8 months ago

anatoliy-balakirev commented 8 months ago

Hi,

We recently explored different linear algebra libraries for our project. EJML showed really good results and we went with it originally, but then we noticed a significant performance degradation on bigger matrices. I've created a sample project here: https://github.com/anatoliy-balakirev/ejml-nd4j-benchmark which is basically a small JMH benchmark running matrix multiplication using EJML and ND4J (https://github.com/deeplearning4j/deeplearning4j). I used the following command line (which may be a bit naive, as there is only one warmup run and 3 iterations, but it should be good enough to highlight the issue):

./mvnw jmh:benchmark -Djmh.f=1 -Djmh.wi=1 -Djmh.i=3 -Djmh.bm=avgt

The results are as follows:

Benchmark                                               (matrixDimensions)  Mode  Cnt    Score    Error  Units
MatrixOperationBenchmark.testMatrixMultiplicationEjml   155x9441;9441x9441  avgt    3    2.410 ±  0.817   s/op
MatrixOperationBenchmark.testMatrixMultiplicationEjml  3000x3000;3000x3000  avgt    3    3.843 ±  3.063   s/op
MatrixOperationBenchmark.testMatrixMultiplicationEjml  3300x3300;3300x3300  avgt    3    5.089 ±  1.766   s/op
MatrixOperationBenchmark.testMatrixMultiplicationEjml  3500x3500;3500x3500  avgt    3    6.314 ±  4.315   s/op
MatrixOperationBenchmark.testMatrixMultiplicationEjml  4000x4000;4000x4000  avgt    3    9.395 ±  1.378   s/op
MatrixOperationBenchmark.testMatrixMultiplicationEjml  9441x9441;9441x9441  avgt    3  133.552 ± 92.515   s/op

MatrixOperationBenchmark.testMatrixMultiplicationNd4J   155x9441;9441x9441  avgt    3    0.680 ±  0.511   s/op
MatrixOperationBenchmark.testMatrixMultiplicationNd4J  3000x3000;3000x3000  avgt    3    0.661 ±  0.396   s/op
MatrixOperationBenchmark.testMatrixMultiplicationNd4J  3300x3300;3300x3300  avgt    3    0.793 ±  0.889   s/op
MatrixOperationBenchmark.testMatrixMultiplicationNd4J  3500x3500;3500x3500  avgt    3    0.890 ±  0.573   s/op
MatrixOperationBenchmark.testMatrixMultiplicationNd4J  4000x4000;4000x4000  avgt    3    1.301 ±  0.620   s/op
MatrixOperationBenchmark.testMatrixMultiplicationNd4J  9441x9441;9441x9441  avgt    3   13.279 ±  4.434   s/op

As you can see, at those sizes EJML is 6-10 times slower than ND4J. At smaller sizes (the commented-out cases, which you can uncomment and try yourself: https://github.com/anatoliy-balakirev/ejml-nd4j-benchmark/blob/main/src/test/java/benchmark/MatrixOperationBenchmark.java#L88-L108) EJML is actually faster than ND4J.
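As a rough sanity check on the numbers above (a dense n×n by n×n multiply costs about 2·n³ floating-point operations), the 9441×9441 rows translate into an estimated effective throughput per backend. This is only back-of-the-envelope arithmetic over the reported times, not a measurement:

```java
// Rough throughput estimate derived from the benchmark table above.
// A dense n x n by n x n multiply costs ~2*n^3 floating-point operations.
class GflopsCheck {
    // Estimated GFLOPS for an n x n * n x n multiply that took 'seconds'.
    static double gflops(long n, double seconds) {
        double flops = 2.0 * n * n * n;
        return flops / seconds / 1e9;
    }

    public static void main(String[] args) {
        // Times taken from the 9441x9441;9441x9441 rows of the table above.
        System.out.printf("EJML: ~%.1f GFLOPS%n", gflops(9441, 133.552)); // ~12.6
        System.out.printf("ND4J: ~%.1f GFLOPS%n", gflops(9441, 13.279));  // ~126.7
    }
}
```

The roughly 10x throughput gap at this size is what one would expect from a single-threaded pure-Java kernel competing with a multithreaded native BLAS, which is what ND4J delegates to.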

The full log (where you can also see some hardware details, logged by ND4J) is here: benchmark.log

For now we ended up using EJML up to a certain matrix multiplication size and then switching to ND4J (we have a lot of matrices of those bigger sizes, so the execution time adds up). Is there any way to bring EJML's performance on par with ND4J at those sizes, or is this roughly the maximum we can get from a pure Java implementation?
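The switching scheme described above could be sketched as a simple cost-based dispatch. The threshold below is purely illustrative (not a measured crossover point) and `chooseBackend` only returns a backend name; in real code you would call the corresponding EJML or ND4J multiply there:

```java
// Sketch of dispatching to a backend based on estimated multiplication cost.
// THRESHOLD is a made-up illustrative value; tune it with a benchmark like
// the one in this issue before relying on it.
class MatMulDispatcher {
    // Estimated flop count of a (rowsA x colsA) * (colsA x colsB) multiply.
    static double flopEstimate(int rowsA, int colsA, int colsB) {
        return 2.0 * rowsA * (double) colsA * colsB;
    }

    // Hypothetical crossover point between the two backends.
    static final double THRESHOLD = 5e10;

    static String chooseBackend(int rowsA, int colsA, int colsB) {
        return flopEstimate(rowsA, colsA, colsB) < THRESHOLD ? "EJML" : "ND4J";
    }

    public static void main(String[] args) {
        System.out.println(chooseBackend(1000, 1000, 1000)); // small multiply -> EJML
        System.out.println(chooseBackend(9441, 9441, 9441)); // large multiply -> ND4J
    }
}
```

Using the flop estimate rather than a single dimension handles the non-square case (e.g. 155x9441 times 9441x9441) more gracefully, since the cost there is dominated by the product of all three dimensions.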