Open guicho271828 opened 5 years ago
100x100 single-float matrix gemm, 1000 fold loop. Benchmarking code is in https://github.com/numcl/numcl-benchmarks
version | runtime (sec) | comment |
---|---|---|
Numpy einsum gemm ("ij,jk->ik" ) |
0.4074 | |
Numpy BLAS gemm | 0.0831 | |
Numcl einsum gemm ('(ij jk -> ik) ) |
||
@ b6f2060 | 42.211 | naive matrix element access thuru array header |
@ bb14032 | 25.244 | lift array load/store outside the loops |
@ 3ce03b0 | 3.883 | naive row-major access e.g. (+ j (* i N)) . Ensure unboxed calc |
@ 1b6c02d | 2.848 | Decompose and lift the row-major index computation and remove redundancy e.g. compute (*i N) early. HACK |
@ 3ce03b0 | 1.585 | Reimplement above technique more seriously: represent it as an SSA and perform topological sort |
@ e083166 | 1.093 | Avoid MUL in index calculation (* i N) by increasing i by steps (i.e. by addition) |
@ e2e72b9 | 0.74 | unrolling the bottom loop (8-fold) |
Math functions are now rewritten in Einsum and takes 1x-2x runtime as numpy. common-lisp backend.
title numcl numpy cl/py
5math/acos/0 0.8190 0.4613 1.775
5math/asin/0 0.9410 0.4186 2.248
5math/atan/0 0.6320 0.5309 1.19
5math/cos/0 0.7640 0.3742 2.042
5math/cosh/0 0.5080 0.4545 1.118
5math/exp/0 0.6890 0.6537 1.054
5math/log/0 1.3540 0.5283 2.563
5math/sin/0 0.9670 0.7887 1.226
5math/sinh/0 0.8200 0.5623 1.458
5math/tan/0 0.6740 0.5237 1.287
5math/tanh/0 0.4690 0.3997 1.173
The first step toward optimization is to know where you are now.