numcl / numcl

Numpy clone in Common Lisp
https://numcl.github.io/numcl/
Other
639 stars 32 forks source link

Benchmarking & optimization #4

Open guicho271828 opened 5 years ago

guicho271828 commented 5 years ago

The first step toward optimization is to know where you are now.

guicho271828 commented 5 years ago

Einsum GEMM benchmark

100x100 single-float matrix gemm, 1000 fold loop. Benchmarking code is in https://github.com/numcl/numcl-benchmarks

version runtime (sec) comment
Numpy einsum gemm ("ij,jk->ik") 0.4074
Numpy BLAS gemm 0.0831
Numcl einsum gemm ('(ij jk -> ik))
@ b6f2060 42.211 naive matrix element access thuru array header
@ bb14032 25.244 lift array load/store outside the loops
@ 3ce03b0 3.883 naive row-major access e.g. (+ j (* i N)). Ensure unboxed calc
@ 1b6c02d 2.848 Decompose and lift the row-major index computation and remove redundancy e.g. compute (*i N) early. HACK
@ 3ce03b0 1.585 Reimplement above technique more seriously: represent it as an SSA and perform topological sort
@ e083166 1.093 Avoid MUL in index calculation (* i N) by increasing i by steps (i.e. by addition)
@ e2e72b9 0.74 unrolling the bottom loop (8-fold)
guicho271828 commented 5 years ago

Math functions are now rewritten in Einsum and takes 1x-2x runtime as numpy. common-lisp backend.

title                       numcl    numpy    cl/py
5math/acos/0                0.8190   0.4613   1.775
5math/asin/0                0.9410   0.4186   2.248
5math/atan/0                0.6320   0.5309   1.19
5math/cos/0                 0.7640   0.3742   2.042
5math/cosh/0                0.5080   0.4545   1.118
5math/exp/0                 0.6890   0.6537   1.054
5math/log/0                 1.3540   0.5283   2.563
5math/sin/0                 0.9670   0.7887   1.226
5math/sinh/0                0.8200   0.5623   1.458
5math/tan/0                 0.6740   0.5237   1.287
5math/tanh/0                0.4690   0.3997   1.173