guicho271828 commented 5 years ago

The first step toward optimization is to know where you are now.

[x] Write a benchmark against numpy & julia
ideas
- [ ] broadcasting -- switch to the compilation based approach, similar to einsum
- einsum
  - [x] consider caching the fixed matrix element access
  - [ ] unrolling e.g. 1 2
  - [ ] blocking
- [ ] blas integration
- [ ] GPU integration (seamless)

guicho271828 commented 5 years ago

Einsum GEMM benchmark

100x100 single-float matrix gemm, 1000 fold loop. Benchmarking code is in https://github.com/numcl/numcl-benchmarks

version	runtime (sec)	comment
Numpy einsum gemm (`"ij,jk->ik"`)	0.4074
Numpy BLAS gemm	0.0831
Numcl einsum gemm (`'(ij jk -> ik)`)
@ b6f2060	42.211	naive matrix element access thuru array header
@ bb14032	25.244	lift array load/store outside the loops
@ 3ce03b0	3.883	naive row-major access e.g. `(+ j (* i N))`. Ensure unboxed calc
@ 1b6c02d	2.848	Decompose and lift the row-major index computation and remove redundancy e.g. compute `(*i N)` early. HACK
@ 3ce03b0	1.585	Reimplement above technique more seriously: represent it as an SSA and perform topological sort
@ e083166	1.093	Avoid MUL in index calculation `(* i N)` by increasing `i` by steps (i.e. by addition)
@ e2e72b9	0.74	unrolling the bottom loop (8-fold)

guicho271828 commented 5 years ago

Math functions are now rewritten in Einsum and takes 1x-2x runtime as numpy. common-lisp backend.

title                       numcl    numpy    cl/py
5math/acos/0                0.8190   0.4613   1.775
5math/asin/0                0.9410   0.4186   2.248
5math/atan/0                0.6320   0.5309   1.19
5math/cos/0                 0.7640   0.3742   2.042
5math/cosh/0                0.5080   0.4545   1.118
5math/exp/0                 0.6890   0.6537   1.054
5math/log/0                 1.3540   0.5283   2.563
5math/sin/0                 0.9670   0.7887   1.226
5math/sinh/0                0.8200   0.5623   1.458
5math/tan/0                 0.6740   0.5237   1.287
5math/tanh/0                0.4690   0.3997   1.173

numcl / numcl

Benchmarking & optimization #4

Einsum GEMM benchmark