mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Apache License 2.0

performance of gemm_strided vs numpy #23

Open timotheecour opened 5 years ago

timotheecour commented 5 years ago

python

time python $timn_D/tests/nim/all/t0147.py
1000.0
python $timn_D/tests/nim/all/t0147.py  5.26s user 0.13s system 293% cpu 1.840 total
import numpy as np
p=1000
a=np.ones((p,p))
b=np.ones((p,p))

for i in np.arange(100):
  c=np.matmul(a,b)

print(c[0,0])

laser

nim c -d:release -d:case2 $timn_D/src/timn/apps/laser.nim
time $timn_D/src/timn/apps/laser
1000.0
$timn_D/src/timn/apps/laser  5.35s user 0.03s system 99% cpu 5.405 total
import pkg/laser/primitives/matrix_multiplication/gemm

when defined(case2):
  proc test =
    # todo: different numbers
    let p1 = 1000
    let p2 = p1
    let p3 = p1

    type T = float

    var a = newSeq[T](p1 * p2)
    for i in 0..<a.len: a[i] = 1.0
    var b = newSeq[T](p2 * p3)
    for i in 0..<b.len: b[i] = 1.0
    var c = newSeq[T](p1 * p3)

    for i in 0..<100:
      gemm_strided(
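        p1, p3, p2, # M, N, K, assuming A is M x K, B is K x N, C is M x N; CHECKME: would be nice to document M,N,K in `gemm_strided`
        1.0,
        a[0].addr, p2, 1, # row-major A (p1 x p2): row stride is its column count
        b[0].addr, p3, 1, # row-major B (p2 x p3)

        0.0,
        c[0].addr, p3, 1, # row-major C (p1 x p3)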
      )
    # echo a
    # echo b
    echo c[0]

test()
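On the CHECKME about argument order: here is a minimal pure-Python sketch of what a strided GEMM computes, assuming the conventional BLAS reading where A is M x K, B is K x N and C is M x N, and each matrix is described by a row stride and a column stride into a flat buffer. `gemm_strided_ref` is a hypothetical reference written for illustration, not laser's implementation, and it is far too slow for benchmarking. It only makes the meaning of each argument explicit.

```python
def gemm_strided_ref(M, N, K, alpha,
                     A, row_a, col_a,
                     B, row_b, col_b,
                     beta,
                     C, row_c, col_c):
    """Compute C = alpha * A @ B + beta * C on flat buffers.

    Assumed shapes: A is M x K, B is K x N, C is M x N.
    Element (i, j) of a matrix lives at index i * row_stride + j * col_stride.
    """
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i * row_a + k * col_a] * B[k * row_b + j * col_b]
            idx = i * row_c + j * col_c
            C[idx] = alpha * acc + beta * C[idx]
    return C


# Small non-square case, so a swapped dimension or stride would show up.
M, N, K = 2, 3, 4
A = [1.0] * (M * K)   # row-major M x K: row stride K, column stride 1
B = [1.0] * (K * N)   # row-major K x N: row stride N, column stride 1
C = [0.0] * (M * N)   # row-major M x N: row stride N, column stride 1
gemm_strided_ref(M, N, K, 1.0, A, K, 1, B, N, 1, 0.0, C, N, 1)
print(C[0])  # 4.0, i.e. K, since every product term is 1.0
```

With square all-ones matrices, as in the benchmark above, every ordering of the three sizes gives the same answer, so the test cannot detect a wrong order. Using distinct sizes (the "todo: different numbers" note) would catch it.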
timotheecour commented 5 years ago

This needs to be compiled with -d:openmp -d:release.

After that, I still need to fix the compiler paths on my system to avoid this error: /bin/sh: /usr/local/bin/gcc-7: No such file or directory
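For a rough sense of scale, the wall-clock totals reported above can be turned into throughput figures. This is a back-of-the-envelope sketch, assuming the usual count of 2·p³ floating-point operations per p x p matrix product. The totals also include process startup and array initialization, so the results are approximate lower bounds. The helper name `gflops` is made up for this example.

```python
def gflops(p, iterations, seconds):
    """Approximate GFLOP/s for `iterations` products of two p x p matrices."""
    flops = 2.0 * p**3 * iterations  # one multiply and one add per inner step
    return flops / seconds / 1e9


# Wall-clock totals taken from the `time` output in the first comment.
numpy_rate = gflops(1000, 100, 1.840)
laser_rate = gflops(1000, 100, 5.405)
print(f"numpy: {numpy_rate:.1f} GFLOP/s, laser: {laser_rate:.1f} GFLOP/s")
```

The user times are nearly identical (5.26s vs 5.35s), while numpy ran at 293% CPU and the laser binary at 99%. That suggests most of the wall-clock gap comes from the laser build running on a single thread, which fits the missing -d:openmp flag.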