mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0
1.33k stars · 95 forks

speed vs numpy #522

Open CuriousCat-7 opened 2 years ago

CuriousCat-7 commented 2 years ago
import nimpy
import times
import arraymancer

var 
  tic, toc: float

# for math
let np = pyImport("numpy")
tic = cpuTime()
for i in 0..<200:
  discard np.sqrt(np.cos(np.sin(np.linspace(0, 10, 1000))))
toc = cpuTime()
echo "np time: ", toc - tic

tic = cpuTime()
for i in 0..<200:
  discard sqrt(cos(sin(arraymancer.linspace(0, 10, 1000))))
toc = cpuTime()
echo "arraymancer time: ", toc - tic

Shell and output:

 nim c -r npy                                                     
Hint: used config file '/home/neo/.choosenim/toolchains/nim-1.4.8/config/nim.cfg' [Conf]
Hint: used config file '/home/neo/.choosenim/toolchains/nim-1.4.8/config/config.nims' [Conf]
.................................................................................................................................................................................................................................CC: read
CC: write
CC: stdlib_times.nim
CC: stdlib_random.nim
CC: ../../../.nimble/pkgs/arraymancer-0.7.5/arraymancer/tensor/ufunc.nim
CC: npy.nim

Hint:  [Link]
Hint: 135625 lines; 1.764s; 180.102MiB peakmem; Debug build; proj: /home/neo/work/nim-projects/nim-learn/npy; out: /home/neo/work/nim-projects/nim-learn/npy [SuccessX]
Hint: /home/neo/work/nim-projects/nim-learn/npy  [Exec]
np time: 0.012997972
arraymancer time: 0.03080387

If it is compiled in release mode:

nim c -r -d:release npy

I get:

np time: 0.01219863
arraymancer time: 0.007503163999999993

Could I improve the speed further?

mratsim commented 2 years ago

You can fuse sqrt, cos and sin into a single pass over the data:

import nimpy
import times
import arraymancer

var
  tic, toc: float

# for math
let np = pyImport("numpy")
tic = epochTime()
for i in 0..<200:
  discard np.sqrt(np.cos(np.sin(np.linspace(0, 10, 1000))))
toc = epochTime()
echo "np time: ", toc - tic

tic = epochTime()
for i in 0..<200:
  discard sqrt(cos(sin(arraymancer.linspace(0, 10, 1000))))
toc = epochTime()
echo "arraymancer time: ", toc - tic

tic = epochTime()
for i in 0..<200:
  var t = arraymancer.linspace(0, 10, 1000)
  t.apply_inline:
    x.sin().cos().sqrt()
toc = epochTime()
echo "arraymancer fused time: ", toc - tic
$  nim c -d:danger --hints:off --warnings:off -r --outdir:build build/speedtest.nim 
np time: 0.009390830993652344
arraymancer time: 0.005604982376098633
arraymancer fused time: 0.004479646682739258
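If you want the fused result as a new tensor instead of mutating in place, Arraymancer also provides map_inline, which applies the same fused element-wise expression but allocates a fresh output. A minimal sketch (assuming arraymancer is installed; std/math is imported so sin/cos/sqrt are in scope for the scalar x):

```nim
import arraymancer
import std/math

# Build the input once.
let t = arraymancer.linspace(0, 10, 1000)

# map_inline fuses the whole chain into one pass over the data,
# but returns a new tensor and leaves `t` untouched
# (apply_inline would mutate `t` in place instead).
let fused = t.map_inline:
  x.sin().cos().sqrt()

echo fused[0]  # sqrt(cos(sin(0.0))) == 1.0
```

The trade-off is one extra allocation per call in exchange for keeping the input intact, which matters if the same tensor feeds several computations.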

Depending on the number of cores you have, compiling with -d:openmp might also accelerate this. Unfortunately I have 36 cores, and OpenMP doesn't handle contention well on the unfused code (not enough work per element).

$  nim c -d:openmp --hints:off --warnings:off -d:danger -r --outdir:build build/speedtest.nim 
np time: 0.009420156478881836
arraymancer time: 0.04207587242126465
arraymancer fused time: 0.005712270736694336

Note: for benchmarking, cpuTime may give you the wrong figures with parallel code, since it sums the CPU time consumed by all threads; epochTime (wall-clock time) is the safer choice, which is why the snippet above uses it.
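The difference between the two clocks is easy to demonstrate with the standard library alone: sleeping consumes wall-clock time but essentially no CPU time, so the two measurements diverge (with multi-threaded code the divergence goes the other way, cpuTime exceeding wall time). A minimal sketch:

```nim
import std/times
import std/os

# cpuTime counts CPU time used by this process (summed over its threads);
# epochTime measures elapsed wall-clock time.
let c0 = cpuTime()
let w0 = epochTime()

sleep(100)  # ~100 ms of wall time, but almost no CPU work

let cpuElapsed = cpuTime() - c0
let wallElapsed = epochTime() - w0

echo "cpu:  ", cpuElapsed    # near zero: sleeping burns no CPU
echo "wall: ", wallElapsed   # roughly 0.1 s
```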