mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

Julia benchmark includes compilation time & memory? #278


oxinabox commented 6 years ago

I'm not sure if you're aware of this, but the Julia benchmarks look like they include compilation time and the associated memory use, since Julia is a JIT-compiled language. (I can't be certain; I don't see the benchmark runner script, only the benchmark scripts themselves and xtime.rb.)

I'm not sure whether this is also true for Python + Numba. Nim, however, is always compiled ahead of time, right?

I imagine the Nim solution would also allocate a lot more memory if it included the compile step, which is normally done only once.

This may or may not be what you intended; it isn't clear to me.

In a real application benchmark, including that JIT compilation time may make sense.

Generally, on "microbenchmarks" you probably don't want to include compilation time, since the operation measured in a microbenchmark is expected to be one you perform many times, whereas the compilation happens only once.
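To make that concrete, here is a minimal sketch (matmul is a hypothetical stand-in, not the benchmark's actual code) that separates the first, compiling call from a subsequent warm call:

matmul(n) = rand(n, n) * rand(n, n)  # hypothetical stand-in for the benchmarked operation

@time matmul(500)  # first call: includes JIT compilation of matmul
@time matmul(500)  # second call: measures the operation alone

The gap between the two timings is the compilation overhead that a whole-process measurement like xtime.rb folds into every run.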

The normal Julia way of doing this benchmark, when working in Julia and comparing two implementations, would look more like:

using BenchmarkTools

function main(n)
  n = div(n, 2) * 2              # round n down to an even number
  a = rand(0:100_000_000, n, n)  # n x n matrix of random integers
  b = a
  c = a * b
  v = div(n, 2) + 1
  println(c[v, v])
end

function when_isMainModule()
  n = 100
  if length(ARGS) >= 1
    n = parse(Int, ARGS[1])
  end
  @btime main($n)
end

when_isMainModule()

@btime is from BenchmarkTools and also takes care of things like collecting multiple samples, ensuring CPU warmup, etc.

Running that (on my older computer), I get the more reasonable 4.218 s (19 allocations: 34.33 MiB)
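If more than the single @btime summary line is wanted, the @benchmark macro from the same package collects a full distribution of samples; a minimal sketch (note that main above prints on every sample, so this measures the multiplication directly):

using BenchmarkTools

a = rand(0:100_000_000, 100, 100)
trial = @benchmark $a * $a    # runs many timed samples
println(minimum(trial))       # best-case estimate
println(trial.memory, " bytes in ", trial.allocs, " allocations")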

However, the problem with this approach is that it only counts the memory allocated by the operation itself, whereas your current benchmarks also include the memory of the runtime environment, and Julia uses a fair bit of memory on its own without running anything.

Possibly, if one really cared about getting the benchmarks right, it would be best to do all the reporting from within each environment, by rewriting all 3 scripts to include local timing and memory-measurement code.
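On the Julia side, a minimal sketch of that in-environment reporting could use the standard @timed macro (assuming Julia >= 1.5, where it returns a named tuple; the Nim and Python scripts would need their own equivalents):

n = 100
main(n)                  # warm-up call so JIT compilation is excluded
stats = @timed main(n)   # named tuple with fields value, time, bytes, gctime, ...
println(stats.time, " s, ", stats.bytes, " bytes allocated, ",
        stats.gctime, " s in GC")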

I still expect Nim would win, since you have a great BLAS backend here for this, but the reported memory usage would be more consistent with what actually happens when the operation runs.

mratsim commented 6 years ago

I basically copy-pasted the matrix multiplication benchmark from https://github.com/kostya/benchmarks#matmul, which also includes JIT time.

I'm fine with having both Julia JIT and AOT measurements in the benchmarks; in any case, those are pretty old now (almost a year), so I need to update them.

Feel free to PR an alternative AOT implementation and I will update all the benchmarks.

mratsim commented 4 years ago

Note: I've updated the benchmark figures in https://github.com/mratsim/Arraymancer/commit/bd05448586f68aa0a4d8580932a8de5d534f3223