mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Apache License 2.0

parallel reduction #36

Closed brentp closed 5 years ago

brentp commented 5 years ago

hi, I wanted to try out laser. I have this code working:

import random, sequtils, times, strformat   # stdlib: rand/randomize, newSeqWith, cpuTime, "&" formatting
import laser/openmp                         # omp_parallel_chunks_default, omp_get_max_threads, omp_get_thread_num

proc pmin(s: var seq[float32]): float32 {.noInline.} =

  var min_by_thread = newSeq[float32](omp_get_max_threads())
  for v in min_by_thread.mitems:
    v = float32.high

  omp_parallel_chunks_default(s.len, chunk_offset, chunk_size):
    #[
    attachGC()
    min_by_thread[omp_get_thread_num()] = min(
        min_by_thread[omp_get_thread_num()],
        min(s[chunk_offset..<(chunk_offset + chunk_size)])
        )
    detachGC()
    ]#

    var thread_min = min_by_thread[omp_get_thread_num()]
    #echo chunk_offset, " ", chunk_size

    for idx in chunk_offset ..< chunk_offset + chunk_size:
      thread_min = min(s[idx], thread_min)
    min_by_thread[omp_get_thread_num()] = thread_min

  result = min(min_by_thread)

Do I need an omp_critical section for the final result, and are there any other problems? Here is my calling code, adapted from your examples/:

proc main() =
  randomize(42) # Reproducibility
  var x = newSeqWith(800_000_000, float32 rand(1.0))
  x[200_000_001] = -42.0'f32
  echo omp_get_num_threads(), " ", omp_get_max_threads()

  var t = cpuTime()
  let m = min(x)

  echo "serial  :", m, &" in {cpuTime() - t:.2f} seconds"

  for i in 0..10:
    t = cpuTime()
    let mp = x.pmin()
    doAssert abs(mp - m) < 1e-10
    echo "parallel:", mp, &" in {cpuTime() - t:.2f} seconds"

main()
mratsim commented 5 years ago

Laser is still in research mode, so plenty of things are implemented but not properly exposed in a high-level API.

To do a reduction, you can follow the same approach as the existing sum reduction kernel.

I will create min and max tomorrow, so that they are ready to use.

Alternatively, if you use a Tensor, this example shows 4 ways to do a parallel reduction: https://github.com/numforge/laser/blob/af191c086b4a98c49049ecf18f5519dc6856cc77/examples/ex05_tensor_parallel_reduction.nim#L9-L95

Note that the underlying forEachStaged macro doesn't strictly require a Tensor, just a type that exposes rank, size, shape, strides and unsafe_raw_data, as described here: https://github.com/numforge/laser/tree/master/laser/strided_iteration#strided-parallel-iteration-for-tensors. So it works with a seq if those are defined.
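
For illustration, here is a rough sketch of such a seq-backed type. The proc names come from the interface list above, but the concrete return types are my assumptions, not laser's actual metadata types:

# Sketch only: a minimal 1-D, seq-backed type exposing the interface listed
# above (rank, size, shape, strides, unsafe_raw_data). Return types are
# assumptions; laser's own Tensor uses dedicated metadata types.
type SeqView*[T] = object
  buf*: seq[T]

proc rank*[T](v: SeqView[T]): int = 1
proc size*[T](v: SeqView[T]): int = v.buf.len
proc shape*[T](v: SeqView[T]): array[1, int] = [v.buf.len]
proc strides*[T](v: SeqView[T]): array[1, int] = [1]  # contiguous storage
proc unsafe_raw_data*[T](v: SeqView[T]): ptr UncheckedArray[T] =
  cast[ptr UncheckedArray[T]](v.buf[0].unsafeAddr)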

mratsim commented 5 years ago

I've added reduce_min and reduce_max (and renamed sum_kernel to reduce_sum) in #39.

They only work for float32 at the moment, but if needed it's easy to extend them to other types.
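
For reference, a minimal usage sketch of the new kernels (hedged: the pointer-plus-length call for reduce_min is taken from the snippet further down in this thread, and reduce_max is assumed to mirror it):

import sequtils
import laser/primitives/reductions

var data = newSeqWith(5, 0'f32)
for i in 0 ..< data.len:
  data[i] = float32(i)                          # 0, 1, 2, 3, 4
echo reduce_min(data[0].unsafeAddr, data.len)   # expected: 0.0
echo reduce_max(data[0].unsafeAddr, data.len)   # expected: 4.0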

brentp commented 5 years ago

thanks very much for your links and the new reduce_min stuff. I can get this to work from the laser src directory, but if I move it elsewhere I get a long compile error ending with:

In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/x86intrin.h:35:0,
                 from /home/brentp/.cache/nim/pmin_r/.nimble7pkgs7Laser-0.0.17laser7primitives7simd__math7reductions__sse3.nim.c:10:
/usr/lib/gcc/x86_64-linux-gnu/5/include/pmmintrin.h:68:1: error: inlining failed in call to always_inline ‘_mm_movehdup_ps’: target specific option mismatch
 _mm_movehdup_ps (__m128 __X)
 ^
/home/brentp/.cache/nim/pmin_r/.nimble7pkgs7Laser-0.0.17laser7primitives7simd__math7reductions__sse3.nim.c:56:7: error: called from here
  shuf = _mm_movehdup_ps(vec);

I can move the same file containing:

import
  random, sequtils,
  laser/primitives/reductions

proc main() =
  let interval = -1f .. 1f
  let size = 10_000_000
  let buf = newSeqWith(size, rand(interval))
  echo reduce_min(buf[0].unsafeAddr, buf.len)

main()

in and out of ~/src/laser, and it works inside that directory but not outside it. I am compiling with nim c -d:openmp -d:danger -d:fastmath -a -r pmin.nim

brentp commented 5 years ago

btw, this gives nearly a 5X speedup on my laptop for my example use-case, so this will be a nice improvement!

mratsim commented 5 years ago

That's unfortunately one of Nim's limitations.

If you look into the reductions_sse3 file, it calls min_ps_sse3 (https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/simd_math/reductions_sse3.nim#L59), which uses SSE3 intrinsics from https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/private/sse3_utils.nim#L8-L18

On x86_64 the compiler can only assume SSE2 support and more advanced SIMD instructions require an explicit compiler flag.

As I want the library to have a fallback when no SSE3 is available, I can't just use {.passC:"-msse3".} globally (though you can in your own application).
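
A hedged sketch of that "though you can" route, assuming your application only ever runs on SSE3-capable CPUs:

# In your own application file. {.passC.} adds the flag to the C compiler
# invocation for the whole build; laser itself avoids this because it wants
# an SSE2-only fallback.
{.passC: "-msse3".}

import laser/primitives/reductions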

So the SSE3 flag is passed per-file (instead of globally) via an undocumented feature of nim.cfg: https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/nim.cfg#L32.

So you need to add yourfilename.always = "-msse3" to your nim.cfg if you use the primitive outside of laser. Note that I don't define sse3_utils.always because min_ps_sse3 is inline, and so is not present in the sse3_utils C file.
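
A hedged example of what that looks like, assuming the calling file is named pmin.nim as in the compile command above; it mirrors the per-file flags in laser's own nim.cfg linked above:

# nim.cfg placed next to pmin.nim
pmin.always = "-msse3"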

Ultimately, @Araq said that he wants to provide a way to set per-file compilation flags directly in a .nim file, which would be very helpful.

brentp commented 5 years ago

got it. thanks for the explanation.