mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

Provide a BLAS-free option #422

Open mratsim opened 4 years ago

mratsim commented 4 years ago

BLAS installation has been a recurrent problem for users and CI:

As explained on IRC, replacing BLAS while providing the same performance is possible via the Laser backend already integrated in Arraymancer.
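
For illustration only (hypothetical names, not Arraymancer's actual API): a BLAS-free build would route gemm calls to a pure-Nim kernel instead of linking BLAS. A naive version of that kernel is sketched below; in practice Laser's tiled and vectorized implementation would take its place.

proc naiveGemm[T](M, N, K: int; a, b: openArray[T]; c: var openArray[T]) =
  # Row-major C[MxN] += A[MxK] * B[KxN].
  # The real BLAS-free path would use Laser's optimized kernel, not this triple loop.
  for i in 0 ..< M:
    for k in 0 ..< K:
      let aik = a[i*K + k]
      for j in 0 ..< N:
        c[i*N + j] += aik * b[k*N + j]

when isMainModule:
  var c = newSeq[float64](2*2)
  naiveGemm(2, 2, 3,
            [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # A is 2x3
            [1.0, 0.0, 0.0, 1.0, 1.0, 1.0],  # B is 3x2
            c)
  echo c  # @[4.0, 5.0, 10.0, 11.0]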

The steps are:

The features that would not be available without BLAS would be:

elcritch commented 4 years ago

Would it be possible to include a lightweight pure C BLAS like https://github.com/michael-lehn/ulmBLAS as a fallback? I'm just learning about Nim, but as I understand it, it should be possible to create a Nimble package for something like ulmBLAS and have it configurable as an optional dependency?

Edit: Also, I'm not sure how good Nim is at link-time optimization, but for embedded devices it would be nice to have unused functions automatically stripped.

mratsim commented 4 years ago

Nim strips all unused functions; it's actually quite tricky to keep them.

ulmBLAS is missing several LAPACK routines; here is what I need for the features I mentioned in the post:

Regarding ulmBLAS in general, its speed will be severely lacking compared to OpenBLAS/MKL/BLIS/Laser, as the microkernel is SSE-only and does not use AVX or AVX512.

Furthermore, if your concern is embedded, ulmBLAS (or OpenBLAS) will need to compile in all of its assembly to produce a DLL, while a pure Nim solution can do dead-code elimination if you never use QR or LU decomposition, for example.

elcritch commented 4 years ago

Ok, ulmBLAS doesn't seem like it'd be ideal. I tried compiling it on Linux and it breaks.

What is the primary goal for this issue? Is it to write naive-Nim implementations, with or without optimized ASM kernels? It seems that maybe the CLAPACK C libraries could be auto-converted to Nim. Then they could be improved over time.

Also, if there is an attempt at pure Nim implementations (that aren't ASM-optimized), I wouldn't mind porting a few algorithms, as I'm currently having to one-off many of them in C (or in a high-level language using C matrices), and that's neither enjoyable nor complete. Hence why Nim/Arraymancer seems so interesting!

P.S. Referring to #430, would this future work always support a compile-time-only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.

mratsim commented 4 years ago

What is the primary goal for this issue? Is it to write naive-Nim implementations, with or without optimized ASM kernels? It seems that maybe the CLAPACK C libraries could be auto-converted to Nim. Then they could be improved over time.

Users on Windows in particular have reported issues installing BLAS. Even I couldn't get LAPACK working on Windows CI: https://github.com/mratsim/Arraymancer/issues/212.

Also, once Arraymancer was integrated into the Nim test suite as an important package, BLAS caused issues due to differing distro packaging: https://github.com/unicredit/nimblas/issues/3

And lastly, the packaging can just be plain wrong and has been giving wrong results for the past 6 months: https://bugs.archlinux.org/task/63054

Regarding performance, you don't need assembly to reach BLAS performance, but you do need to know your CPU and memory. I have already implemented the main reason to use BLAS, matrix multiplication, in pure Nim and can reach OpenBLAS speed on large matrices (small matrices are still significantly slower than OpenBLAS/MKL, but those are much easier to optimize: you need to deactivate some of the preprocessing used for optimized large matrix multiplication).

See implementation in Laser using OpenMP:

For performance, you can capture 80% of the speed by doing 1D/2D/3D tiling + vectorization. There is a detailed example of how to do 1D and 2D tiling in Laser on transposition:
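
Not the Laser code itself, but a minimal sketch of the same 2D-tiling idea applied to transposition; the tile size of 64 is an assumption, a typical value for keeping a block cache-resident.

proc transposeTiled[T](src: openArray[T]; dst: var openArray[T]; rows, cols: int) =
  # Walk the matrix tile by tile so that both the contiguous reads from `src`
  # and the strided writes to `dst` stay in cache while a tile is processed.
  const tileSize = 64
  for i0 in countup(0, rows - 1, tileSize):
    for j0 in countup(0, cols - 1, tileSize):
      for i in i0 ..< min(i0 + tileSize, rows):
        for j in j0 ..< min(j0 + tileSize, cols):
          dst[j * rows + i] = src[i * cols + j]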

In short:

My plan detailed here: https://github.com/mratsim/Arraymancer/issues/430#issuecomment-610307054 is to allow 2 different kinds of people to work on compute kernels:

proc convolve(A, F: Function): Function =
  # Generator for the convolution function
  var i, j, ii, jj: Domain   # ii/jj will hold the intra-tile iterators introduced by tiling
  var convH, convW: Function

  # Definition of the result function
  # Filter F = [1, 2, 1]
  # So 2D application is:
  # [[1, 2, 1],
  #  [2, 4, 2],
  #  [1, 2, 1]]
  #
  convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2]
  convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j+1] * F[2]

  # Optimization of the definition - completely optional
  convH.compute_at(convW, j) # Compute the required region of convH before convW starts (i.e. merged)
  # convH.compute_root()     # Alternatively do the height convolve before the width convolve (i.e. 2 steps)

  convW.tile(i, ii, 64)      # Split i in tile of size 64, with ii as the tile iterator
  convW.parallel(i)          # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)      # Split j in tile of size 64 as well with jj as the tile iterator
  convW.vectorize(jj, 8)     # Iterating on jj is done with size 8 vectors if possible (i.e. AVX on x86)

  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]
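
Once the generate macro exists, the emitted foobar would be called like any ordinary proc. Hypothetical usage matching the signature above, assuming img is a Tensor[float32]:

let blurred = foobar(img, [1'f32, 2'f32, 1'f32])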

P.S. Referring to #430, would this future work always support a compile-time-only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.

Yes. I don't want LLVM to be a requirement. It's huge.

elcritch commented 4 years ago

Just a quick response, that all sounds great! I'm impressed by how much detail you've already got. While I enjoy writing in Julia, the whole LLVM dependency is just a pain for embedded. Rust is a pain to cross-compile, and seems to have problems with arrays on the stack(?). Given all that, Nim's new ARC/move semantics really makes Nim/Arraymancer very promising for embedded usage (in addition to server side).

My plan detailed here: #430 (comment) is to allow 2 different kinds of people to work on compute kernels:

The researcher/implementer only has to get the algorithm right
The optimizer/performance engineer only has to tune the performance to the architecture

That seems reasonable. Glad you summarized it, as I didn't quite understand that from the linked comment. In my company's use case, simple machine learning techniques turn out to be pretty handy. So if there's a simple path to writing pure Nim implementations that just get the algorithm right first, one could cover most of the basic machine learning techniques pretty quickly, and then others could refine/tune the performance. That's how I'm understanding your previous comment. I'll look into your example some more! It is fun playing with vectorization/tiling...