mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

Provide a BLAS-free option #422

Open mratsim opened 4 years ago

mratsim commented 4 years ago

BLAS installation has been a recurrent problem for users and CI:

As explained on IRC, replacing BLAS while providing the same performance is possible via the Laser backend already integrated in Arraymancer.
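
For illustration only (hypothetical names, not Arraymancer's actual API): a BLAS-free build would route gemm calls to a pure-Nim kernel instead of linking BLAS. A naive version of that kernel is sketched below; in practice Laser's tiled and vectorized implementation would take its place.

proc naiveGemm[T](M, N, K: int; a, b: openArray[T]; c: var openArray[T]) =
  # Row-major C[MxN] += A[MxK] * B[KxN].
  # The real BLAS-free path would use Laser's optimized kernel, not this triple loop.
  for i in 0 ..< M:
    for k in 0 ..< K:
      let aik = a[i*K + k]
      for j in 0 ..< N:
        c[i*N + j] += aik * b[k*N + j]

when isMainModule:
  var c = newSeq[float64](2*2)
  naiveGemm(2, 2, 3,
            [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # A is 2x3
            [1.0, 0.0, 0.0, 1.0, 1.0, 1.0],  # B is 3x2
            c)
  echo c  # @[4.0, 5.0, 10.0, 11.0]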

The steps are:

The features that would not be available without BLAS would be:

elcritch commented 4 years ago

Would it be possible to include a lightweight pure C BLAS like https://github.com/michael-lehn/ulmBLAS as a fallback? I'm just learning about Nim, but as I understand it, it should be possible to create a Nimble package for something like ulmBLAS and have it configurable as an optional dependency?

Edit: Also, I'm not sure how good Nim is at link-time optimization, but for embedded devices it would be nice to have unused functions automatically stripped.

mratsim commented 4 years ago

Nim strips all unused functions; it's actually quite tricky to keep them.

ulmBLAS is missing several LAPACK routines; here is what I need for the features I mentioned in the post:

Regarding ulmBLAS in general, its speed will be severely lacking compared to OpenBLAS/MKL/BLIS/Laser, as the microkernel is SSE-only and does not use AVX or AVX512.

Furthermore, if your concern is embedded, ulmBLAS (or OpenBLAS) will need to compile in all of its assembly to produce a DLL, while a pure Nim solution can do dead-code elimination if you never use QR or LU decomposition, for example.

elcritch commented 4 years ago

Ok, ulmBLAS doesn't seem like it'd be ideal. I tried compiling it on Linux and it breaks.

What is the primary goal for this issue? Is it to write naive-Nim implementations, with or without optimized ASM kernels? It seems that maybe the CLAPACK C libraries could be auto-converted to Nim. Then they could be improved over time.

Also, if there is an attempt at pure Nim implementations (that aren't ASM-optimized), I wouldn't mind porting a few algorithms, as I'm currently having to one-off many of them in C (or in a high-level language using C matrices), and that's neither enjoyable nor complete. Hence why Nim/Arraymancer seems so interesting!

P.S. Referring to #430, would this future work always support a compile-time-only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.

mratsim commented 4 years ago

What is the primary goal for this issue? Is it to write naive-Nim implementations, with or without optimized ASM kernels? It seems that maybe the CLAPACK C libraries could be auto-converted to Nim. Then they could be improved over time.

Users on Windows in particular have reported issues installing BLAS. Even I couldn't get LAPACK working on Windows CI: https://github.com/mratsim/Arraymancer/issues/212.

Also, once Arraymancer was integrated into the Nim test suite as an important package, BLAS caused issues due to differing distro packaging: https://github.com/unicredit/nimblas/issues/3

And lastly, the packaging can just be plain wrong and has been giving wrong results for the past 6 months: https://bugs.archlinux.org/task/63054

Regarding performance, you don't need assembly to reach BLAS performance, but you do need to know your CPU and memory. I have already implemented the main reason to use BLAS, matrix multiplication, in pure Nim and can reach OpenBLAS speed on large matrices (small matrices are still significantly slower than OpenBLAS/MKL, but those are much easier to optimize: you need to deactivate some of the preprocessing used for optimized large matrix multiplication).

See implementation in Laser using OpenMP:

For performance, you can capture 80% of the speed by doing 1D/2D/3D tiling + vectorization. There is a detailed example of how to do 1D and 2D tiling in Laser on transposition:
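
Not the Laser code itself, but a minimal sketch of the same 2D-tiling idea applied to transposition; the tile size of 64 is an assumption, a typical value for keeping a block cache-resident.

proc transposeTiled[T](src: openArray[T]; dst: var openArray[T]; rows, cols: int) =
  # Walk the matrix tile by tile so that both the contiguous reads from `src`
  # and the strided writes to `dst` stay in cache while a tile is processed.
  const tileSize = 64
  for i0 in countup(0, rows - 1, tileSize):
    for j0 in countup(0, cols - 1, tileSize):
      for i in i0 ..< min(i0 + tileSize, rows):
        for j in j0 ..< min(j0 + tileSize, cols):
          dst[j * rows + i] = src[i * cols + j]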

In short:

My plan detailed here: https://github.com/mratsim/Arraymancer/issues/430#issuecomment-610307054 is to allow 2 different kinds of people to work on compute kernels:

proc convolve(A, F: Function): Function =
  # Generator for the convolution function
  var i, j, ii, jj: Domain   # ii/jj will hold the intra-tile iterators introduced by tiling
  var convH, convW: Function

  # Definition of the result function
  # Filter F = [1, 2, 1]
  # So 2D application is:
  # [[1, 2, 1],
  #  [2, 4, 2],
  #  [1, 2, 1]]
  #
  convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2]
  convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j+1] * F[2]

  # Optimization of the definition - completely optional
  convH.compute_at(convW, j) # Compute the required region of convH before convW starts (i.e. merged)
  # convH.compute_root()     # Alternatively do the height convolve before the width convolve (i.e. 2 steps)

  convW.tile(i, ii, 64)      # Split i in tile of size 64, with ii as the tile iterator
  convW.parallel(i)          # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)      # Split j in tile of size 64 as well with jj as the tile iterator
  convW.vectorize(jj, 8)     # Iterating on jj is done with size 8 vectors if possible (i.e. AVX on x86)

  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]
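
Once the generate macro exists, the emitted foobar would be called like any ordinary proc. Hypothetical usage matching the signature above, assuming img is a Tensor[float32]:

let blurred = foobar(img, [1'f32, 2'f32, 1'f32])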

P.S. Referring to #430, would this future work always support a compile-time-only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.

Yes. I don't want LLVM to be a requirement. It's huge.

elcritch commented 4 years ago

Just a quick response, that all sounds great! I'm impressed by how much detail you've already got. While I enjoy writing in Julia, the whole LLVM dependency is just a pain for embedded. Rust is a pain to cross-compile, and seems to have problems with arrays on the stack(?). Given all that, Nim's new ARC/move semantics really makes Nim/Arraymancer very promising for embedded usage (in addition to server side).

My plan detailed here: #430 (comment) is to allow 2 different kinds of people to work on compute kernels:

The researcher/implementer only has to get the algorithm right
The optimizer/performance engineer only has to tune the performance to the architecture

That seems reasonable. Glad you summarized it, as I didn't quite understand that from the linked comment. In my company's use case, simple machine learning techniques turn out to be pretty handy. So if there's a simple path to writing pure Nim implementations that just get the algorithm right first, one could cover most of the basic machine learning techniques pretty quickly, and then others could refine/tune the performance. That's how I'm understanding your previous comment. I'll look into your example some more! It is fun playing with vectorization/tiling...