Open Magalame opened 5 years ago
@Magalame I definitely do have a strong desire to add SIMD support to massiv. I have tried a few things before and have been pondering on some ideas on how to best implement it for a while now. I think I might a good idea that could work. Time permitting, I'll try to implement a proof of concept in some near future and will ping you here so I can get some feedback. ;)
In short, the idea I have, is to go directly through C world with FFI, rather than using primitives from GHC.Prim
, since those require LLVM and I don't want to force anyone to install it.
@lehins I see, sounds good I'm excited :)
If as @klapaucius points out SIMD support gets to the NGC backend, it might avoid you a lot of trouble
It is really good news that ghc will finally be able generate SIMD natively.
I had a pretty good day in that direction as well. Was experimenting with implementing matrix multiplication using SIMD in C and FFI and saw a very nice speed up:
benchmarking Mult/Seq/multiplyTranspose
time 27.22 ms (27.09 ms .. 27.33 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 27.31 ms (27.25 ms .. 27.37 ms)
std dev 135.6 μs (103.9 μs .. 170.9 μs)
benchmarking Mult/Seq/multiplyTransposeSIMD
time 17.23 ms (17.04 ms .. 17.55 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 17.14 ms (17.07 ms .. 17.29 ms)
std dev 228.4 μs (133.3 μs .. 387.6 μs)
benchmarking Mult/Par/multiplyTranspose
time 7.683 ms (7.255 ms .. 8.234 ms)
0.982 R² (0.969 R² .. 0.996 R²)
mean 7.575 ms (7.426 ms .. 7.766 ms)
std dev 474.1 μs (285.0 μs .. 635.8 μs)
variance introduced by outliers: 34% (moderately inflated)
benchmarking Mult/Par/multiplyTransposeSIMD
time 3.454 ms (3.411 ms .. 3.503 ms)
0.999 R² (0.998 R² .. 0.999 R²)
mean 3.392 ms (3.357 ms .. 3.427 ms)
std dev 100.7 μs (77.90 μs .. 146.5 μs)
variance introduced by outliers: 14% (moderately inflated)
It is still pretty far from being production ready, but it is getting me closer to having some sort of SIMD capabilities in massiv
I have to say I'm fairly impressed with the native GHC SIMD (llvm) capabilities:
benchmarking DLA/Matrix-matrix multiplication
time 2.573 ms (2.546 ms .. 2.598 ms)
0.998 R² (0.995 R² .. 0.999 R²)
mean 2.608 ms (2.584 ms .. 2.660 ms)
std dev 110.6 μs (64.93 μs .. 220.6 μs)
variance introduced by outliers: 26% (moderately inflated)
benchmarking DLA/Matrix-matrix multiplication simd
time 396.5 μs (393.3 μs .. 401.3 μs)
0.999 R² (0.998 R² .. 0.999 R²)
mean 408.4 μs (404.7 μs .. 414.3 μs)
std dev 15.62 μs (11.65 μs .. 20.91 μs)
variance introduced by outliers: 32% (moderately inflated)
benchmarking DLA/Norm
time 20.65 μs (20.34 μs .. 20.98 μs)
0.999 R² (0.998 R² .. 0.999 R²)
mean 20.75 μs (20.57 μs .. 20.95 μs)
std dev 641.9 ns (540.2 ns .. 838.4 ns)
variance introduced by outliers: 34% (moderately inflated)
benchmarking DLA/Norm simd
time 4.357 μs (4.327 μs .. 4.387 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 4.346 μs (4.322 μs .. 4.376 μs)
std dev 89.97 ns (71.59 ns .. 114.1 ns)
variance introduced by outliers: 22% (moderately inflated)
benchmarking Hmatrix/Matrix-matrix multiplication
time 334.8 μs (332.1 μs .. 337.7 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 334.7 μs (332.6 μs .. 336.7 μs)
std dev 6.768 μs (5.800 μs .. 8.143 μs)
variance introduced by outliers: 12% (moderately inflated)
benchmarking Hmatrix/Norm
time 4.822 μs (4.756 μs .. 4.889 μs)
0.999 R² (0.998 R² .. 0.999 R²)
mean 4.836 μs (4.785 μs .. 4.894 μs)
std dev 182.2 ns (153.2 ns .. 217.7 ns)
variance introduced by outliers: 48% (moderately inflated)
I started a separate repo for now for the massiv-simd
package that will eventually find its home in this repository. For now it only contains a handful of functions that can do things on 128bit Doubles(x2), but the most important thing is to figure out the overall structure of the package and see if it is really viable. So far, I think, it looks really promising, both design wise and the speed observed.
I'll look into it! Someone also recommended me this paper on the same subject: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/haskell-beats-C.pdf
It's concerned with the current implementation of vector
so I'm not sure it'll be 100% relevant here, but it might yield some insight regarding the possibility of a more abstract API for SIMD
@Magalame It is a very interesting read, highly recommend it too. Unfortunately vector
still doesn't use SIMD, despite what the paper says. It does give a very good insight into very powerful stream fusion that is vector
build on, although I don't think it is terribly relevant in the context of multidimensional arrays, unless the arrays are ragged.
@klapaucius That's where that branch went, I was looking all over for it, I even asked Roman about it. :D (Thank you for posting a link to it!)
Despite that it was implemented 6 years ago, unfortunately it doesn't change what I said, it is still not in vector
@klapaucius your comments always are providential!
@klapaucius your comments are always so concise! :smile:
Spoke too soon about the SIMD support in ghc. But's ok, I am sure Ben Gamari will figure it out.
Hi!
I was wondering if there were any specific plan to implement SIMD instructions. I've been playing around with them in the context of the
dense-linear-algebra
library, and they seem to be fundamental to reach BLAS-like performance. I was wondering if you were planning to introduce SIMD tomassiv
?