lehins / massiv

Efficient Haskell Arrays featuring Parallel computation
BSD 3-Clause "New" or "Revised" License
383 stars 23 forks source link

Support for SIMD instructions #80

Open Magalame opened 5 years ago

Magalame commented 5 years ago

Hi!

I was wondering if there were any specific plan to implement SIMD instructions. I've been playing around with them in the context of the dense-linear-algebra library, and they seem to be fundamental to reach BLAS-like performance. I was wondering if you were planning to introduce SIMD to massiv?

lehins commented 5 years ago

@Magalame I definitely do have a strong desire to add SIMD support to massiv. I have tried a few things before and have been pondering on some ideas on how to best implement it for a while now. I think I might a good idea that could work. Time permitting, I'll try to implement a proof of concept in some near future and will ping you here so I can get some feedback. ;)

In short, the idea I have, is to go directly through C world with FFI, rather than using primitives from GHC.Prim, since those require LLVM and I don't want to force anyone to install it.

klapaucius commented 5 years ago

https://gitlab.haskell.org/ghc/ghc/merge_requests/1306

Magalame commented 5 years ago

@lehins I see, sounds good I'm excited :)

If as @klapaucius points out SIMD support gets to the NGC backend, it might avoid you a lot of trouble

klapaucius commented 5 years ago

Aaaaand... merged https://gitlab.haskell.org/ghc/ghc/commit/acd795583625401c5554f8e04ec7efca18814011

lehins commented 5 years ago

It is really good news that ghc will finally be able generate SIMD natively.

I had a pretty good day in that direction as well. Was experimenting with implementing matrix multiplication using SIMD in C and FFI and saw a very nice speed up:

benchmarking Mult/Seq/multiplyTranspose
time                 27.22 ms   (27.09 ms .. 27.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 27.31 ms   (27.25 ms .. 27.37 ms)
std dev              135.6 μs   (103.9 μs .. 170.9 μs)

benchmarking Mult/Seq/multiplyTransposeSIMD
time                 17.23 ms   (17.04 ms .. 17.55 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 17.14 ms   (17.07 ms .. 17.29 ms)
std dev              228.4 μs   (133.3 μs .. 387.6 μs)

benchmarking Mult/Par/multiplyTranspose
time                 7.683 ms   (7.255 ms .. 8.234 ms)
                     0.982 R²   (0.969 R² .. 0.996 R²)
mean                 7.575 ms   (7.426 ms .. 7.766 ms)
std dev              474.1 μs   (285.0 μs .. 635.8 μs)
variance introduced by outliers: 34% (moderately inflated)

benchmarking Mult/Par/multiplyTransposeSIMD
time                 3.454 ms   (3.411 ms .. 3.503 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 3.392 ms   (3.357 ms .. 3.427 ms)
std dev              100.7 μs   (77.90 μs .. 146.5 μs)
variance introduced by outliers: 14% (moderately inflated)

It is still pretty far from being production ready, but it is getting me closer to having some sort of SIMD capabilities in massiv

Magalame commented 5 years ago

I have to say I'm fairly impressed with the native GHC SIMD (llvm) capabilities:

benchmarking DLA/Matrix-matrix multiplication
time                 2.573 ms   (2.546 ms .. 2.598 ms)
                     0.998 R²   (0.995 R² .. 0.999 R²)
mean                 2.608 ms   (2.584 ms .. 2.660 ms)
std dev              110.6 μs   (64.93 μs .. 220.6 μs)
variance introduced by outliers: 26% (moderately inflated)

benchmarking DLA/Matrix-matrix multiplication  simd
time                 396.5 μs   (393.3 μs .. 401.3 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 408.4 μs   (404.7 μs .. 414.3 μs)
std dev              15.62 μs   (11.65 μs .. 20.91 μs)
variance introduced by outliers: 32% (moderately inflated)

benchmarking DLA/Norm
time                 20.65 μs   (20.34 μs .. 20.98 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 20.75 μs   (20.57 μs .. 20.95 μs)
std dev              641.9 ns   (540.2 ns .. 838.4 ns)
variance introduced by outliers: 34% (moderately inflated)

benchmarking DLA/Norm simd
time                 4.357 μs   (4.327 μs .. 4.387 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 4.346 μs   (4.322 μs .. 4.376 μs)
std dev              89.97 ns   (71.59 ns .. 114.1 ns)
variance introduced by outliers: 22% (moderately inflated)

benchmarking Hmatrix/Matrix-matrix multiplication
time                 334.8 μs   (332.1 μs .. 337.7 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 334.7 μs   (332.6 μs .. 336.7 μs)
std dev              6.768 μs   (5.800 μs .. 8.143 μs)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Hmatrix/Norm
time                 4.822 μs   (4.756 μs .. 4.889 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 4.836 μs   (4.785 μs .. 4.894 μs)
std dev              182.2 ns   (153.2 ns .. 217.7 ns)
variance introduced by outliers: 48% (moderately inflated)
lehins commented 5 years ago

I started a separate repo for now for the massiv-simd package that will eventually find its home in this repository. For now it only contains a handful of functions that can do things on 128bit Doubles(x2), but the most important thing is to figure out the overall structure of the package and see if it is really viable. So far, I think, it looks really promising, both design wise and the speed observed.

Magalame commented 5 years ago

I'll look into it! Someone also recommended me this paper on the same subject: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/haskell-beats-C.pdf It's concerned with the current implementation of vector so I'm not sure it'll be 100% relevant here, but it might yield some insight regarding the possibility of a more abstract API for SIMD

lehins commented 5 years ago

@Magalame It is a very interesting read, highly recommend it too. Unfortunately vector still doesn't use SIMD, despite what the paper says. It does give a very good insight into very powerful stream fusion that is vector build on, although I don't think it is terribly relevant in the context of multidimensional arrays, unless the arrays are ragged.

klapaucius commented 5 years ago

https://github.com/mainland/vector/commits/simd

lehins commented 5 years ago

@klapaucius That's where that branch went, I was looking all over for it, I even asked Roman about it. :D (Thank you for posting a link to it!) Despite that it was implemented 6 years ago, unfortunately it doesn't change what I said, it is still not in vector

Magalame commented 5 years ago

@klapaucius your comments always are providential!

klapaucius commented 5 years ago

https://gitlab.haskell.org/ghc/ghc/merge_requests/1379

lehins commented 5 years ago

@klapaucius your comments are always so concise! :smile:

Spoke too soon about the SIMD support in ghc. But's ok, I am sure Ben Gamari will figure it out.