mikeizbicki / subhask

Type safe interface for working in subcategories of Hask
BSD 3-Clause "New" or "Revised" License

GPUVector from Accelerate Backend #33

Open o1lo01ol1o opened 8 years ago

o1lo01ol1o commented 8 years ago

I've been going through Accelerate and https://github.com/mikeizbicki/subhask/blob/master/src/SubHask/Algebra/Vector.hs and I'm not exactly certain that integration will be smooth.

Maybe there's an elegant solution that escapes me, but I think that implementing a new Accelerate vector type will require creating wrappers for all of SubHask's algebraic functions that implement the corresponding functions in the Accelerate DSL under the hood.

As I understand it (see here for a simple overview), Accelerate expects all expressions to be of type Exp and to be composed of elements (Elt) or Accelerate Arrays. The latter can have arbitrary dimensionality, which must be specified in the type signature. The actual computational expressions cannot contain nested arrays, only tuples of Elt or Array; they have their own Accelerate implementations (e.g. Accelerate.fold) and are most efficient when expressed as a complete Accelerate function and then computed using run, run1, or CUDA.stream.
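For example, the standard dot-product idiom boils down to something like the sketch below (my rough rendering of the usual accelerate pattern, not code taken from either library): build an Acc expression out of Exp-level operations, then hand the whole thing to a backend's run.

import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.Interpreter as I

-- The expression tree: nothing is computed here, we only describe the work.
dotp :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float) -> A.Acc (A.Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (A.Z A.:. 4) [1,2,3,4] :: A.Vector Float
      ys = A.fromList (A.Z A.:. 4) [5,6,7,8] :: A.Vector Float
  -- `use` embeds host arrays into the expression; `run` evaluates it on a backend.
  print (I.run (dotp (A.use xs) (A.use ys)))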

(I found the kmeans example to be succinct.)

With the exception of the latter function, Accelerate functions are backend-agnostic, so they can be run on whatever backend is available, which is sort of nice. Unfortunately, Accelerate doesn't have working OpenCL support, and more unfortunately it doesn't have cuBLAS bindings (there are some CUDA 6.0 bindings here, but it doesn't look like they'll hook into Accelerate without some work: https://github.com/bmsherman/cublas). I also can't find any cuDNN bindings in Haskell. I'd imagine whatever GPU support you'd want would need to support these libraries, if only for the sake of HLearn. Unless, of course, someone wanted to do hand-rolled CUDA / OpenCL code that could be compiled and called from the Haskell FFI.
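For reference, the hand-rolled route would look roughly like the following; the kernel symbol and signature here are made up purely for illustration, and the real work would live in the CUDA C it binds to.

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CFloat (..), CInt (..))
import Foreign.Ptr (Ptr)

-- Hypothetical: a SAXPY kernel written by hand in CUDA C, compiled into a
-- shared library, and exposed to Haskell through the FFI.
foreign import ccall unsafe "hl_saxpy"
  c_saxpy :: CInt        -- ^ vector length
          -> CFloat      -- ^ scalar alpha
          -> Ptr CFloat  -- ^ device pointer to x
          -> Ptr CFloat  -- ^ device pointer to y (updated in place)
          -> IO ()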

mikeizbicki commented 8 years ago

As I understand it (see here for a simple overview), Accelerate expects all expressions to be of type Exp and to be composed of elements (Elt) or Accelerate Arrays. The latter can have arbitrary dimensionality, which must be specified in the type signature. The actual computational expressions cannot contain nested arrays, only tuples of Elt or Array; they have their own Accelerate implementations (e.g. Accelerate.fold) and are most efficient when expressed as a complete Accelerate function and then computed using run, run1, or CUDA.stream.

I don't see any of this as being a problem, but I haven't thought too hard about it so I might be missing something.

Why not have something like:

import qualified Prelude as Prelude
import Data.Array.Accelerate as A
import Data.Array.Accelerate.Interpreter as I

data Backend 
    = Interpreter 
    | CUDA
    | OpenCL
    | Repa
    | LLVM

-- | The backend phantom variable is used to statically encode the backend that will run the AccVector,
-- and the n phantom variable encodes the size of the vector, similar to how it is done for the SVector/UVector types.
newtype AccVector (backend::Backend) (n::k) a = AccVector (Acc (Array DIM1 a))

-- | This type synonym would ensure the types used in AccVector satisfy all the needed constraints.
-- There's a similar one in the hmatrix compatibility layer you could look at if you wanted.
type ValidAcc a = ... :: Constraint

-- There's no need to run the array inside the algebraic expressions.
-- Here, we're just building the syntax tree that will get evaluated later.
instance ValidAcc a => Semigroup (AccVector backend n a) where
    (+) (AccVector a1) (AccVector a2) = AccVector (A.zipWith (Prelude.+) a1 a2)

-- | I *think* each of the backends uses the same code to generate the Array,
-- but I'm not 100% sure about that or what the code looks like
mkAccVector :: SVector n a -> AccVector backend n a
mkAccVector = ...

-- | If I'm right about the above,
-- it should be safe to convert between backends arbitrarily,
-- but not to convert between sizes arbitrarily
convertAccBackend :: AccVector backend1 n a -> AccVector backend2 n a
convertAccBackend = unsafeCoerce

-- | For each backend, we'll have a different method for evaluating the expression tree we've generated.
-- This should also convert the result from being stored on the GPU to being in main memory.
class ValidBackend (backend::Backend) where
    runAccVector :: AccVector backend n a -> SVector n a

instance ValidBackend Interpreter where
    runAccVector (AccVector a) = _ $ I.run a

instance ValidBackend CUDA where
    runAccVector (AccVector a) = _

I think something like this will work. But again, I haven't thought about it too hard so I could easily be overlooking something.
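If it does pan out, using it would look roughly like this (a sketch that assumes the ValidAcc constraint, mkAccVector, and the ValidBackend instances above all get filled in, plus DataKinds for the promoted backend and size):

-- Build the expression tree on a statically chosen backend, then evaluate it
-- and pull the result back into an ordinary SVector in main memory.
doubleOnGpu :: ValidAcc Float => SVector 100 Float -> SVector 100 Float
doubleOnGpu v = runAccVector w
  where
    w :: AccVector 'CUDA 100 Float
    w = mkAccVector v + mkAccVector v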

With the exception of the latter function, Accelerate functions are backend-agnostic, so they can be run on whatever backend is available, which is sort of nice. Unfortunately, Accelerate doesn't have working OpenCL support, and more unfortunately it doesn't have cuBLAS bindings (there are some CUDA 6.0 bindings here, but it doesn't look like they'll hook into Accelerate without some work: https://github.com/bmsherman/cublas).

Certainly more bindings would be nice. And I suspect using a raw C FFI interface would be faster. But I think accelerate has enough current bindings to be useful, and it'd be a much better work/reward ratio than going straight to an FFI binding.

I also can't find any cuDNN bindings in Haskell. I'd imagine whatever GPU support you'd want would need to support these libraries, if only for the sake of HLearn. Unless, of course, someone wanted to do hand-rolled CUDA / OpenCL code that could be compiled and called from the Haskell FFI.

One of the ideas of HLearn (which isn't stated very explicitly anywhere) is to not use bindings like cuDNN, but instead have all of the machine learning implemented natively in Haskell. I'm pretty unhappy with the state of the current interfaces for machine learning libraries and don't want to get tied down to the existing way of doing things. One aspect of this is that I want HLearn's internals to be so simple that an undergrad CS student could fully grok everything. Part of this means each algorithm gets implemented only once, in generic form, and can then be used on anything that supports linear algebra (e.g. in main memory or on the GPU, with the only modification being the type signature of the vector that stores the data).
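Concretely, the goal is that helpers get written once against the algebra classes, something like the toy function below (reusing the Semigroup sketch from earlier in this thread, so treat it as illustrative rather than as actual HLearn code):

-- Written once against SubHask's Semigroup.  With v ~ SVector n Double this
-- sums in main memory; with v ~ AccVector backend n Double it would instead
-- build up a GPU expression to be evaluated later by runAccVector.
sumAll :: Semigroup v => v -> [v] -> v
sumAll acc []     = acc
sumAll acc (v:vs) = sumAll (acc + v) vs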

o1lo01ol1o commented 8 years ago

I think something like this will work. But again, I haven't thought about it too hard so I could easily be overlooking something.

You're probably right; my Haskell newbishness makes it difficult to see too far ahead.

One of the ideas of HLearn (which isn't very explicitly stated anywhere) is to not use bindings like cuDNN, but instead have all of the machine learning implemented natively in Haskell. I'm pretty unhappy with the state of current interfaces for machine learning libraries and don't want to get tied down into the existing way of doing things.

This is what attracted me to Haskell (and then to SubHask), but after digging through accelerate code and benchmarks, I started to wonder how performant even batched mmults would be after accelerate compiles whatever its understanding of a given expression is. Admittedly that's of secondary importance to getting GPU support at all, but one wants as much cake as one can get :)

mikeizbicki commented 8 years ago

I started to wonder how performant even batched mmults would be after accelerate compiles whatever its understanding of a given expression is.

I've also been curious and am looking forward to some benchmarks ;)
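For whoever takes a crack at it, a criterion harness around run1 is probably the easiest starting point. Something like the untested sketch below, which assumes the accelerate-cuda backend and uses a stand-in expression in place of whatever HLearn would actually generate:

import Criterion.Main
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.CUDA as CUDA  -- from accelerate-cuda

-- Stand-in workload: a fused zipWith/fold (a dot product of xs with itself).
step :: A.Acc (A.Vector Float) -> A.Acc (A.Scalar Float)
step xs = A.fold (+) 0 (A.zipWith (*) xs xs)

main :: IO ()
main = do
  let n  = 1000000
      xs = A.fromList (A.Z A.:. n) (replicate n 1) :: A.Vector Float
      go = CUDA.run1 step  -- compile the expression once, reuse it across runs
  defaultMain [ bench "zipWith+fold on GPU" (whnf go xs) ]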