shashi / ArrayMeta.jl

metaprogramming for Julia arrays

What does it take to have a GPU backend? #5

Open ViralBShah opened 6 years ago

ViralBShah commented 6 years ago

cc @MikeInnes

shashi commented 6 years ago

It should just be a matter of starting small and making all the cases work, but I think Mike has already done this in Tokamak?

MikeInnes commented 6 years ago

I had some code at one point, it's not too difficult for basic stuff. If we're interested in this again I'll resurrect it.

ViralBShah commented 6 years ago

Only if we can reuse it meaningfully in other components such as Flux, Knet, etc.

SimonDanisch commented 6 years ago

Would be nice to have it to play around with ;)

shashi commented 6 years ago

I think we should do it. It ought to have good return on investment.

SimonDanisch commented 6 years ago

If we want to make a project out of this, I would propose the following:

One of the challenges will be to actually get decent speedups. If we just create some fusion framework without noticeable speedups, we won't win much besides throwing around fusion as a buzzword ;)

I think @maleadt and @vchuravy just recently had some experience with how tiring it can be to get actual speedups on the GPU, even when doing it manually.

The high-performance C GPU kernels I know look pretty chaotic to me, and I wouldn't know an obvious mapping from a high-level representation to those kernels right away.

So to familiarize ourselves with this, one of the first things I would do is figure out what a flexible, Julian, high-performance kernel can actually look like.

E.g. use functors + closures + unrolling + tiling functions to create kernels that can be scheduled differently just via their type parameters. This would let us explore the problem space and would have the added benefit that we get high-level primitives for writing high-performance kernels - so if the automatic code generation fails, one doesn't need to drop back to the absolute low level.
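For instance, here is a rough sketch of that idea (Tiled and saxpy! are made-up names for illustration, not an existing API): a scheduling choice, in this case just a tile width, is carried in a type parameter, so the same kernel body can be specialized and unrolled differently without being rewritten.

struct Tiled{N} end   # N = tile width, carried in the type

function saxpy!(::Type{Tiled{N}}, y, a, x) where {N}
    n = length(x)
    i = 1
    while i + N - 1 <= n
        # fixed-size inner loop; the compiler can unroll it since N is a compile-time constant
        @inbounds for k in 0:N-1
            y[i+k] += a * x[i+k]
        end
        i += N
    end
    @inbounds for j in i:n   # remainder elements
        y[j] += a * x[j]
    end
    return y
end

y = zeros(1000); x = rand(1000)
saxpy!(Tiled{8}, y, 2.0, x)   # switch schedules by swapping the type parameter

Switching from Tiled{8} to, say, Tiled{16} changes the schedule without touching the kernel body, which is exactly the kind of knob an automatic code generator could turn.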

Then we should probably clarify our goal a bit more.

A decent first goal seems to be to compile the softmax function.

So that user code like this:


function softmax!(y, x; dim=1)
    tmp = maximum(x, dims=dim)   # subtract the per-slice maximum for numerical stability
    y .= exp.(x .- tmp)
    sum!(tmp, y)                 # reuse tmp as the normalization buffer
    y ./= tmp
    return y
end

will get turned into a performant kernel like the one in:

https://github.com/JuliaGPU/GPUArrays.jl/blob/5d373d0f20fba969f1956f4f9331099c8cb024da/src/nlib.jl#L129

I think this is also a good example that will demonstrate whether a user actually likes to write this in Einstein notation, or whatever Tokamak came up with (not really sure what the high-level Tokamak code would look like ;)), or if people like to continue writing their code like the above.

When I find the time, I could start by finishing the linked high-performance softmax kernel and rewriting it to be more Julian without losing performance. This way, we would have a good baseline for further experiments. If @MikeInnes could come from the other side and get the Tokamak softmax example working again, we could try to put those together and reach equal performance.

ViralBShah commented 6 years ago

That's a pretty good outline for what is needed.

MikeInnes commented 6 years ago

Yeah, so I think there's a "top down" and a "bottom up" approach to this problem. The top-down approach starts with the high-level issues (What does the user write down? What transformations can we make?). The bottom-up approach starts by just writing fast kernels by hand and then gradually factoring out the differences. I think we need both, and that leads to a nice division of labour. Attacking softmax from both ends seems like a great starter problem.

For my part, I've largely been focused on CPU and distributed stuff because I understand the performance model well there. Of course, basic stuff like map/reduce fusion tends to give a speedup everywhere anyway, so things are still pretty backend-agnostic.

not really sure what the high-level Tokamak code would look like

To give a rough flavour, Tokamak basically gives you a sort of "array lambda calculus" where arrays are just functions, e.g. [i] -> xs[i] + 1 is the same as map(x -> x+1, xs). One nice thing is that it unifies nicely with normal vectorised array code, e.g. you can happily write softmax(xs) = exp.(xs) ./ sum(exp.(xs)) for softmax and it'll work, or mix and match the approaches. Then loop fusion is just function inlining, and so on.
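To illustrate the idea concretely in plain Julia (this is not Tokamak's actual API, just the concept): treating an array as a function of its index makes loop fusion fall out of ordinary function composition and inlining.

xs = rand(10)

# the "array as an index function" view of [i] -> xs[i] + 1
f = i -> xs[i] + 1
ys_lambda = [f(i) for i in eachindex(xs)]

# the vectorised view
ys_map = map(x -> x + 1, xs)

ys_lambda == ys_map   # true

# composing index functions and materializing once is loop fusion:
g = i -> 2 * f(i)                    # "inline" f into g
zs = [g(i) for i in eachindex(xs)]   # single pass over xs, no intermediate array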

vchuravy commented 6 years ago

I think @maleadt and @vchuravy just recently had some experience with how tiring it can be to get actual speedups on the GPU, even when doing it manually.

Actually, we were rather pleasantly surprised. The effort we expended went into understanding the memory bandwidth constraints and how they impacted performance. One of the lessons I learned is that data layout is one of the critical aspects of using Julia for GPUs, which is why I wrote a GPU-aware StructsOfArray implementation: https://github.com/jrevels/MixedModeBroadcastAD.jl/pull/14 (we are no longer using that, for other reasons).
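To make the layout point concrete, here is a minimal CPU-side sketch of the array-of-structs vs. struct-of-arrays difference (the Particle type is just an illustration; the linked PR does this in a GPU-aware way):

struct Particle
    x::Float32
    v::Float32
end

# array of structs: fields interleaved in memory as x, v, x, v, ...
aos = [Particle(rand(Float32), rand(Float32)) for _ in 1:1024]

# struct of arrays: each field contiguous, which is what coalesced GPU loads want
soa = (x = Float32[p.x for p in aos], v = Float32[p.v for p in aos])

# the same update touches memory very differently under the two layouts
step_aos(ps, dt) = [Particle(p.x + dt * p.v, p.v) for p in ps]
step_soa!(s, dt) = (s.x .+= dt .* s.v; s)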

@peterahrens might have some ideas as well.

SimonDanisch commented 6 years ago

Sorry for misrepresenting! ;) It just looked to me as if it was quite a bit of work, with results that weren't immediately obvious from the beginning (e.g. I'm not surprised that optimizing the data layout was a major part of the work, but I also wouldn't have expected it).

So in other words, lessons you want to learn as early as possible!

[i] -> xs[i] + 1 is the same as map(x -> x+1, xs)

I like that on a first look! :)

shashi commented 6 years ago

I think this is also a good example that will demonstrate whether a user actually likes to write this in Einstein notation, or whatever Tokamak came up with (not really sure what the high-level Tokamak code would look like ;))

I think Tokamak's notation is more general-purpose than the one in this package, but I can't find the README I think it had. We should adapt it anyway, since it has been thought out more to be in tune with the ML/GPU ecosystem.

or if people like to continue writing their code like the above.

I think (or at least this is my mental model in writing this PoC) the user should write code like in your example, i.e. good old vectorized Julia. However, the implementation of those vectorized operations (e.g. broadcast) can be done using Tokamak or ArrayMeta, because that's easier and allows analyses. Fusing might require a @fuse syntactic macro, or Cassette, prefixed to the entire softmax definition.
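As a rough illustration of the prefixing idea (the @fuse macro does not exist; this no-op placeholder only shows where such a rewrite would hook in):

# hypothetical @fuse: a real version would walk the definition, build an
# ArrayMeta/Tokamak IR, fuse the loops, and emit a rewritten method;
# this placeholder just returns the definition unchanged
macro fuse(ex)
    return esc(ex)
end

@fuse function softmax!(y, x; dim=1)
    tmp = maximum(x, dims=dim)
    y .= exp.(x .- tmp)
    y ./= sum(y, dims=dim)
    return y
end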

MikeInnes commented 6 years ago

I think it's also important that you can drop down and say "I know better" – if you happen to have a really good softmax implementation that we can't yet generate automatically, you should be able to use that and get the fusion benefits elsewhere.

Ideally, things like GPU reduce would be user-level primitives, and you'd be able to hint to Tokamak what kind of optimisations it can apply (e.g. you can fuse a map into the input or output of basically any kernel). I'm not sure what that looks like yet, though.
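For reference, the CPU analogue of fusing a map into a kernel's input is just mapreduce versus a separate map followed by a reduce:

xs = rand(1000)

s_unfused = sum(exp.(xs))           # materializes exp.(xs) before reducing
s_fused   = mapreduce(exp, +, xs)   # applies exp element-by-element inside the reduction

s_unfused ≈ s_fused   # true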

vchuravy commented 6 years ago

Yeah, there are a couple of orthogonal problems:

  1. Representation of operations and data dependencies
  2. Scheduling and tiling
  3. Compilation and execution

Halide is about expressing schedules and tiling separately from the representation of operations, acknowledging that different hardware requires different scheduling and tiling.

Chapel has the interesting idea of expressing 2 in the form of domains, also allowing

for tile in domain
end

which is iteration over stencils, where iteration order and parallelism are chosen depending on the hardware. @ChrisRackauckas once discussed with me the idea of a stencil compiler for parallel and GPU operators.
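A rough Julia analogue of that domain/tile iteration, with the tiling factored out from the per-element computation (foreach_tile is a made-up helper, not Chapel's or Halide's API):

# iterate a matrix by tiles; the tile size (and, in a real system, the
# iteration order and parallelism) is a scheduling decision kept separate
# from the per-tile kernel body
function foreach_tile(f, A::AbstractMatrix; tile=(64, 64))
    for J in Iterators.partition(axes(A, 2), tile[2]),
        I in Iterators.partition(axes(A, 1), tile[1])
        f(view(A, I, J))
    end
end

A = zeros(256, 256)
foreach_tile(A; tile=(32, 32)) do blk
    blk .+= 1.0   # per-tile kernel, independent of the schedule
end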