Crazy idea: make an nngraph 'optimizer'

hughperkins commented 9 years ago

Crazy idea: make an nngraph 'optimizer'.

I'm calling it crazy so I dont accidentally commit myself to doing something I then find wont work for reason x,y,z :-P Plus, it is a little way out there.

Thinking of creating an optimizer for nngraph, that takes a graph as input, and then culls edges where it can. eg, maybe replaces nn.Narrow with torch.Narrow, and removes the nn.Narrow node. Since I haven't actually implemented this yet, I dont know how possible this is going to be, and whether obstacles I meet are merely challenging, or are weeks of work.

I'm also thinking of something more general, and more gpu-specific, of implementing gmodule within clnn, and walking the graph within clnn, and joining some of the nodes together, during the forward/backward calls, eg replace a Linear + a Tanh by a single kernel that calls both, in one launch.

Overriding motivatoin for all of this is basically on certain gpus, I notice that the overhead of launching kernels, in char-rnn, appears to dominate the runtime. I havent used any particular diagnostic tools to confirm this, and perhaps I should, but I notice that I can increase the data size 20 times whilst the execution time only increases by aobut 10-20%, which suggests to me that it's not a calc/data issue, but plausibly linked to kernel launches.

hughperkins commented 9 years ago

So, managed to ~~stop~~ slow down compulsively refreshing http://github.com enough to actually get as far as backprop, for culling Narrows inside gModule. It turns out that removing the forward prop nn.Narrow is incredibly easy, because a node only ever has one output. But removing Narrows during the backprop, replacing nn.Narrow with torch.Narrows, turns out to be considerably more challenging. Typically, a node might feed, on the backwards prop, into multiple nn.Narrow nodes. That means need to launch kernels for copy anyway, which is what I'm trying to get rid of. Or else need to inject a tensor from each child into the kernel of the Narrow's parent, which is non-trivial.

hughperkins commented 9 years ago

Seems like to get this to work, would need to do something like get each nn layer to return the uncompiled, but already templated probably, kernel source code, through some kind of generic function, so they can then be stitched together somehow.

soumith commented 9 years ago

There are some people working on it :) but in different approaches.

@wickedfoo

nicolasvasilache commented 9 years ago

Hugh, my take on this is that the fusion problem is a traditional compiler problem resembling loop fusion. For most (all?) non trivial cases you have to look at the dependence analysis structure of your problem. A direction that I like is: http://mcl.csa.iisc.ernet.in/polymage.html, it has the right formalism but I doubt the tools are usable yet by non-researchers in the field (I may be wrong). In any case this is a super interesting project and very important too, one question is how much fusion is actually necessary vs what you can achieve with a properly tiled approach.

wickedfoo commented 9 years ago

@hughperkins I am sort of working on this problem, not so much on fusion of kernels themselves but on analyzing and scheduling a computation graph, and assigning computations to resources (CPUs, GPUs, remote machines) without requiring manual assignment, to yield better performance than a manual assignment or the normal scheduling of work that, say, :forward() or :backward() would yield.

However it effectively involves a rewrite of torch-nn/cunn/nngraph because the existing Torch nn framework does not provide enough information regarding read/write dependencies. Plus, Torch nn is written in a stateful OO manner and has no real guidelines when it comes to internal state or dependencies which causes many problems with trying to extract a computation graph and recompose it. The current framework I feel is very much at a technological dead-end and is not suited for a multi-machine world.

I'm not entirely convinced that fusion of simpler layers would yield that much performance either. Pretty much everything (for CUDA/GPU at least, very much not so for CPU since the Torch use of OpenMP is non-sensical and there's no vectorization at all) is bandwidth bound, not FLOP bound. This is a regime for which fusion would yield gains, but much of the heavy work is not in the pointwise modules, but in more complicated things like matrix multiplication, pooling or convolution. In order to fuse computations with these things, I think that everything would have to be written in a formulation like Polymage or Halide in order to fuse more complicated computations.

hughperkins commented 9 years ago

Hi nicolasvasilache and wickedfoo , I think that what you're doing is way more advanced than what I'm doing at the moment, so you should probalby ignore me for now ;-) I should probably read through what you guys are doing, and see how I can fit around that.

wickedfoo commented 9 years ago

Kernel launch overhead (I'm speaking for CUDA here, not OpenCL which I know nothing about :) ) is definitely significant the smaller the work performed is, as in RNNs. Fusion would help here, but the problem is that there are a wide variety of kernels in use, and being able to fuse any pairs of them or combinations of them is hard; especially in cutorch, where the size of the kernel CTA varies widely for various reasons. When the workload is so small too, it's also highly sensitive to CPU/GPU synchronization points (which stall the kernel launch pipeline).

One thing my solution is working on is to make it easier to avoid synchronization points (and mercilessly hunting down sync points that do exist).

The Torch tensor API also makes it hard to avoid synchronization points; e.g., :mean() is a sync point that copies a value back to the CPU, and there are lots of similar functions. Lazy evaluation is not in Torch's vocabulary. I'm still thinking of a way to address this, to keep intermediate results on the GPU where possible for reuse and perhaps conditional launch/evaluation. That's pretty far off though. It might have to require writing a new base tensor API and mapping the user-level Matlab-like API to that, but the underlying layer is lazy and fusion-friendly. All of this would be tons of work though.

hughperkins commented 9 years ago

The Torch tensor API also makes it hard to avoid synchronization points; e.g., :mean() is a sync point that copies a value back to the CPU, and there are lots of similar functions.

Yes, this is high on my list of suspicions too actually, as the bottleneck in opencl char-rnn. I think I'm going to start doing a bit more benchmarking first though instead of guessing again and again :-P

As far as addressing this particular issue, I think it's relatively easy actually. The technology already exists in cutorch , you see that parameter outOnDevice? I think an easy option could be to simply feed this parameter all the way back into the Torch api, like:

aTensor = torch.Tensor{3,5,2}
aScalar = aTensor:sum()
anotherTensor = aTensor:sum(true)

(edit: or maybe just detect that the output is a tensor, and use that:

aTensor = torch.Tensor{3,5,2}
resTensor = torch.Tensor()
resTensor:sum(aTensor)

) (hmmm... mind you ... then have to get this scalar back into torch somehow, but again, not insurmountable, but better get some benchmarks first...)

hughperkins commented 9 years ago

Started a project for benchmarking https://github.com/hughperkins/cltorch-benchmarking it's cltorch-specific for now, since it uses EasyCL, which is opencl-specific. Could be generalized though plausibly to cuda.

wickedfoo commented 9 years ago

Ha, I'm actually the person who wrote the THCReduceAll and added the outOnDevice code. Nothing uses it yet (at least, I haven't used it) :)

But, you'd have to have a system to allow other parts of code take those values as inputs, the ability to evaluate conditionals on the GPU and run a kernel based on it (or not; the kernel could early exit out if its device-local conditional flag wasn't set, or something). And, an ability to be able to chain together these expressions in Torch/Lua land, what that would look like, etc.

It's pretty clear to me that it's an entirely different language that would be exposed to the user, unless you wanted to do some pretty crazy compiler analysis.

hughperkins commented 9 years ago

Ha, I'm actually the person who wrote the THCReduceAll and added the outOnDevice code. Nothing uses it yet (at least, I haven't used it) :)

Haha, ok :-)

And, an ability to be able to chain together these expressions in Torch/Lua land, what that would look like, etc.

I guess that basically, anywhere there is a scalar input and output, there needs to be an option to provide alternatively a tensor input/output?

hughperkins commented 9 years ago

For the reduceall bit, i guess we can baby-step through this, eg starting by adding a lua-side parameter, like ..., returnAsTensor=false)

hughperkins commented 9 years ago

Hmmm, I reckon we should move the getDeviceScratchSpace bit out of ReduceAll, and into TensorMath. That way ReduceAll is fairly clean, and TensorMath can handle the outOnDevice argument that we receive from lua.

wickedfoo commented 9 years ago

Some additional scratch space would be needed if the output were eventually put into a tensor. The device scratch space is used for two-pass reductions in cutorch, not just the final result. It's not just 1 float, it's # of SMs * 4 floats or something like that.

This could probably be handled, by performing a temp allocation if the scratch space isn't available. It's the same strategy that Thrust uses, except with the scratch space I do not churn memory allocations (with the resultant sync on cudaFree).

hughperkins commented 9 years ago

I was thinking that, the tensor would just be a normal tensor argument, as for eg a:abs(b), comes from the lua side, dont need the scratch space in fact.

wickedfoo commented 9 years ago

The final output can be a tensor that comes from the Lua side, but the scratch space is used for more than that.

The scratch space size I require is dependent upon the specifics of the GPU hardware in use for two-pass reduction.

hughperkins commented 9 years ago

two-pass reduction needs scratchspace (at least in the current implementation), but I'm not sure why you need scratch space to store a standard tensor?

This is how I see tihs working:

a = torch.ClTensor(30,40):uniform()
b = torch.ClTensor()   -- b is a tensor
b:sum(a)    -- still a tensor, receives the result
a:div(b)      -- feed to next operation, which receives a tensor of dimension 1

(Edit: compared to currently, it would be:

a = torch.ClTensor(30,40):uniform()
b = a:sum()  -- a host-side scalar, copied from device, via scratch
a:div(b)

)

hughperkins commented 9 years ago

@wickedfoo This branch contains an implementation of b:sum(a), https://github.com/hughperkins/cltorch/tree/out-on-device . Test as follows:

  a = torch.ClTensor(20,30):uniform()
  b = torch.ClTensor()
  b:sum(a)
  print('b\n', b)
  print('a:sum()\n', a:sum())

Result:

Using NVIDIA Corporation platform: NVIDIA CUDA
Using device: GeForce 940M
b
     297.2885
[torch.ClTensor of size 1]

a:sum()
    297.28851318359

hughperkins commented 9 years ago

Did a small poc for a zero-dimensional tensor, at https://github.com/hughperkins/cltorch/tree/zerodim-tensor-poc . Seems like it doesnt cause any major fundamental conflicts:

  a = torch.ClTensor(torch.LongStorage())
  a:sum(torch.ClTensor({3,5,2}))
  print('a\n', a)
  a_float = torch.Storage(1)
  a_float:copy(a:storage())
  print('a_fl', a_float)

gives:

a
    [torch.ClTensor with no dimension]

a_fl    
 10
[torch.DoubleStorage of size 1]

Yes, it would be nicer if printing a zero-dimensional tensor displayed the actual value. As we can see, this is technically possible, in the worst case by copying via a float storage, and probably can be automated a bit.

(Edit: relevant commit: https://github.com/hughperkins/cltorch/commit/1d45998e99edc7fd6a8f558b59d97659fbab8616 )

hughperkins commented 9 years ago

added :s() method, which you can apply on a point tensor, and it will return the scalar value of that point tensor. Use like:

require 'cltorch'
a = torch.ClTensor()
a:sum(torch.ClTensor({3,5,4})
print('a:s()', a:s())

https://github.com/hughperkins/cltorch/commit/c785e8a221e89dbc5db04ec442cab2a7f0fdd0a7

hughperkins commented 9 years ago

Division by point tensor works now :-)

  c = torch.ClTensor(3,4):uniform()
  a = c:clone()
  print('a:div(a:sum())', a:div(a:sum()))

  c_sum = torch.ClTensor()
  c_sum:sum(c)
  print('c_sum', c_sum)
  print('c:div(c_sum)', c:div(c_sum))

Result:

a:div(a:sum())   0.1489  0.0821  0.0424  0.1332
 0.0527  0.1113  0.0893  0.0907
 0.0550  0.0545  0.0944  0.0453
[torch.ClTensor of size 3x4]

c_sum   [torch.ClTensor with no dimension]

c:div(c_sum)     0.1489  0.0821  0.0424  0.1332
 0.0527  0.1113  0.0893  0.0907
 0.0550  0.0545  0.0944  0.0453
[torch.ClTensor of size 3x4]

commits:

hughperkins commented 9 years ago

Think I've exhausted all the nngraph-agnostic optimizations, ie per-module improvements. Looking again at nngraph-level fusion. (Edit: PS, cleaned a bunch of my earlier posts from this thread, to tidy it up a bit) (Edit2: I guess I can start with forward-only Narrow fusion, and then look at other fusions later)

hughperkins commented 9 years ago

Just out of curiosity, if I could get nngraph to fuse various nodes, reckon anywhere could publish a paper for that, or basically pure engineering? If I have to guess, I guess: engineering, but it's mostly a guess, hence asking :-)

hughperkins commented 9 years ago

Update: kernel fusion is mostly working. Just got a few ... "edge"-cases :-P to work out.

soumith commented 9 years ago

@hughperkins that's exciting. i'm taking a look to see what you've done :)

Edit: I guess it hasn't been pushed yet.

hughperkins commented 9 years ago

Hmmm, yeah, I'm in two minds about pushing it until it demonstrably works on char-rnn, 1. in case I suddenly figure out a way of turning it into a paper, 2. I dont want eg Google suddenly adding some obvious bit to it, and then patenting it :-P

soumith commented 9 years ago

I think the cautiousness is definitely warranted (more for correctness and for research). I doubt Google would do that :D.

Another set of folks who already fuse kernels for similar stuff are Nervana in their Neon framework. They introduce "MOPs" which they automatically analyze/fuse. I never tried their stuff beyond simple trivial benchmarking.

hughperkins commented 9 years ago

Ok.

As far as papers... I suppose the thing that makes a paper a paper, rather than just a computer program, is that it is general enough to be applied to other areas. For example, group theory and map-reduce are both quite general concepts, and the key invention in for example map-reduce is not running a program on lots of computers, but creating a general, simple framework which can be applied in many situations.

Similarly, presumably saying "I wrote a program to reduce kernels in Torch" is not very paperable, it's kind of like something one could plausibly include in a masters thesis, whereas if I could abstract a key idea or two that made it possible to fuse the kernels, maybe that could become a paper, assuming it's original?

hughperkins commented 9 years ago

Hmmm, seems there are a bunch of existing papers actually:

(Edit: hmmm, the third one, http://www.researchgate.net/publication/220244769_Automatic_fusions_of_CUDA-GPU_kernels_for_parallel_map , is a pretty good description of what I'm doing for char-rnn , except it goes beyond where I am at the moment, since it's actually trying to choose appropriate sub-groups, to fuse, rather than just lumping as much stuff together into one giant molasses kernel :-P )

hughperkins commented 9 years ago

Since the Fousek et al paper covers everything I have to say, and then some, I've pushed my branch up to https://github.com/hughperkins/clnn/tree/fused-modules commits here: https://github.com/hughperkins/clnn/commits/fused-modules

The code is a bit of a mess right now ... Key concepts are:

scope for now is to fuse anything that can be converted to an Apply, that is to a per-element operation
- this includes: Tanh, Sigmoid, CAddTable, and CMulTable
- Linear is not included, so we will have two islands of mergeable Applies, separated by Linear modules, in a default, 2-layer char-rnn
originally there were some Narrows in char-rnn, and the Narrows both split some of the Applies, and are not themselves easily fusible, since very general
- however, a PR last week reordered the Narrows, to put the Applies together, and then swapped the Narrows for a Split https://github.com/karpathy/char-rnn/pull/68
- PR had the side-effect of reducing iteration time on cutorch by 20%, and on cpu by 30%, making this an easy merge :-)

That's it for high level concepts. In terms of how char-rnn looks, here is a picture of char-rnn prior to last week's PR. You can see that Nodes 16, 18 and 28 in this diagram split the Apply island:

Here is char-rnn after moving rearranging the Narrows, and changing to Split. You can see that there is a nice Apply island from nodes 16, 18, 19, down to node 3:

gmodule init fg

In terms of technical implementation, creating the dynamic kernels is handled by the UserKernel implementation I created for Sergey a couple of weeks ago https://github.com/hughperkins/cltorch#custom-user-kernels Then, a new network module nn.Apply sits on top of this.

The fusion process looks as follows:

receive a gmodule object. This will probably be received via a monkey-patched nn.gModule:cl() function
convert from gmodule to an 'ngh' nodegraph. Name to be improved later. Converted using https://github.com/hughperkins/clnn/blob/fused-modules/nodeGraphHelper.lua#L12
the ngh graphs are optimized for being easy to modify:
- added node.parents map
- removed node.data.mapindex, since redundant, and increases complexity
- removed the birectional entries in node.children, ie we have node.children[1], node.children[2], but not node.children[child1], etc ...
- added an id to each node.data object, which are more invariant than the node objects
all Apply type modules, such as CAddTable, Tanh etc are replaced by nn.Apply objects
pairs of adjacents Apply nodes are iteratively fused
finally the graph is converted back to a gmodule object, for normal processing

Tests are in https://github.com/hughperkins/clnn/blob/fused-modules/test-fusion.lua Run by doing:

th test-fusion.lua

Or for a specific test, eg:

TESTS=fusiontests.forward2inputs2 th test-fusion.lua

What's working:

convert gmodule to ngh graph
fusion (but: one known bug to fix, see below)
kernel generation
convert ngh graph to gmodule
running kernel, by calling :forward, and :updateGradInput, on gmodule

Known issues:

if one node n1 feeds into both of two nodes n2 and n3, if n2 fuses with n1, the fused node n2.n1 should now have two unique outputs: one being the original output from the original n1, one being the output of n1 followed by n2. Currently it continues to produce one output, being n2.n1, which is wrong

soumith commented 9 years ago

This is really nice work. I really like the fact that the fused graph is still somewhat in lua-land and won't have crazy debugging issues.

hughperkins commented 9 years ago

Thank-you! :-)

torch / nngraph

Crazy idea: make an nngraph 'optimizer' #60