mdda opened this issue 10 years ago
From what I see in jpcnn, there's a lot of machinery in jpcnn.js that's duplicative of what you have in convnet.js.
Most of the calculations are still in JavaScript (in particular, all the WebGL results seem to be brought back into JavaScript Arrays, rather than being held in GPU memory for the next layer).
The two big sets of WebGL'd stuff are 'GEMM' and 'max' / 'maxPatch', with the GEMM code having two versions: (a) one that deals with (n,n) matrices of Float32s, and (b) one where the (n,n) matrices are instead represented as (n/4,n/4) matrices of (r,g,b,a:Float32). Given the effort spent packing/extracting the x4 format, I suspect it's a decent win to keep the data packed this way within the GPU (though doing this for convnet.js may be better left for v2).
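For illustration, this is one generic way to get Float32 data into an RGBA-packed float texture, so that each texel carries four consecutive values (a sketch of the idea only; jpcnn's actual layout and helper names may differ):

```javascript
// Sketch only: upload an n*n Float32 matrix so each RGBA texel holds four
// consecutive values. Requires the OES_texture_float extension in WebGL 1.
function uploadPackedMatrix(gl, data, n) {
  var tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
  // width n/4, height n: four floats per texel along each row
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, n / 4, n, 0, gl.RGBA, gl.FLOAT, data);
  return tex;
}
```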
The array packing issues are a particularly tricky problem for convnet.js, since you've already got everything nicely packed into 1-dimensional arrays (buffers), with the various strides & offsets set up for packing multiple matrices within the array. So it would be a pity to have to re-copy everything into the format expected by jpcnn. (Another issue: convnet.js works with Float64Array, and WebGL is only going to do Float32...)
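(The mechanical part of that last issue is easy, at the cost of precision; something along these lines would be needed before any GPU upload. The helper name here is just illustrative.)

```javascript
// Copy a convnetjs Vol's Float64 weights into a Float32Array (lossy truncation).
function volToFloat32(vol) {
  return new Float32Array(vol.w);
}
```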
In terms of the WebGL machinery required, utils/webgl.js does most of the gl-specifics. Do you know whether this originated at JetPac?
Hi Martin, I think this is all from JetPac, but Pete is now at Google. By the way, when I hacked around with JPCNN I was able to get exactly the same outputs from ConvLayer and jpcnn's GEMM. Also note that there is a trivial way to convert between them, because we both store values in the same packing order in the 1D array:
```javascript
// Wrap a convnetjs Vol as a jpcnn Buffer (dims are [num, width, height, channels]);
// the underlying data array is shared, not copied.
var VolToBuffer = function(vol) {
  var buf = new Buffer([1, vol.sx, vol.sy, vol.depth]);
  buf._data = vol.w;
  return buf;
};

// Wrap a jpcnn Buffer back into a convnetjs Vol.
var BufferToVol = function(buf) {
  var dims = buf._dims; // sigh
  console.assert(dims._dims[0] === 1); // dim 0 is num, but convnetjs has batches of 1
  var vol = new convnetjs.Vol(dims._dims[1], dims._dims[2], dims._dims[3]);
  vol.w = buf._data;
  return vol;
};
```
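To make that concrete, a quick round-trip check (assuming both convnetjs and jpcnn are loaded on the page) would look like:

```javascript
// The conversion shares the underlying array, so no data is copied or rearranged.
var v = new convnetjs.Vol(4, 4, 3);
var v2 = BufferToVol(VolToBuffer(v));
console.assert(v2.sx === v.sx && v2.sy === v.sy && v2.depth === v.depth);
console.assert(v2.w === v.w); // same backing array, not a copy
```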
I can also get identical outputs from the Conv layer and from GEMM. Taking a randomly initialized conv layer, here is how it's achieved. First, converting the Conv weights and biases:
```javascript
var opt = { in_sx: 256, in_sy: 256, in_depth: 3, sx: 7, filters: 256, stride: 3, pad: 0 };
var layer = new convnetjs.ConvLayer(opt);

// Copy the convnetjs kernels into a single jpcnn Buffer.
var fvals = opt.sx * opt.sx * opt.in_depth;   // values per filter
var nf = layer.filters.length;
var JPkernels = new Buffer([fvals, opt.filters]);
var JPbiases = new Buffer([1, opt.filters]);
var n = 0;
for (var j = 0; j < fvals; j++) {
  for (var i = 0; i < nf; i++) {
    JPkernels._data[n] = layer.filters[i].w[j];
    n++;
  }
}

// Randomize a few biases (convnetjs initializes them to 0 by default), then wrap them.
layer.biases.w[0] = Math.random() - 0.5;
layer.biases.w[1] = Math.random() - 0.5;
layer.biases.w[2] = Math.random() - 0.5;
JPbiases = VolToBuffer(layer.biases);
```
and then the two are equivalent, when `vol` is the input volume:
```javascript
v0 = layer.forward(vol);                                   // convolve in JS
buf = VolToBuffer(vol);
b0 = matrixCorrelate(buf, JPkernels, 7, opt.filters, 3);   // convolve via jpcnn GEMM
matrixAddInplace(b0, JPbiases, 1.0);                       // add the biases
v1 = BufferToVol(b0);
```
so v0 = v1, but the latter is much faster with WebGL magic :) So one option is to implement a forward_GPU which translates everything from Vols to Buffers, and then back in backward_GPU. There would be a bit of overhead in filling up the kernel buffer every forward pass, but maybe this can be memoized: if the kernels were not updated by a backward pass, they don't need to be recomputed.
So I hacked on this a bit today and created a second target, convnet-webgl.js, which is the same build as vanilla convnetjs but also includes jpcnn and overwrites ConvLayer with a WebGL version (backing up the old ConvLayer into ConvLayerCPU). I also wrote Jasmine tests to verify that it returns the same result (both forward and backward pass). It's not committed anywhere, but temporarily I put it up here:
http://cs.stanford.edu/people/karpathy/convnetjs/build/convnet-webgl.js
You can ctrl+F your way to, for example, the syncKernels_ function, which copies the convnetjs filters/biases over to the jpcnn representation. I also put up a speed test here:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/speedtest.html
This runs the CPU and then the GPU version for 10 iterations, following exactly the setup for layer L1 in convnet-benchmarks.
The issue is that it's unexpectedly slow! I get an average running time of 1600ms for the CPU version and 336ms with WebGL, a speedup of only ~5x. I expected much more dramatic improvements. Extrapolating this to a batch of 128 examples, as used in convnet-benchmarks, gives ~40 seconds, while Caffe gets 2 seconds.
I'm not sure why it's that slow. I think the issue is that Caffe does the whole batch in a single pass on the GPU, while here each example is processed on its own (since ConvNetJS works on batches of 1 by default). Another issue might be that I have a super-crappy GPU on my machine, but a friend with a newer GPU (a GTX 660) isn't getting much faster either (about 300ms). I'll have to stare at this a bit more.
I'm guessing that you're right about single examples vs. 128 at once being a big source of the speed difference.
There's also the possibility (which I haven't checked out) that the convolution matrix could be encouraged to live in more 'local' GPU memory in WebGL (local accesses are much quicker than 'global' ones, though they have to be explicit on the GPU).
Similarly, the rows/columns of the matrices could be laid out so that each is sequential in memory (transpose the convolution matrix?), so that SIMD loads pull them in in blocks.
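The transpose part is cheap to try; a generic sketch (done once per weight update, not per forward pass) that would make each filter's weights contiguous:

```javascript
// Generic row-major transpose so that the values each output filter reads
// become one contiguous run of memory.
function transpose(a, rows, cols) {
  var t = new Float32Array(rows * cols);
  for (var r = 0; r < rows; r++) {
    for (var c = 0; c < cols; c++) {
      t[c * rows + r] = a[r * cols + c];
    }
  }
  return t;
}
```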
(I'm sure there's other stuff too, like the (r,g,b,a) packing that JPCNN is capable of; there's probably a good reason they included it.)
Sorry for not having responded earlier : I've been a bit snowed under with 'stuff' here.
@Maratyszcza do you think this could get a speedup using fusion.js?
@bhack I am not related to fusion.js project.
If you mean Furious.js, it does not aim at small matrix manipulations.
Yes, sorry, I meant Furious.js. So your "BLIS for the web" effort was not related to Furious?
@bhack It is intended to be integrated into Furious.js, but it is not there yet. And in any case, it would only help for large matrices/NDArrays
@karpathy Just wondering if there's been any progress on this, as I haven't been able to find much else about it online.
I'd like to use your deep Q-learning code to try to train an agent on Go, since most of the necessary additional code is already available online in JavaScript, but I'm not sure how slow training would be without the speedup of a GPU.
I'm thinking about pulling together all the pieces on this topic (GPU acceleration via WebGL) into a standalone library (possibly using the jetpac code as a starting point) for accelerating deep learning and other machine learning applications in the browser.
My hope is to provide a base for libraries like this one (and new entrants) to build on, and (hopefully) get performance gains that approach those found in Caffe.
Do people think this could be useful?
@waylonflinn Definitely. GPU acceleration right from the browser could lead to some very cool and useful demo pages that don't need any extra setup to run - not to mention the wide applicability if you didn't tie it to ConvNetJS specifically.
@ssampang Glad to hear it! I created a repository to track progress. No code there yet, but expect a basic implementation of gemm in the next couple of days!
I also have an update on the GPU gemm code from the Deep Belief SDK. It looks like it doesn't run in Chrome anymore (and possibly has issues in a few other browsers). If anyone is interested in troubleshooting or contributing a fix, you can track that in issue 57 of their repository. In the meantime I'm working from an alternate base implementation.
Wanted to post a quick update here for interested parties. I've fully converted the jetpac demo to use weblas, producing close to a 2x speedup. The GEMM function, in particular, is about 4x faster.
The key to this speedup was a combination of the factors that @mdda outlines above, namely: transposing the second matrix, full packing of RGBA elements, and using a single GPU instruction (dot) to process four elements at a time.
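Roughly, the inner loop of the fragment shader ends up looking like this (illustrative only, not the actual weblas source; the uniform names and the fixed loop bound are assumptions):

```javascript
// Each iteration fetches one RGBA texel from each matrix and lets dot()
// multiply-accumulate four products at once. GLSL ES 1.0 needs a constant
// loop bound, hence the early break.
var fragmentSource = [
  'precision highp float;',
  'uniform sampler2D A;    // left matrix, row-major, RGBA-packed',
  'uniform sampler2D B_t;  // right matrix, transposed, RGBA-packed',
  'uniform float K4;       // shared dimension divided by 4',
  'varying vec2 outCoord;',
  'void main() {',
  '  float sum = 0.0;',
  '  for (float k = 0.0; k < 2048.0; k += 1.0) {',
  '    if (k >= K4) break;',
  '    float x = (k + 0.5) / K4;',
  '    sum += dot(texture2D(A,   vec2(x, outCoord.y)),',
  '               texture2D(B_t, vec2(x, outCoord.x)));',
  '  }',
  '  gl_FragColor = vec4(sum);',
  '}'
].join('\n');
```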
A good next step here might be to convert the work @karpathy links to above to make use of this, and see if more impressive speedups ensue. If no one else steps in to take up the torch, I'll likely start on it in the next few days.
There's been a release of a library for converting JS functions to GPU calculations, with a fallback to plain JS, which could make it pretty easy to do complex matrix math on the GPU without having to worry about writing it in both GLSL and JS.
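If the library meant here is gpu.js, basic usage is roughly the following (a sketch only; the API details vary between versions): you write plain JS once and it is either transpiled to a fragment shader or run as-is in JS.

```javascript
// A 512x512 matrix multiply written once in JS; gpu.js compiles it to GLSL
// when WebGL is available and falls back to plain JS otherwise.
var gpu = new GPU();
var multiply = gpu.createKernel(function(a, b) {
  var sum = 0;
  for (var k = 0; k < 512; k++) {
    sum += a[this.thread.y][k] * b[k][this.thread.x];
  }
  return sum;
}).setOutput([512, 512]);

// Build two 512x512 input matrices and run the kernel.
var a = [], b = [];
for (var i = 0; i < 512; i++) {
  var rowA = [], rowB = [];
  for (var j = 0; j < 512; j++) { rowA.push(1); rowB.push(2); }
  a.push(rowA);
  b.push(rowB);
}
var c = multiply(a, b);
```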
Hi guys, I'm really interested in this. What is the current state?
Hey everyone. I'm about to resume development on my convolutional neural net library powered by weblas (an in-browser computation library powered by the GPU). Weblas was about five to ten times faster than gpu.js on the relevant functions last time I checked.
Library is here: https://github.com/waylonflinn/webnn
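For anyone who wants to poke at weblas directly, the core call is a single-precision GEMM; as far as I can tell from its README the shape is roughly the following (treat the exact signature and argument semantics as an assumption):

```javascript
// C = alpha * A * B, computed on the GPU and returned as a Float32Array.
var M = 100, N = 256, K = 147;          // e.g. output positions x filters x filter size
var A = new Float32Array(M * K);        // im2col'd input patches, row-major
var B = new Float32Array(K * N);        // kernels, row-major
var C = weblas.sgemm(M, N, K, 1.0, A, B, 0.0, null);
console.log(C.length);                  // M * N
```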
I've run some experiments with the network I have. The numbers are confusing:

- Chrome without WebGL: 44ms/tick
- Firefox without WebGL: 100ms/tick
- Chrome with WebGL: 72ms/tick
- Firefox with WebGL: 42ms/tick

o_O Where a 'tick' is essentially a little bit of game tick logic plus two forward passes for each bot.
(this is a continuation of the discussion started in #11, so it can be 'closed' cleanly)
Hmm - that is a lot of machinery to include 100-200 lines of WebGL. The least-intrusive method for including the end result (i.e. what the client sees) would be to have a separate convnet.webgl.min.js which, if it's there, sets up a webgl flag for the regular convnet.min.js to call into - or even overwrites the re-implemented methods themselves.

On the source side, however, I've got to think there's a more direct way of making the BLAS.js code ready-to-use. I also think it makes sense to go the BLAS-compatible route, since it's a standard, and one avoids having to continuously re-invent the wheel... I'll have a poke around for a cleaner set of includes.
All the Best Martin :-)