mdda opened this issue 10 years ago
From what I see in jpcnn, there's a lot of machinery in jpcnn.js that's duplicative of what you have in convnet.js.
Most of the calculations are still in JavaScript (in particular, all the WebGL results seem to be brought back into JavaScript Arrays, rather than being held in GPU memory for the next layer).
The two big sets of WebGL'd stuff are 'GEMM' and 'max' / 'maxPatch', with the GEMM code having two versions: (a) one that deals with (n,n) matrices of Float32s, and (b) one where the (n,n) matrices are instead represented as (n/4,n/4) matrices of (r,g,b,a:Float32). Given the effort spent packing/extracting the x4 format, I suspect it's a decent win to keep the data packed this way within the GPU (though doing this for convnet.js may be better left for v2).
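For illustration, this is one generic way to get Float32 data into an RGBA-packed float texture, so that each texel carries four consecutive values (a sketch of the idea only; jpcnn's actual layout and helper names may differ):

```javascript
// Sketch only: upload an n*n Float32 matrix so each RGBA texel holds four
// consecutive values. Requires the OES_texture_float extension in WebGL 1.
function uploadPackedMatrix(gl, data, n) {
  var tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
  // width n/4, height n: four floats per texel along each row
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, n / 4, n, 0, gl.RGBA, gl.FLOAT, data);
  return tex;
}
```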
The array packing issues are a particularly tricky problem for convnet.js, since you've already got everything nicely packed into 1-dimensional arrays (buffers), with the various strides & offsets set up for packing multiple matrices within the array. So it would be a pity to have to re-copy everything into the format expected by jpcnn. (Another issue: convnet.js works with Float64Array, and WebGL is only going to do Float32...)
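(The mechanical part of that last issue is easy, at the cost of precision; something along these lines would be needed before any GPU upload. The helper name here is just illustrative.)

```javascript
// Copy a convnetjs Vol's Float64 weights into a Float32Array (lossy truncation).
function volToFloat32(vol) {
  return new Float32Array(vol.w);
}
```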
In terms of the WebGL machinery required, utils/webgl.js does most of the gl-specifics. Do you know whether this originated at JetPac?
Hi Martin, I think this is all from JetPac, but Pete is now at Google. By the way, when I hacked around with JPCNN I was able to get exactly the same outputs from ConvLayer and jpcnn's GEMM. Also note that there is a trivial way to convert between them, because we both store values in the same packing order in the 1D array:
```javascript
// Wrap a convnetjs Vol as a jpcnn Buffer (dims are [num, width, height, channels]);
// the underlying data array is shared, not copied.
var VolToBuffer = function(vol) {
  var buf = new Buffer([1, vol.sx, vol.sy, vol.depth]);
  buf._data = vol.w;
  return buf;
};

// Wrap a jpcnn Buffer back into a convnetjs Vol.
var BufferToVol = function(buf) {
  var dims = buf._dims; // sigh
  console.assert(dims._dims[0] === 1); // dim 0 is num, but convnetjs has batches of 1
  var vol = new convnetjs.Vol(dims._dims[1], dims._dims[2], dims._dims[3]);
  vol.w = buf._data;
  return vol;
};
```
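To make that concrete, a quick round-trip check (assuming both convnetjs and jpcnn are loaded on the page) would look like:

```javascript
// The conversion shares the underlying array, so no data is copied or rearranged.
var v = new convnetjs.Vol(4, 4, 3);
var v2 = BufferToVol(VolToBuffer(v));
console.assert(v2.sx === v.sx && v2.sy === v.sy && v2.depth === v.depth);
console.assert(v2.w === v.w); // same backing array, not a copy
```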
I can also get identical outputs from the Conv layer and from GEMM. Taking a randomly initialized conv layer, here is how it's achieved. First, converting the Conv weights and biases:
```javascript
var opt = { in_sx: 256, in_sy: 256, in_depth: 3, sx: 7, filters: 256, stride: 3, pad: 0 };
var layer = new convnetjs.ConvLayer(opt);

// Copy the convnetjs kernels into a single jpcnn Buffer.
var fvals = opt.sx * opt.sx * opt.in_depth;   // values per filter
var nf = layer.filters.length;
var JPkernels = new Buffer([fvals, opt.filters]);
var JPbiases = new Buffer([1, opt.filters]);
var n = 0;
for (var j = 0; j < fvals; j++) {
  for (var i = 0; i < nf; i++) {
    JPkernels._data[n] = layer.filters[i].w[j];
    n++;
  }
}

// Randomize a few biases (convnetjs initializes them to 0 by default), then wrap them.
layer.biases.w[0] = Math.random() - 0.5;
layer.biases.w[1] = Math.random() - 0.5;
layer.biases.w[2] = Math.random() - 0.5;
JPbiases = VolToBuffer(layer.biases);
```
and then the two are equivalent, when `vol` is the input volume:
```javascript
v0 = layer.forward(vol);                                   // convolve in JS
buf = VolToBuffer(vol);
b0 = matrixCorrelate(buf, JPkernels, 7, opt.filters, 3);   // convolve via jpcnn GEMM
matrixAddInplace(b0, JPbiases, 1.0);                       // add the biases
v1 = BufferToVol(b0);
```
so v0 = v1, but the latter is much faster with WebGL magic :) So one option is to implement a forward_GPU which translates everything from Vols to Buffers, and then back in backward_GPU. There would be a bit of overhead in filling up the kernel buffer every forward pass, but maybe this can be memoized: if the kernels were not updated by a backward pass, they don't need to be recomputed.
So I hacked on this a bit today and created a second target, convnet-webgl.js, which is the same build as vanilla convnetjs but also includes jpcnn and overwrites ConvLayer with a WebGL version (backing up the old ConvLayer into ConvLayerCPU). I also wrote Jasmine tests to verify that it returns the same result (both forward and backward pass). It's not committed anywhere, but temporarily I put it up here:
http://cs.stanford.edu/people/karpathy/convnetjs/build/convnet-webgl.js
You can ctrl+F your way to, for example, the syncKernels_ function, which copies the convnetjs filters/biases over to the jpcnn representation. I also put up a speed test here:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/speedtest.html
This runs the CPU and then the GPU version for 10 iterations, following exactly the setup for layer L1 in convnet-benchmarks.
The issue is that it's unexpectedly slow! I get an average running time of 1600ms for the CPU version and 336ms with WebGL, a speedup of only ~5x. I expected much more dramatic improvements. Extrapolating this to a batch of 128 examples, as used in convnet-benchmarks, gives ~40 seconds, while Caffe gets 2 seconds.
I'm not sure why it's that slow. I think the issue is that Caffe does the whole batch in a single pass on the GPU, while here each example is processed on its own (since ConvNetJS works on batches of 1 by default). Another issue might be that I have a super-crappy GPU on my machine, but a friend with a newer GPU (a GTX 660) isn't getting much faster either (about 300ms). I'll have to stare at this a bit more.
I'm guessing that you're right about single examples vs. 128 at once being a big source of the speed difference.
There's also the possibility (which I haven't checked out) that the convolution matrix could be encouraged to live in more 'local' GPU memory in WebGL (local accesses are much quicker than 'global' ones, though they have to be explicit on the GPU).
Similarly, the rows/columns of the matrices could be laid out so that each is sequential in memory (transpose the convolution matrix?), so that SIMD loads pull them in in blocks.
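The transpose part is cheap to try; a generic sketch (done once per weight update, not per forward pass) that would make each filter's weights contiguous:

```javascript
// Generic row-major transpose so that the values each output filter reads
// become one contiguous run of memory.
function transpose(a, rows, cols) {
  var t = new Float32Array(rows * cols);
  for (var r = 0; r < rows; r++) {
    for (var c = 0; c < cols; c++) {
      t[c * rows + r] = a[r * cols + c];
    }
  }
  return t;
}
```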
(I'm sure there's other stuff too, like the (r,g,b,a) packing that JPCNN is capable of; there's probably a good reason they included it.)
Sorry for not having responded earlier : I've been a bit snowed under with 'stuff' here.
@Maratyszcza do you think this could get a speedup using fusion.js?
@bhack I am not related to fusion.js project.
If you mean Furious.js, it does not aim at small matrix manipulations.
Yes, sorry, I meant Furious.js. So your "BLIS for the web" effort was not related to Furious?
@bhack It is intended to be integrated into Furious.js, but it is not there yet. And in any case, it would only help for large matrices/NDArrays
@karpathy Just wondering if there's been any progress on this, as I haven't been able to find much else about it online.
I'd like to use your deep Q-learning code to try to train an agent on Go, since most of the necessary additional code is already available online in JavaScript, but I'm not sure how slow training would be without the speedup of a GPU.
I'm thinking about pulling together all the pieces on this topic (GPU acceleration via WebGL) into a standalone library (possibly using the jetpac code as a starting point) for accelerating deep learning and other machine learning applications in the browser.
My hope is to provide a base for libraries like this one (and new entrants) to build on, and (hopefully) get performance gains that approach those found in Caffe.
Do people think this could be useful?
@waylonflinn Definitely. GPU acceleration right from the browser could lead to some very cool and useful demo pages that don't need any extra setup to run - not to mention the wide applicability if you didn't tie it to ConvNetJS specifically.
@ssampang Glad to hear it! I created a repository to track progress. No code there yet, but expect a basic implementation of gemm in the next couple of days!
I also have an update on the GPU gemm code from the Deep Belief SDK. It looks like it doesn't run in Chrome anymore (and possibly has issues in a few other browsers). If anyone is interested in troubleshooting or contributing a fix, you can track that in issue 57 of their repository. In the meantime I'm working from an alternate base implementation.
Wanted to post a quick update here for interested parties. I've fully converted the jetpac demo to use weblas, producing close to a 2x speedup. The GEMM function, in particular, is about 4x faster.
The key to this speedup was a combination of the factors that @mdda outlines above, namely: transposing the second matrix, full packing of RGBA elements, and using a single GPU instruction (dot) to process four elements at a time.
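Roughly, the inner loop of the fragment shader ends up looking like this (illustrative only, not the actual weblas source; the uniform names and the fixed loop bound are assumptions):

```javascript
// Each iteration fetches one RGBA texel from each matrix and lets dot()
// multiply-accumulate four products at once. GLSL ES 1.0 needs a constant
// loop bound, hence the early break.
var fragmentSource = [
  'precision highp float;',
  'uniform sampler2D A;    // left matrix, row-major, RGBA-packed',
  'uniform sampler2D B_t;  // right matrix, transposed, RGBA-packed',
  'uniform float K4;       // shared dimension divided by 4',
  'varying vec2 outCoord;',
  'void main() {',
  '  float sum = 0.0;',
  '  for (float k = 0.0; k < 2048.0; k += 1.0) {',
  '    if (k >= K4) break;',
  '    float x = (k + 0.5) / K4;',
  '    sum += dot(texture2D(A,   vec2(x, outCoord.y)),',
  '               texture2D(B_t, vec2(x, outCoord.x)));',
  '  }',
  '  gl_FragColor = vec4(sum);',
  '}'
].join('\n');
```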
A good next step here might be to convert the work @karpathy links to above to make use of this, and see if more impressive speedups ensue. If no one else steps in to take up the torch, I'll likely start on it in the next few days.
There's been a release of a library for converting JS functions to GPU calculations, with a fallback to plain JS, which could make it pretty easy to do complex matrix math on the GPU without having to worry about writing it in both GLSL and JS.
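If the library meant here is gpu.js, basic usage is roughly the following (a sketch only; the API details vary between versions): you write plain JS once and it is either transpiled to a fragment shader or run as-is in JS.

```javascript
// A 512x512 matrix multiply written once in JS; gpu.js compiles it to GLSL
// when WebGL is available and falls back to plain JS otherwise.
var gpu = new GPU();
var multiply = gpu.createKernel(function(a, b) {
  var sum = 0;
  for (var k = 0; k < 512; k++) {
    sum += a[this.thread.y][k] * b[k][this.thread.x];
  }
  return sum;
}).setOutput([512, 512]);

// Build two 512x512 input matrices and run the kernel.
var a = [], b = [];
for (var i = 0; i < 512; i++) {
  var rowA = [], rowB = [];
  for (var j = 0; j < 512; j++) { rowA.push(1); rowB.push(2); }
  a.push(rowA);
  b.push(rowB);
}
var c = multiply(a, b);
```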
Hi guys, I'm really interested in this. What is the current state?
Hey everyone. I'm about to resume development on my convolutional neural net library powered by weblas (an in-browser computation library powered by the GPU). Weblas was about five to ten times faster than gpu.js on the relevant functions last time I checked.
Library is here: https://github.com/waylonflinn/webnn
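For anyone who wants to poke at weblas directly, the core call is a single-precision GEMM; as far as I can tell from its README the shape is roughly the following (treat the exact signature and argument semantics as an assumption):

```javascript
// C = alpha * A * B, computed on the GPU and returned as a Float32Array.
var M = 100, N = 256, K = 147;          // e.g. output positions x filters x filter size
var A = new Float32Array(M * K);        // im2col'd input patches, row-major
var B = new Float32Array(K * N);        // kernels, row-major
var C = weblas.sgemm(M, N, K, 1.0, A, B, 0.0, null);
console.log(C.length);                  // M * N
```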
I've run some experiments with the network I have. The numbers are confusing:

- Chrome without WebGL: 44ms/tick
- Firefox without WebGL: 100ms/tick
- Chrome with WebGL: 72ms/tick
- Firefox with WebGL: 42ms/tick

o_O Where a 'tick' is essentially a little bit of game tick logic plus two forward passes for each bot.
(this is a continuation of the discussion started in #11, so it can be 'closed' cleanly)
Hmm - that is a lot of machinery to include 100-200 lines of WebGL. The least-intrusive method for including the end result (i.e. what the client sees) would be to have a separate convnet.webgl.min.js which, if it's there, sets up a webgl flag for the regular convnet.min.js to call into - or even overwrites the re-implemented methods themselves.

On the source side, however, I've got to think there's a more direct way of making the BLAS.js code ready-to-use. I also think it makes sense to go the BLAS-compatible route, since it's a standard, and one avoids having to continuously re-invent the wheel... I'll have a poke around for a cleaner set of includes.
All the Best Martin :-)