hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0

How to enable time check for all kernel calls during mnist train? #66

Closed fsword73 closed 7 years ago

fsword73 commented 8 years ago

Hi, Hugh. I know one kernel is very slow in mnist: it accounts for more than 50% of the workload in mnist training. But I don't know how to enable time checking during the mnist train; in other words, I have no way to find the exact kernel name. Is there any method to record every kernel's runtime, so that I can work on the top-10 kernels and optimize them? Basically I have an idea to speed up the forward filter. Currently one operation of sum += images[x,y] * filter[s,t] needs 2 buffer loads + 5 muladds; it can be optimized to no buffer load, just 1 local-memory read + 1 muladd. I will have the final version next Monday.
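
Per-kernel times can also be gathered with standard OpenCL event profiling, independent of whatever timing hooks DeepCL itself exposes. A minimal sketch, assuming direct access to the cl_context, cl_device_id and a built cl_kernel (the names and launch sizes here are illustrative, not DeepCL's API):

    // Minimal sketch: time one kernel launch via OpenCL event profiling.
    // Assumes context, device, kernel and the launch sizes already exist;
    // DeepCL's own wrappers are not used here.
    #include <CL/cl.h>
    #include <iostream>

    void timeKernel(cl_context context, cl_device_id device, cl_kernel kernel,
                    size_t globalSize, size_t localSize) {
        cl_int err = 0;
        // The queue must be created with profiling enabled.
        cl_command_queue queue = clCreateCommandQueue(
            context, device, CL_QUEUE_PROFILING_ENABLE, &err);

        cl_event event = 0;
        err = clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                                     &globalSize, &localSize, 0, nullptr, &event);
        clWaitForEvents(1, &event);

        cl_ulong start = 0, end = 0;   // timestamps in nanoseconds
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, nullptr);
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, nullptr);
        std::cout << "kernel time: " << (end - start) / 1e6 << " ms" << std::endl;

        clReleaseEvent(event);
        clReleaseCommandQueue(queue);
    }

Summing these times per kernel name over an epoch would give the top-10 list described above.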

hughperkins commented 8 years ago

> Shall we consider big chips (> 2000 CUDA cores) only?

Reasonable point :-) I will check on a Titan X.

hughperkins commented 8 years ago

Hi fsword73, here are results using a Titan X:

current `master`:

...
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX TITAN X
...
after epoch 1 11870 ms
 training loss: 18335.7
 train accuracy: 54072/60000 90.12%
test accuracy: 9775/10000 97.75%
after tests 155 ms
record epoch=1
wrote weights to file, filesize 173KB

after epoch 2 4342 ms
 training loss: 5642.55
 train accuracy: 58234/60000 97.0567%
test accuracy: 9841/10000 98.41%
after tests 155 ms
record epoch=2
wrote weights to file, filesize 173KB

after epoch 3 4378 ms
 training loss: 4197.25
 train accuracy: 58707/60000 97.845%
test accuracy: 9848/10000 98.48%
after tests 155 ms
record epoch=3
wrote weights to file, filesize 173KB

`fsword73-kernel1`:

after epoch 1 5960 ms
 training loss: 19150
 train accuracy: 53963/60000 89.9383%
test accuracy: 9758/10000 97.58%
after tests 170 ms
record epoch=1
wrote weights to file, filesize 173KB

after epoch 2 2701 ms
 training loss: 5671.87
 train accuracy: 58263/60000 97.105%
test accuracy: 9812/10000 98.12%
after tests 158 ms
record epoch=2
wrote weights to file, filesize 173KB

after epoch 3 2663 ms
 training loss: 4305.7
 train accuracy: 58685/60000 97.8083%
test accuracy: 9884/10000 98.84%
after tests 171 ms
record epoch=3
wrote weights to file, filesize 173KB

hughperkins commented 8 years ago

Your new kernels are nearly twice as fast. That's really nice! :-)

hughperkins commented 8 years ago

We should probably try a couple of other geometries before merging the new kernels into the master branch. The benchmarks that everyone looks at currently are soumith's convnet-benchmarks. A lot of GPU guys hang out there, from mxnet, theano, tensorflow, nervana, torch, caffe...

I think it'd be good to check the speed on the alexnet model.

There's a script to check this here: https://github.com/soumith/convnet-benchmarks/tree/master/deepcl

(Edit: seems like the benchmarking script is in DeepCL repo, here: https://github.com/hughperkins/DeepCL/blob/master/python/benchmarking/deepcl_benchmark2.py )

hughperkins commented 8 years ago

Ok. I tried running the alexnet geometries. Currently I'm getting:

calcGradWeights try kernel 6
   ... seems valid
BackpropWeightsAuto: kernel 6 this instance cant be used: OpenCL error, code: -5

... for all geometries

I tested as follows:

git clone https://github.com/soumith/convnet-benchmarks.git
cd convnet-benchmarks/deepcl
bash install.sh
source DeepCL/env/bin/activate
source DeepCL/dist/bin/activate.sh
cd DeepCL
git checkout fsword73-kernel1
cd build
cmake ..
make -j 4 install
cd ../python
python setup.py install
cd ../..
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith1
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith2
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith3
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith4
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith5

Thoughts?

fsword73 commented 8 years ago

I will take a quick look next Monday

hughperkins commented 8 years ago

Ok, cool. Sounds good :-)

fsword73 commented 8 years ago

It is very strange. Kernel idx 6 is the CPU one, not GPU.

if (idx == 6) { return new BackpropWeightsCpu(cl, layerDimensions); }

hughperkins commented 8 years ago

Sure, but idx 4 and 5 are yours? https://github.com/hughperkins/DeepCL/blob/fsword73-kernel1/src/conv/BackpropWeights.cpp#L82-L85

    if(idx == 4) {
        cout << "fword73 kernel" << endl;
        return new BackpropWeightsFsword73(cl, layerDimensions);
    }
    if (idx == 5) {
        cout << "fword73 kernel BatchSize" << endl;
        return new BackpropWeightsFsword73_BatchSize(cl, layerDimensions);
    }

hughperkins commented 8 years ago

Oh, do you mean, why is there a CPU one in there? Hmmm. I can't remember, to be honest. It does seem strange, actually. Feel free to remove it.

fsword73 commented 8 years ago

I have completed the 1st version of forward 3x3. It reduces the cost to 3.5 cycles per SUM += image * filter, which is very close to the theoretical peak. I hope to check it in within 2 weeks. The intern is still pending, so I haven't had much time for it recently.

hughperkins commented 8 years ago

Nice! Sounds good :-)

hughperkins commented 8 years ago

Hi. Looking back through this (I've been kind of distracted by other things mostly...):

To what extent do the alexnet geometries work for you?

fsword73 commented 8 years ago

Hi,

  1. What is the command line to run alexnet?
  2. CPU kernel 6 is because of backward.cpp. It should be fixed in my local tree.
  3. Expect to continue on it with more time next week.

hughperkins commented 8 years ago

> 1. What is the command line to run alexnet?

I run it from python... but that might complicate things... to run it without needing python... hmmm... maybe use the manifest loader? https://github.com/hughperkins/DeepCL/blob/master/doc/Loaders.md#jpegs

As far as the network itself, there are 5 layers, see figure 2 of www.cs.toronto.edu/~fritz/absps/imagenet.pdf , but an easier way to see the geometry is perhaps to look at https://github.com/soumith/convnet-benchmarks/blob/master/deepcl/deepcl_benchmark.py#L17-L53 (note that alexnet normally uses stride 4 for the first layer, different from what is shown in this link)

> 2. CPU kernel 6 is because of backward.cpp. It should be fixed in my local tree.

Ok

> 3. Expect to continue on it with more time next week.

Cool. Sounds good :-)

fsword73 commented 8 years ago

New backward/forward tiled modes are added.
Next week: 1) add 1x1 mode for backward; 2) run alexnet.

hughperkins commented 8 years ago

Coooolll... :-) Sounds good :-)

fsword73 commented 8 years ago

python deepcl_benchmark2.py soumith2

How to debug the error "MemoryError: bad array new length"?

Traceback (most recent call last):
  File "deepcl_benchmark2.py", line 284, in <module>
    go(chosen_runs)
  File "deepcl_benchmark2.py", line 264, in go
    time_layer(num_epochs, label=label, batch_size=batch_size, net_string=net_string)
  File "deepcl_benchmark2.py", line 101, in time_layer
    net.forward( images )
  File "NeuralNet.pyx", line 31, in PyDeepCL.NeuralNet.forward (G:\jianyang\deepcl_fork2\DeepCL\python\PyDeepCL.cxx:7850)
MemoryError: bad array new length
clblas teardown

hughperkins commented 8 years ago

Hmmm, it probably means not enough memory, to be honest... you could try a different layer, e.g. soumith3, or you could get access to a machine with more memory. By the way, I'm fairly sure it is the host that lacks memory, rather than the GPU.

fsword73 commented 8 years ago

It seems so. If I disable several Forward/backward/backpropweights kernels, it can run.
I understand the geometry of alexnet now. It is much easier to achieve 2 instructions per SUM += a*b. So in theory, for the backward/forward/backpropweights of the 3x128x128 input with 96 filters of 11x11 (96C11), the cost will be: images (3x128x128) * 96 filters * (11x11 taps) * (128 batch) * 2 / (2048 cores @ 1 GHz) ≈ 71 ms.

So this is the goal
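
As a sanity check on the arithmetic above (a sketch only; the 2048 cores @ 1 GHz and 2 instructions per multiply-accumulate are the stated assumptions, not measured figures):

    // Rough cost estimate for a 3x128x128 input, 96 filters of 11x11, batch 128.
    #include <cstdio>

    int main() {
        double macs = 3.0 * 128 * 128   // input volume per image
                    * 96                // output filters
                    * 11 * 11           // filter taps
                    * 128;              // batch size
        double instructions = macs * 2;                // 2 instructions per MAC
        double seconds = instructions / (2048 * 1e9);  // 2048 cores @ 1 GHz
        printf("estimated time: %.1f ms\n", seconds * 1000);  // prints ~71.4 ms
        return 0;
    }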

merceyz commented 7 years ago

What is currently blocking the working parts of this from getting merged? / What isn't working?

hughperkins commented 7 years ago

I'm not sure. Do you want to take a look, and create a PR for each unique contribution? (use git 'cherry-pick' on fsword73's commits, perhaps?).

merceyz commented 7 years ago

I was hoping to get @fsword73 's attention and have him finish this.

As far as I can tell his changes are the kernel(s) and the timer. I merged the changes to the timer into my local version of DeepCL a long time ago and had no problems with it. I could create a PR for that if you'd like.

merceyz commented 7 years ago

GPU: AMD MSI R9 280x
Netdef: 16c3z-tanh-mp2-32c3z-tanh-mp2-64c3z-tanh-mp2-128c3z-tanh-mp2-256c3z-tanh-mp2-512c3z-tanh-mp2-128n-tanh-128n-tanh-2n
Input: 96x96x3
Image count: 94 757
BatchSize: 128

I extracted the kernels from https://github.com/hughperkins/DeepCL/blob/fsword73-kernel1/cl and used the BackpropWeightsFsword73* files from https://github.com/hughperkins/DeepCL/tree/fsword73-kernel1/src/conv (removed the string version of the kernel and used an ifstream to read it from the cl files during startup)
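
The ifstream approach could look roughly like this; a minimal sketch only, with an illustrative file path rather than the exact code used:

    // Minimal sketch: load an OpenCL kernel source file into a string at startup,
    // instead of using the string constant compiled into the binary.
    #include <fstream>
    #include <sstream>
    #include <stdexcept>
    #include <string>

    std::string loadKernelSource(const std::string &path) {
        std::ifstream file(path);
        if (!file) {
            throw std::runtime_error("could not open kernel file: " + path);
        }
        std::stringstream buffer;
        buffer << file.rdbuf();   // read the whole .cl file
        return buffer.str();
    }

    // usage (path is illustrative):
    // std::string source = loadKernelSource("cl/BackpropWeightsFsword73.cl");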

Epoch time (from epoch 2)

Before: 530 272 ms
After:  246 572 ms
~53% reduction in epoch time (~2.2x faster)

Network overview: the kernel(s) were the fastest on layers 2, 5, 11, and 24.

On one of the layers the kernel failed to compile with the error -63 (CL_INVALID_GLOBAL_WORK_SIZE)

Full training log: Training log.txt

Kernel 5 is BackpropWeightsFsword73
Kernel 6 is BackpropWeightsFsword73_BatchSize