hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0

How to enable time check for all kernel calls during mnist train? #66

Closed fsword73 closed 7 years ago

fsword73 commented 8 years ago

Hi, Hugh. I know one kernel is very slow in mnist: it accounts for more than 50% of the workload in mnist training. But I don't know how to enable time checking during the mnist train; in other words, I have no way to find the exact kernel name. Is there any method to record every kernel's runtime, so that I can work on the top-10 kernels and optimize them? Basically I have an idea to speed up the forward filter. Currently one operation of sum += images[x,y] * filter[s,t] needs 2 buffer loads + 5 muladds; it can be optimized to no buffer load, just 1 local-memory read + 1 muladd. I will have the final version next Monday.
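
Per-kernel times can also be gathered with standard OpenCL event profiling, independent of whatever timing hooks DeepCL itself exposes. A minimal sketch, assuming direct access to the cl_context, cl_device_id and a built cl_kernel (the names and launch sizes here are illustrative, not DeepCL's API):

    // Minimal sketch: time one kernel launch via OpenCL event profiling.
    // Assumes context, device, kernel and the launch sizes already exist;
    // DeepCL's own wrappers are not used here.
    #include <CL/cl.h>
    #include <iostream>

    void timeKernel(cl_context context, cl_device_id device, cl_kernel kernel,
                    size_t globalSize, size_t localSize) {
        cl_int err = 0;
        // The queue must be created with profiling enabled.
        cl_command_queue queue = clCreateCommandQueue(
            context, device, CL_QUEUE_PROFILING_ENABLE, &err);

        cl_event event = 0;
        err = clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                                     &globalSize, &localSize, 0, nullptr, &event);
        clWaitForEvents(1, &event);

        cl_ulong start = 0, end = 0;   // timestamps in nanoseconds
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, nullptr);
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, nullptr);
        std::cout << "kernel time: " << (end - start) / 1e6 << " ms" << std::endl;

        clReleaseEvent(event);
        clReleaseCommandQueue(queue);
    }

Summing these times per kernel name over an epoch would give the top-10 list described above.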

hughperkins commented 8 years ago

> Shall we consider big chips (> 2000 CUDA cores) only?

Reasonable point :-) I will check on a Titan X.

hughperkins commented 8 years ago

Hi fsword73, here are results using a Titan X:

current `master`:

...
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX TITAN X
...
after epoch 1 11870 ms
 training loss: 18335.7
 train accuracy: 54072/60000 90.12%
test accuracy: 9775/10000 97.75%
after tests 155 ms
record epoch=1
wrote weights to file, filesize 173KB

after epoch 2 4342 ms
 training loss: 5642.55
 train accuracy: 58234/60000 97.0567%
test accuracy: 9841/10000 98.41%
after tests 155 ms
record epoch=2
wrote weights to file, filesize 173KB

after epoch 3 4378 ms
 training loss: 4197.25
 train accuracy: 58707/60000 97.845%
test accuracy: 9848/10000 98.48%
after tests 155 ms
record epoch=3
wrote weights to file, filesize 173KB

`fsword73-kernel1`:

after epoch 1 5960 ms
 training loss: 19150
 train accuracy: 53963/60000 89.9383%
test accuracy: 9758/10000 97.58%
after tests 170 ms
record epoch=1
wrote weights to file, filesize 173KB

after epoch 2 2701 ms
 training loss: 5671.87
 train accuracy: 58263/60000 97.105%
test accuracy: 9812/10000 98.12%
after tests 158 ms
record epoch=2
wrote weights to file, filesize 173KB

after epoch 3 2663 ms
 training loss: 4305.7
 train accuracy: 58685/60000 97.8083%
test accuracy: 9884/10000 98.84%
after tests 171 ms
record epoch=3
wrote weights to file, filesize 173KB

hughperkins commented 8 years ago

Your new kernels are nearly twice as fast. That's really nice! :-)

hughperkins commented 8 years ago

We should probably try a couple of other geometries before merging the new kernels into the master branch. The benchmarks that everyone looks at currently are soumith's convnet-benchmarks. A lot of GPU guys hang out there, from mxnet, theano, tensorflow, nervana, torch, caffe...

I think it'd be good to check the speed on the alexnet model.

There's a script to check this here: https://github.com/soumith/convnet-benchmarks/tree/master/deepcl

(Edit: seems like the benchmarking script is in DeepCL repo, here: https://github.com/hughperkins/DeepCL/blob/master/python/benchmarking/deepcl_benchmark2.py )

hughperkins commented 8 years ago

Ok. I tried running the alexnet geometries. Currently I'm getting:

calcGradWeights try kernel 6
   ... seems valid
BackpropWeightsAuto: kernel 6 this instance cant be used: OpenCL error, code: -5

... for all geometries

I tested as follows:

git clone https://github.com/soumith/convnet-benchmarks.git
cd convnet-benchmarks/deepcl
bash install.sh
source DeepCL/env/bin/activate
source DeepCL/dist/bin/activate.sh
cd DeepCL
git checkout fsword73-kernel1
cd build
cmake ..
make -j 4 install
cd ../python
python setup.py install
cd ../..
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith1
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith2
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith3
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith4
python DeepCL/python/benchmarking/deepcl_benchmark2.py soumith5

Thoughts?

fsword73 commented 8 years ago

I will take a quick look next Monday

hughperkins commented 8 years ago

Ok, cool. Sounds good :-)

fsword73 commented 8 years ago

It is very strange. Kernel idx 6 is the CPU one, not GPU.

if (idx == 6) { return new BackpropWeightsCpu(cl, layerDimensions); }

hughperkins commented 8 years ago

Sure, but idx 4 and 5 are yours? https://github.com/hughperkins/DeepCL/blob/fsword73-kernel1/src/conv/BackpropWeights.cpp#L82-L85

    if(idx == 4) {
        cout << "fword73 kernel" << endl;
        return new BackpropWeightsFsword73(cl, layerDimensions);
    }
    if (idx == 5) {
        cout << "fword73 kernel BatchSize" << endl;
        return new BackpropWeightsFsword73_BatchSize(cl, layerDimensions);
    }

hughperkins commented 8 years ago

Oh, do you mean, why is there a CPU one in there? Hmmm. I can't remember, to be honest. It does seem strange, actually. Feel free to remove it.

fsword73 commented 8 years ago

I have completed the 1st version of forward 3x3. It reduces the cost to 3.5 cycles per SUM += image * filter, which is very close to the theoretical peak. I hope to check it in within 2 weeks. The intern is still pending, so I haven't had much time for it recently.

hughperkins commented 8 years ago

Nice! Sounds good :-)

hughperkins commented 8 years ago

Hi. Looking back through this (I've been kind of distracted by other things mostly...):

To what extent do the alexnet geometries work for you?

fsword73 commented 8 years ago

Hi,

  1. What is the command line to run alexnet?
  2. CPU kernel 6 is because of backward.cpp. It should be fixed in my local tree.
  3. Expect to continue on it with more time next week.

hughperkins commented 8 years ago

> 1. What is the command line to run alexnet?

I run it from python... but that might complicate things... to run it without needing python... hmmm... maybe use the manifest loader? https://github.com/hughperkins/DeepCL/blob/master/doc/Loaders.md#jpegs

As far as the network itself, there are 5 layers, see figure 2 of www.cs.toronto.edu/~fritz/absps/imagenet.pdf , but an easier way to see the geometry is perhaps to look at https://github.com/soumith/convnet-benchmarks/blob/master/deepcl/deepcl_benchmark.py#L17-L53 (note that alexnet normally uses stride 4 for the first layer, different from what is shown in this link)

> 2. CPU kernel 6 is because of backward.cpp. It should be fixed in my local tree.

Ok

> 3. Expect to continue on it with more time next week.

Cool. Sounds good :-)

fsword73 commented 8 years ago

New backward/forward tiled modes are added.
Next week: 1) add 1x1 mode for backward; 2) run alexnet.

hughperkins commented 8 years ago

Coooolll... :-) Sounds good :-)

fsword73 commented 8 years ago

python deepcl_benchmark2.py soumith2

How to debug the error "MemoryError: bad array new length"?

Traceback (most recent call last):
  File "deepcl_benchmark2.py", line 284, in <module>
    go(chosen_runs)
  File "deepcl_benchmark2.py", line 264, in go
    time_layer(num_epochs, label=label, batch_size=batch_size, net_string=net_string)
  File "deepcl_benchmark2.py", line 101, in time_layer
    net.forward( images )
  File "NeuralNet.pyx", line 31, in PyDeepCL.NeuralNet.forward (G:\jianyang\deepcl_fork2\DeepCL\python\PyDeepCL.cxx:7850)
MemoryError: bad array new length
clblas teardown

hughperkins commented 8 years ago

Hmmm, it probably means not enough memory, to be honest... you could try a different layer, e.g. soumith3, or you could get access to a machine with more memory. By the way, I'm fairly sure it is the host that lacks memory, rather than the GPU.

fsword73 commented 8 years ago

It seems so. If I disable several Forward/backward/backpropweights kernels, it can run.
I understand the geometry of alexnet now. It is much easier to achieve 2 instructions per SUM += a*b. So in theory, for the backward/forward/backpropweights of the 3x128x128 input with 96 filters of 11x11 (96C11), the cost will be: images (3x128x128) * 96 filters * (11x11 taps) * (128 batch) * 2 / (2048 cores @ 1 GHz) ≈ 71 ms.

So this is the goal
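
As a sanity check on the arithmetic above (a sketch only; the 2048 cores @ 1 GHz and 2 instructions per multiply-accumulate are the stated assumptions, not measured figures):

    // Rough cost estimate for a 3x128x128 input, 96 filters of 11x11, batch 128.
    #include <cstdio>

    int main() {
        double macs = 3.0 * 128 * 128   // input volume per image
                    * 96                // output filters
                    * 11 * 11           // filter taps
                    * 128;              // batch size
        double instructions = macs * 2;                // 2 instructions per MAC
        double seconds = instructions / (2048 * 1e9);  // 2048 cores @ 1 GHz
        printf("estimated time: %.1f ms\n", seconds * 1000);  // prints ~71.4 ms
        return 0;
    }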

merceyz commented 7 years ago

What is currently blocking the working parts of this from getting merged? / What isn't working?

hughperkins commented 7 years ago

I'm not sure. Do you want to take a look, and create a PR for each unique contribution? (use git 'cherry-pick' on fsword73's commits, perhaps?).

merceyz commented 7 years ago

I was hoping to get @fsword73 's attention and have him finish this.

As far as I can tell his changes are the kernel(s) and the timer. I merged the changes to the timer into my local version of DeepCL a long time ago and had no problems with it. I could create a PR for that if you'd like.

merceyz commented 7 years ago

GPU: AMD MSI R9 280x
Netdef: 16c3z-tanh-mp2-32c3z-tanh-mp2-64c3z-tanh-mp2-128c3z-tanh-mp2-256c3z-tanh-mp2-512c3z-tanh-mp2-128n-tanh-128n-tanh-2n
Input: 96x96x3
Image count: 94 757
BatchSize: 128

I extracted the kernels from https://github.com/hughperkins/DeepCL/blob/fsword73-kernel1/cl and used the BackpropWeightsFsword73* files from https://github.com/hughperkins/DeepCL/tree/fsword73-kernel1/src/conv (removed the string version of the kernel and used an ifstream to read it from the cl files during startup)
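
The ifstream approach could look roughly like this; a minimal sketch only, with an illustrative file path rather than the exact code used:

    // Minimal sketch: load an OpenCL kernel source file into a string at startup,
    // instead of using the string constant compiled into the binary.
    #include <fstream>
    #include <sstream>
    #include <stdexcept>
    #include <string>

    std::string loadKernelSource(const std::string &path) {
        std::ifstream file(path);
        if (!file) {
            throw std::runtime_error("could not open kernel file: " + path);
        }
        std::stringstream buffer;
        buffer << file.rdbuf();   // read the whole .cl file
        return buffer.str();
    }

    // usage (path is illustrative):
    // std::string source = loadKernelSource("cl/BackpropWeightsFsword73.cl");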

Epoch time (from epoch 2)

Before: 530 272 ms
After:  246 572 ms
~53% reduction in epoch time (~2.2x faster)

Network overview: the kernel(s) were the fastest on layers 2, 5, 11, and 24.

On one of the layers the kernel failed to compile with the error -63 (CL_INVALID_GLOBAL_WORK_SIZE)

Full training log: Training log.txt

Kernel 5 is BackpropWeightsFsword73
Kernel 6 is BackpropWeightsFsword73_BatchSize