hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0

How to enable time check for all kernel calls during mnist train? #66

Closed. fsword73 closed this issue 7 years ago.

fsword73 commented 8 years ago

Hi Hugh, I know one kernel is very slow in mnist training. That kernel accounts for more than 50% of the workload in mnist train, but I did not know how to enable time checks during mnist train. In other words, I have no way to find the exact kernel name. Is there any method to record every kernel's runtime, so that I can work on the top-10 kernels and optimize them? Basically I have an idea to speed up the filter part of forward. Currently one operation of sum += images[x,y] * filter[s,t] needs 2 buffer loads + 5 muladds; it can be optimized to no buffer load, 1 local memory read + 1 muladd. I will have the final version next Monday.
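
Roughly, this is the local-memory staging pattern the scratch kernels already use (copyLocal into local arrays, then accumulate). A minimal illustrative sketch of that pattern, with made-up names; this is not the actual DeepCL forward kernel:

// Illustrative sketch only: each work-item computes one dot product of an image row
// against a filter row. The filter row is copied into local memory cooperatively once
// per workgroup, so the inner loop becomes one local read plus one mad per element,
// instead of two global-buffer loads plus extra address arithmetic.
kernel void dotrow_localmem_sketch(
        const int rowLength,
        global const float *imageRows,   // [numRows][rowLength], one row per work-item
        global const float *filterRow,   // [rowLength], shared by the whole workgroup
        global float *sums,              // [numRows]
        local float *_filterRow) {       // rowLength floats of local memory
    const int globalId = get_global_id(0);
    const int localId = get_local_id(0);
    const int workgroupSize = get_local_size(0);
    for (int i = localId; i < rowLength; i += workgroupSize) {
        _filterRow[i] = filterRow[i];    // cooperative copy into local memory
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    float sum = 0.0f;
    for (int i = 0; i < rowLength; i++) {
        sum = mad(imageRows[globalId * rowLength + i], _filterRow[i], sum);
    }
    sums[globalId] = sum;
}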

hughperkins commented 8 years ago

Hi fsword73 Yes, there is! You can have a look at https://github.com/hughperkins/cltorch#opencl-profiling

"OpenCL Profiling

"OpenCL natively provides facilities to measure the execution time of kernels, without needing to call cltorch.finish() or similar first, using clGetEventProfilingInfo. In cltorch, you dont need to know how this works ;-) Simply call, at the start of your code:

cltorch.setProfiling(1)

"Then, after running the piece of code under scrutiny, simply call:

cltorch.dumpProfiling()

"Timings are cumulative across multiple calls to the same kernel.

"DumpTimings

"This uses the wall-clock times to measure the elapsed time in different sections of cltorch code. The way it works is, each time the cltorch c++ code calls StatefulTimer::instance()->timeCheck("some status"), the wall-clock time since the last call to ->timeCheck() will be added to the cumulative time for some status. You can pass any status as a string. Then, after running the piece of code under the scrutiny, in your Lua program, simply call cltorch.dumpTimings() to dump these cumulative timings.

"Update: please first call cltorch.setEnableTiming(true) to enable collection of timing information. This is global across all devices."

To what extent does this meet your requirements?

Basically I have an idea to speed up the filter part of forward. Currently one operation of sum += images[x,y] * filter[s,t] needs 2 buffer loads + 5 muladds; it can be optimized to no buffer load, 1 local memory read + 1 muladd. I will have the final version next Monday.

Interesting. Sounds good :-)

fsword73 commented 8 years ago

Is there a similar profiling method for DeepCL?

hughperkins commented 8 years ago

Oh Heh! I was thinking cltorch. Ummm... yes....

./deepcl_train dumptimings=1

fsword73 commented 8 years ago

I found the reason why Convolve is so slow in many cases.
BackpropWeightsScratch/BackpropWeightsScratchLarge are the top-2 bottlenecks. The reason is that there are only 256 threads in total, so the GPU does not have enough work to run in parallel. The current kernel is limited by the long-latency sequential loop over batchSize, not by texture or ALU speed.
It could be sped up 20x or more by splitting the shader into 2 kernels: a first kernel that runs over InputPlane * OutputPlane * batchSize, and a second kernel that sums the batchSize partial results for each InputPlane * OutputPlane (a sketch of the second kernel follows the current code below).

for (int n = 0; n < batchSize; n++) {  // only 256 Threads?  batchSize=128 could have 128 *256 threads. 
    barrier(CLK_LOCAL_MEM_FENCE);
    copyLocal(_imageImage, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
    copyLocal(_errorImage, gradOutput + (n * gNumFilters + outPlane) * gOutputSizeSquared, gOutputSizeSquared);
    barrier(CLK_LOCAL_MEM_FENCE);
    if (localId < gFilterSizeSquared) {
        for (int outRow = 0; outRow < gOutputSize; outRow++) {
            int upstreamRow = outRow - gMargin + filterRow;
            for (int outCol = 0; outCol < gOutputSize; outCol++) {
                const int upstreamCol = outCol - gMargin + filterCol;
                #define proceed (upstreamRow >= 0 && upstreamCol >= 0 && upstreamRow < gInputSize && upstreamCol < gInputSize)
                if (proceed) {
                    // these defines reduce register pressure, compared to const
                    // giving a 40% speedup on nvidia :-)
                    #define resultIndex (outRow * gOutputSize + outCol)
                    #define error (_errorImage[resultIndex])
                    //const float error = _errorImage[resultIndex];
                    #define upstreamDataIndex (upstreamRow * gInputSize + upstreamCol)
                    #define upstreamResult (_imageImage[upstreamDataIndex])
                    thiswchange += upstreamResult * error;
#ifdef BIASED
                    thisbiaschange += error;
#endif
                }
            }
        }
    }
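
A minimal sketch of the second (summing) kernel in the proposed split, assuming the first kernel writes one partial gradient per (weight, batch item) into a scratch buffer laid out as [batchSize][numWeights]; the kernel name and layout here are illustrative, not the final implementation:

// Sums the batchSize partial weight gradients produced by the first kernel.
// Launch with one work-item per weight, where
// numWeights = numFilters * inputPlanes * filterSize * filterSize.
kernel void reduce_partial_gradweights(
        const int batchSize, const int numWeights,
        global const float *partialGradWeights,   // [batchSize][numWeights]
        global float *gradWeights) {              // [numWeights]
    const int weightId = get_global_id(0);
    if (weightId >= numWeights) {
        return;
    }
    float sum = 0.0f;
    for (int n = 0; n < batchSize; n++) {
        sum += partialGradWeights[n * numWeights + weightId];
    }
    gradWeights[weightId] = sum;
}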

StatefulTimer readings: GpuOp::apply inplace end: 12448ms count=112560 GpuOp::apply inplace start: 699ms count=112560 end SoftMaxLayer calcGradInputfromlabels: 16ms count=2814 end SoftMaxLayer calcLossfromlabels: 31ms count=3283 layer10 ActivationBackwardGpuNaive::backward end: 376ms count=2814 layer10 ActivationBackwardGpuNaive::backward start: 127ms count=2814 layer10 ActivationForwardGpuNaive::forward end: 264ms count=2814 layer11 forward layer 11, after clFinish: 16ms count=2814 layer11 forward layer 11, START: 30ms count=2814 layer11 forward layer 11, copied to device: 31ms count=2814 layer11 AddBias::forward after repeatedAdd: 325ms count=2814 layer11 AddBias::forward begin: 16ms count=2814 layer11 BackpropWeightsNaive end: 596ms count=2814 layer11 BackpropWeightsNaive start: 16ms count=2814 layer11 BackwardGpuNaive after first kernel: 387ms count=2814 layer11 BackwardGpuNaive start: 451ms count=2814 layer11 Forward1::forward START: 15ms count=2345 layer11 Forward1::forward after call forward: 434ms count=2345 layer11 Forward2::forward after call forward: 3817ms count=469 layer11 backprop(): start, layer 11: 31ms count=2814 layer11 backproperrors(): done calc gradWeights, layer 11: 15ms count=2814 layer12 end SoftMaxLayer forward: 595ms count=2814 layer12 start SoftMaxLayer forward: 31ms count=2814 layer3 forward layer 3, after clFinish: 63ms count=2814 layer3 forward layer 3, START: 2313ms count=2814 layer3 forward layer 3, copied to device: 966ms count=2814 layer3 AddBias::forward after repeatedAdd: 517ms count=2814 layer3 AddBias::forward begin: 32ms count=2814 layer3 BackpropWeightsScratch end: 75922ms count=2345 layer3 BackpropWeightsScratch start: 722ms count=2345 layer3 BackpropWeightsScratchLarge end: 15100ms count=469 layer3 BackpropWeightsScratchLarge start: 95ms count=469 layer3 Forward1::forward START: 15ms count=1876 layer3 Forward1::forward after call forward: 978ms count=1876 layer3 Forward4::forward after call forward: 359ms count=938 layer3 backprop(): start, layer 3: 16ms count=2814 layer3 backproperrors(): done calc gradWeights, layer 3: 32ms count=2814 layer4 ActivationBackwardGpuNaive::backward end: 577ms count=2814 layer4 ActivationForwardGpuNaive::forward end: 345ms count=2814 layer4 ActivationForwardGpuNaive::forward start: 76ms count=2814 layer5 PoolingBackwardGpuNaive::backward end: 933ms count=2814 layer5 PoolingBackwardGpuNaive::backward start: 48ms count=2814 layer5 PoolingForwardGpuNaive::forward end: 438ms count=2814 layer5 PoolingForwardGpuNaive::forward start: 16ms count=2814 layer6 forward layer 6, after clFinish: 30ms count=2814 layer6 forward layer 6, START: 15ms count=2814 layer6 AddBias::forward after repeatedAdd: 375ms count=2814 layer6 BackpropWeightsScratch end: 11771ms count=2345 layer6 BackpropWeightsScratchLarge end: 2714ms count=469 layer6 BackwardGpuCached after first kernel: 391ms count=469 layer6 BackwardGpuNaive after first kernel: 3419ms count=2345 layer6 BackwardGpuNaive end: 16ms count=2345 layer6 BackwardGpuNaive start: 15ms count=2345 layer6 Forward1::forward after call forward: 3444ms count=2814 layer6 backprop(): start, layer 6: 32ms count=2814 layer6 backproperrors(): calced gradInput, layer 6: 15ms count=2814 layer6 backproperrors(): done calc gradWeights, layer 6: 76ms count=2814 layer7 ActivationBackwardGpuNaive::backward end: 526ms count=2814 layer7 ActivationForwardGpuNaive::forward end: 343ms count=2814 layer7 ActivationForwardGpuNaive::forward start: 16ms count=2814 layer8 PoolingBackwardGpuNaive::backward end: 830ms 
count=2814 layer8 PoolingBackwardGpuNaive::backward start: 16ms count=2814 layer8 PoolingForwardGpuNaive::forward end: 423ms count=2814 layer8 PoolingForwardGpuNaive::forward start: 47ms count=2814 layer9 forward layer 9, after clFinish: 31ms count=2814 layer9 forward layer 9, START: 30ms count=2814 layer9 forward layer 9, copied to device: 16ms count=2814 layer9 AddBias::forward after repeatedAdd: 250ms count=2814 layer9 BackpropWeightsNaive end: 597ms count=2814 layer9 BackwardGpuNaive after first kernel: 887ms count=2814 layer9 BackwardGpuNaive start: 15ms count=2814 layer9 Forward1::forward after call forward: 1837ms count=2345 layer9 Forward2::forward after call forward: 1207ms count=469 layer9 backprop(): start, layer 9: 15ms count=2814 layer9 backproperrors(): calced gradInput, layer 9: 30ms count=2814 layer9 backproperrors(): done calc gradWeights, layer 9: 32ms count=2814 start SoftMaxLayer calcLossfromlabels: 30ms count=3283 start SoftMaxLayer calcNumRight: 79ms count=6566

after epoch 20 149900 ms training loss: 742270 train accuracy: 6100/60000 10.1667% test accuracy: 9949/10000 99.49% after tests 2418 ms record epoch=20 wrote weights to file, filesize 173KB dump enabled=1 StatefulTimer readings: layer10 ActivationForwardGpuNaive::forward end: 47ms count=474 layer11 AddBias::forward after repeatedAdd: 16ms count=474 layer11 Forward2::forward after call forward: 577ms count=79 layer12 end SoftMaxLayer forward: 32ms count=474 layer3 forward layer 3, START: 171ms count=474 layer3 forward layer 3, copied to device: 79ms count=474 layer3 AddBias::forward after repeatedAdd: 62ms count=474 layer3 Forward1::forward after call forward: 142ms count=316 layer3 Forward4::forward after call forward: 30ms count=158 layer4 ActivationForwardGpuNaive::forward end: 47ms count=474 layer5 PoolingForwardGpuNaive::forward end: 32ms count=474 layer6 AddBias::forward after repeatedAdd: 47ms count=474 layer6 Forward1::forward after call forward: 579ms count=474 layer7 ActivationForwardGpuNaive::forward end: 46ms count=474 layer8 PoolingForwardGpuNaive::forward end: 16ms count=474 layer9 Forward1::forward after call forward: 280ms count=395 layer9 Forward2::forward after call forward: 215ms count=79 clblas teardown

hughperkins commented 8 years ago

Ok. Sorry for not replying to this. I think this has been superseded by your newer issue asking about priorities for various libraries, right? I don't see why all libraries can't share the same convolutional implementation(s), so if the optimization you detail gives acceleration relative to the fastest OpenCL implementation you can find, for the geometries you are testing, then let's get it into some single central OpenCL convolutional library. And if it doesn't, well, let's look at switching DeepCL to use that faster existing implementation. None of this is fixed in stone really. Whatever it takes to get as many OpenCL libraries as possible using the fastest convolutional implementations we can come up with, I reckon? :-)

hughperkins commented 8 years ago

@fsword73, just to confirm, your kernel at https://raw.githubusercontent.com/fsword73/CNNKernelPerfTest/master/CL/test_kernel_backpropweights_fast.cl should be slotted into BackpropWeightsScratchLarge.cpp, is that right?

hughperkins commented 8 years ago

Note: I've created a branch to start to look at this here: https://github.com/hughperkins/DeepCL/tree/fsword73-kernel1

I won't have time to look in much detail before Sunday though.

hughperkins commented 8 years ago

Hi @fsword73

I've created a branch for this at https://github.com/hughperkins/DeepCL/tree/fsword73-kernel1

./deepcl_train

... which will train on mnist, by default.

With the existing BackpropWeightsScratchLarge kernel, the timing for epoch 2 is:

after epoch 2 11755 ms
 training loss: 5693.41
 train accuracy: 58236/60000 97.06%
test accuracy: 9868/10000 98.68%
after tests 641 ms
record epoch=2
wrote weights to file, filesize 173KB

With your kernel, the timing for epoch 2 is:

after epoch 2 12072 ms
 training loss: 5969.54
 train accuracy: 58147/60000 96.9117%
test accuracy: 9756/10000 97.56%
after tests 704 ms
record epoch=2
wrote weights to file, filesize 173KB

Thoughts?

fsword73 commented 8 years ago

My change is based on "backpropweights.cl", so force all backpropweights to use the new one.
The basic idea is to make full use of the GPU cores. Whether it is the original backpropweights.cl, BackpropWeightsScratch.cl, or BackpropWeightsScratchLarge.cl, they all use only a limited number of cores. Usually a GPU has 2048+ cores; to make good use of 2000+ cores you need at least 2048 * 16 = 32K threads.
Please change the global work size to [outPlane][inputPlane][filterRow][filterCol] * batchSize * 64, as in the host-side sketch below.
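
In host-code terms that launch geometry would look something like the following; this is a sketch using the raw OpenCL API for illustration (DeepCL itself goes through EasyCL), and the function name is made up:

#include <CL/cl.h>

// One 64-work-item workgroup per (outPlane, inputPlane, filterRow, filterCol, batchItem).
cl_int enqueueBackpropWeights(cl_command_queue queue, cl_kernel kernel,
        int numFilters, int inputPlanes, int filterSize, int batchSize) {
    const size_t workgroupSize = 64;
    const size_t numWorkgroups =
        (size_t)numFilters * inputPlanes * filterSize * filterSize * batchSize;
    const size_t globalSize = numWorkgroups * workgroupSize;
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &globalSize, &workgroupSize, 0, NULL, NULL);
}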

fsword73 commented 8 years ago

Your fix is good for the BIASED case.

hughperkins commented 8 years ago

My change is based on "backpropweights.cl", so force all backpropweights to use the new one.

Ok, currently only BackpropWeightsNaive uses backpropweights.cl:

$ grep backpropweights.cl *
BackpropWeightsNaive.cpp:    // stringify.write_kernel2("kernel", "cl/backpropweights.cl", "backprop_floats", 'options')
BackpropWeightsNaive.cpp:    // generated using cog, from cl/backpropweights.cl:
BackpropWeightsNaive.cpp:    kernel = cl->buildKernelFromString(kernelSource, "backprop_floats", options, "cl/backpropweights.cl");

hughperkins commented 8 years ago

Please change the global threads by [outPlane][inputPlane][filterRow][filterCol] * BatchSize * 64.

Ok, you mean, the workgroup size will be 64, and there will be [outPlane][inputPlane][filterRow][filterCol] * BatchSize workgroups?

hughperkins commented 8 years ago

Ok, updated in a0dda75 on this branch https://github.com/hughperkins/DeepCL/tree/fsword73-kernel1

I'm running it now, by simply calling ./deepcl_train, on an NVIDIA 940M, as before. Epoch 1 timing is:

after epoch 1 186429 ms
 training loss: nan
 train accuracy: 5918/60000 9.86333%
test accuracy: 980/10000 9.8%
after tests 801 ms
record epoch=1
wrote weights to file, filesize 173KB

Thoughts?

fsword73 commented 8 years ago

Let me build the branch. It will take several days since I have not built master yet. The 940M has 384 CUDA cores, so the minimum total thread count to reach peak performance will be 64 * 6 * 4. Do you use 32-bit or 64-bit builds? 64-bit will be 4x slower than 32-bit in most cases.

fsword73 commented 8 years ago

Do you have a full dump of each kernel's time, before the change and after the change?

fsword73 commented 8 years ago

I found out why. For layers 9, 10, 11, 12, the image size == 1 and the output size is 1x1, so my kernel is very bad there; it will be 100x slower than ScratchLarge or Scratch.
It changes from <1s to 65 seconds.

For example, dim.numFilters 150, dim.inputPlanes 16, dim.filterSize 4, dim.outputSize 1, batchSize 128 or batchSize 96: a little change to ScratchLarge will be good.

Workgroups = [dim.numFilters][dim.inputPlanes][dim.filterSize][dim.filterSize]
workgroupSize = 64
Every 4 threads calculate 1 filter weight.
Each thread handles batchSize/4 * outputSize = 32 batch items per thread.
A reduction over the 4 threads produces the final result.
Total: [dim.numFilters][dim.inputPlanes][dim.filterSize][dim.filterSize][4] threads.

No atomics are needed. A sketch of this scheme follows below.
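
A minimal sketch of that scheme, assuming gOutputSize == 1 with no zero padding (so inputSize == filterSize) and using a workgroup of 4 work-items per weight for clarity; the kernel name, argument order, and buffer layouts are illustrative, not the final code:

// One workgroup of 4 work-items per weight: each work-item accumulates a quarter of
// the batch, then work-item 0 sums the 4 partial results. No atomics are needed.
kernel void backprop_weights_outputsize1(
        const int batchSize, const int numFilters, const int inputPlanes, const int filterSize,
        global const float *images,      // [batchSize][inputPlanes][filterSize * filterSize]
        global const float *gradOutput,  // [batchSize][numFilters], output is 1x1
        global float *gradWeights) {     // [numFilters][inputPlanes][filterSize * filterSize]
    local float scratch[4];
    const int localId = get_local_id(0);           // 0..3
    const int weightId = get_group_id(0);          // one workgroup per weight
    const int filterSizeSquared = filterSize * filterSize;
    const int intraFilterOffset = weightId % filterSizeSquared;
    const int inputPlane = (weightId / filterSizeSquared) % inputPlanes;
    const int outPlane = weightId / filterSizeSquared / inputPlanes;

    float sum = 0.0f;
    for (int n = localId; n < batchSize; n += 4) {  // split the batch across 4 work-items
        const float error = gradOutput[n * numFilters + outPlane];
        const float upstream =
            images[(n * inputPlanes + inputPlane) * filterSizeSquared + intraFilterOffset];
        sum += upstream * error;
    }
    scratch[localId] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    if (localId == 0) {                             // final 4-way reduction
        gradWeights[weightId] = scratch[0] + scratch[1] + scratch[2] + scratch[3];
    }
}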

Another choice is to

For a big OutputSize of 28, for example layer 3, it reduces from 46 seconds to 0.5 seconds on my test platform.

I will rewrite another version to deal with small OutputSize.

layer 0:InputLayer{ outputPlanes=1 outputSize=28 } layer 1:NormalizationLayer{ outputPlanes=1 outputSize=28 translate=-33.5744 scale=0.0063936 } layer 2:RandomTranslations{ inputPlanes=1 inputSize=28 translateSize=2 } layer 3:ConvolutionalLayer{ LayerDimensions{ inputPlanes=1 inputSize=28 numFilters=8 filterSize=5 outputSize=28 padZeros=1 biased=1 skip=0} } layer 4:ActivationLayer{ RELU } layer 5:PoolingLayer{ inputPlanes=8 inputSize=28 poolingSize=2 } layer 6:ConvolutionalLayer{ LayerDimensions{ inputPlanes=8 inputSize=14 numFilters=16 filterSize=5 outputSize=14 padZeros=1 biased=1 skip=0} } layer 7:ActivationLayer{ RELU } layer 8:PoolingLayer{ inputPlanes=16 inputSize=14 poolingSize=3 } layer 9:FullyConnectedLayer{ numPlanes=150 imageSize=1 } layer 10:ActivationLayer{ TANH } layer 11:FullyConnectedLayer{ numPlanes=10 imageSize=1 } layer 12:SoftMaxLayer{ perPlane=0 numPlanes=10 imageSize=1 } Parameters overview: (skipping 8 layers with 0 params)

layer 0:InputLayer{ outputPlanes=1 outputSize=28 } layer 1:NormalizationLayer{ outputPlanes=1 outputSize=28 translate=-33.5744 scale=0.0063936 } layer 3:ConvolutionalLayer{ LayerDimensions{ inputPlanes=1 inputSize=28 numFilters=8 filterSize=5 outputSize=28 padZeros=1 biased=1 skip=0} } layer 6:ConvolutionalLayer{ LayerDimensions{ inputPlanes=8 inputSize=14 numFilters=16 filterSize=5 outputSize=14 padZeros=1 biased=1 skip=0} }

batchSize 128 dim.numFilters 16 dim.inputPlanes 8 dim.filterSize 5 dim.outputSize 14 batchSize 128 dim.numFilters 8 dim.inputPlanes 1 dim.filterSize 5 dim.outputSize 28 batchSize 96 dim.numFilters 16 dim.inputPlanes 8 dim.filterSize 5 dim.outputSize 14 batchSize 96 dim.numFilters 8 dim.inputPlanes 1 dim.filterSize 5 dim.outputSize 28 dim.numFilters 150 dim.inputPlanes 16 dim.filterSize 4 dim.outputSize 1 batchSize 128 batchSize 96 dim.numFilters 150 dim.inputPlanes 16 dim.filterSize 4 dim.outputSize 1 batchSize 128 dim.numFilters 10 dim.inputPlanes 150 dim.filterSize 1 dim.outputSize 1 batchSize 96 dim.numFilters 10 dim.inputPlanes 150 dim.filterSize 1 dim.outputSize 1

hughperkins commented 8 years ago

I will rewrite another version to deal with small OutputSize.

Ok, sounds good. So, for now, I will wait for the additional kernel for small outputsize? (Alternatively, I have a class BackpropWeightsAuto, which measures the time for each kernel and chooses the fastest; I've disabled it temporarily, but I could put it back, so that it chooses the fastest kernel for each layer, based on measurements at runtime?)

Do you use 32-bit or 64-bit builds? 64-bit will be 4x slower than 32-bit in most cases.

My operating system is 64-bit. However, my floats are all 32-bit floats, and the floats inside the GPU are 32-bit too. I'm happy to build on a 32-bit OS instead though, if you prefer?

fsword73 commented 8 years ago

A 64-bit kernel in a 64-bit app is much slower than a 32-bit app because of 64-bit address computation. If the memory size is less than 4GB per buffer, it is better to choose a 32-bit app.

I can add one more parameter, const int threadsPerBatchSize, to choose the number of threads per batch item. I will find the best formula tomorrow; the initial formula will be:

threadsPerBatchSize == (64 * batchSize) -- 64 threads per batch item -- outputSize > 16
threadsPerBatchSize == (64) -- 64 threads per batchSize, no atomics -- outputSize > 8
threadsPerBatchSize == (64) -- 64 threads per batchSize -- filterSize == 1 && outputSize == 1
threadsPerBatchSize == (8) -- 8 threads per batchSize, no atomics -- outputSize in [2, 7]
threadsPerBatchSize == (4) -- 4 threads per batchSize, no atomics -- outputSize == 1

One possible transcription of this heuristic is sketched below.
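
A possible transcription of that heuristic as a helper (thresholds copied from the list above; the function name and exact form are illustrative, not existing DeepCL code):

// Returns the proposed threadsPerBatchSize value for a layer, from its output and filter size.
static int chooseThreadsPerBatchSize(int outputSize, int filterSize, int batchSize) {
    if (outputSize > 16) {
        return 64 * batchSize;      // 64 work-items per batch item
    }
    if (outputSize > 8) {
        return 64;                  // no atomics
    }
    if (filterSize == 1 && outputSize == 1) {
        return 64;
    }
    if (outputSize >= 2) {
        return 8;                   // outputSize in [2, 7], no atomics
    }
    return 4;                       // outputSize == 1, no atomics
}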

Yes. You can put back the BackpropWeightsAuto.

hughperkins commented 8 years ago

Yes. You can put back the BackpropWeightsAuto.

Ok. It is re-activated in 5471770. Also, because the earlier BackpropWeightsAuto only tried each kernel once, and therefore included OpenCL compilation time, I now make it try each kernel 3 times before choosing the fastest (a rough sketch of this selection idea appears after the log below). The results of doing this on the default ./deepcl_train are:

   calcGradWeights kernel 0: cannot be used
   calcGradWeights kernel 1 time: 0ms
   calcGradWeights kernel 2 time: 1ms
   calcGradWeights kernel 3 time: 1ms
   calcGradWeights kernel 4 time: 32ms
   calcGradWeights kernel 5 time: 11ms
   calcGradWeights layer selected kernel 1
   calcGradWeights kernel 0: cannot be used
   calcGradWeights kernel 1 time: 0ms
   calcGradWeights kernel 2 time: 6ms
   calcGradWeights kernel 3 time: 4ms
   calcGradWeights kernel 4 time: 34ms
   calcGradWeights kernel 5 time: 342ms
   calcGradWeights layer selected kernel 1
   calcGradWeights kernel 0: cannot be used
   calcGradWeights kernel 1 time: 8ms
   calcGradWeights kernel 2 time: 5ms
   calcGradWeights kernel 3 time: 4ms
   calcGradWeights kernel 4 time: 36ms
   calcGradWeights kernel 5 time: 31ms
   calcGradWeights layer selected kernel 3
   calcGradWeights kernel 0: cannot be used
   calcGradWeights kernel 1 time: 12ms
   calcGradWeights kernel 2 time: 5ms
   calcGradWeights kernel 3 time: 6ms
   calcGradWeights kernel 4 time: 33ms
   calcGradWeights kernel 5 time: 4ms
   calcGradWeights layer selected kernel 5

Your kernel is kernel 5 here, https://github.com/hughperkins/DeepCL/blob/fsword73-kernel1/src/conv/BackpropWeights.cpp#L85
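
As mentioned above, the selection logic is conceptually just "time every usable candidate a few times and keep the fastest". A rough C sketch of that idea (not the actual BackpropWeightsAuto code; the timing callback is a placeholder for running a candidate kernel on a real batch):

#include <float.h>

// Returns the index of the fastest usable candidate, or -1 if none can be used.
// timeKernelMs(k) should run candidate k once and return its time in milliseconds,
// or a negative value if that candidate cannot be used.
static int chooseFastestKernel(int numCandidates, int runsPerCandidate,
                               double (*timeKernelMs)(int k)) {
    int best = -1;
    double bestMs = DBL_MAX;
    for (int k = 0; k < numCandidates; k++) {
        for (int run = 0; run < runsPerCandidate; run++) {
            double ms = timeKernelMs(k);   // the first run also pays the OpenCL build cost
            if (ms >= 0.0 && ms < bestMs) {
                bestMs = ms;
                best = k;
            }
        }
    }
    return best;
}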

Epoch 2 timing is:

after epoch 2 12324 ms
 training loss: 40627.2
 train accuracy: 47251/60000 78.7517%
test accuracy: 8418/10000 84.18%
after tests 783 ms
record epoch=2
wrote weights to file, filesize 173KB

fsword73 commented 8 years ago

test accuracy 8418/10000 (84.18%) is not good because the output memory is not initialized in the shader. I have fixed it.

git push origin fsword73-kernel1
remote: Permission to hughperkins/DeepCL.git denied to fsword73.
fatal: unable to access 'https://github.com/hughperkins/DeepCL.git/': The requested URL returned error: 403

File uploaded to https://github.com/fsword73/CNNKernelPerfTest/tree/master/CL/deepcl_tmp

BackpropWeightsAuto does not work well for a big graphics chip. Milliseconds are too coarse a unit: most kernels' run time is zero for one run.

My 1st run, without any change:

after epoch 2 145278 ms training loss: 524549 train accuracy: 6100/60000 10.1667% test accuracy: 9892/10000 98.92%

Now with my fixed kernel:

after epoch 2 95909 ms training loss: 409282 train accuracy: 6098/60000 10.1633% test accuracy: 9890/10000 98.9% after tests 858 ms

BackpropWeightsAuto always triggers a video memory size issue: having many instances means some buffers/kernels end up in system memory. It then runs much slower even though the BackpropWeights kernel itself is very fast. For example, GpuOp::apply inplace is very slow in this case. It took me 1 day to understand this issue.

hughperkins commented 8 years ago

test accuracy 8418/10000 (84.18%) is not good because the output memory is not initialized in the shader. I have fixed it.

Ok, sounds good. Actually, I haven't considered accuracy at this point. For epoch 2, 84% sounds ok. Normally one runs ~10 epochs or so to get good accuracy. For now, I'm only considering timings, to be honest.

git push origin fsword73-kernel1 remote: Permission to hughperkins/DeepCL.git denied to fsword73.

Ok, sure. Can you do the following please?

git remote add fsword73 https://github.com/fsword73/DeepCL.git
git push fsword73 fsword73-kernel1

This will add a new remote, pointing to your own account, and push the branch to your own account.

If it complains that the repo doesn't exist after doing this, then you can try:

git push --set-upstream fsword73 fsword73-kernel1

(Once it is uploaded to your own account, I can merge it across into mine, or just clone directly from yours)

Milliseconds are too coarse a unit: most kernels' run time is zero for one run.

Alright. We can disable auto for now perhaps. Fewer things to think about.

BackpropWeightsAuto always triggers a video memory size issue

Yes, ok, let's disable auto for now. To disable auto, simply change lines 35-37 of src/conv/BackpropWeights.cpp to:

STATIC BackpropWeights *BackpropWeights::instance(EasyCL *cl, LayerDimensions dim) {
    return new BackpropWeightsFsword73(cl, dim);
//    return new BackpropWeightsAuto(cl, dim);

This will make it always use your kernel directly, ie BackpropWeightsFsword73 class, instead of using the BackpropWeightsAuto class.

hughperkins commented 8 years ago

File uploaded to https://github.com/fsword73/CNNKernelPerfTest/tree/master/CL/deepcl_tmp

Better for you to make the change, push to your own fork of DeepCL, and I merge it across, I think. Then your changes will be visible under contributors (but if you prefer, I can commit your changes in directly, either way is ok for me)

fsword73 commented 8 years ago

I have forked one. Hope to have progress today for the gOutputSize 1x1 case.

hughperkins commented 8 years ago

Ok, sounds good :-)

fsword73 commented 8 years ago

1) Updated the kernel code to initialize to 0.
2) Uploaded another kernel to support gOutputSize=1.
3) Still cannot resolve the memory allocation issue. It is very strange that one of the buffers is allocated in an unknown place. PoolingBackwardGpuNaive::KMemset becomes very slow, only 100-150 MBytes/second, and it becomes the major bottleneck. Deleting the unused forward/backward instances does not resolve the issue, yet memory usage shows only 141 MBytes. It looks more like a driver issue.

4) Setting issue 3) aside, it can achieve 21 seconds versus the original 140 seconds.

hughperkins commented 8 years ago

Cool. Merged in your changes. Seems there are some buggettes on linux:

/home/ubuntu/git/DeepCL/src/util/Timer.h: In constructor ‘Timer::Timer()’:
/home/ubuntu/git/DeepCL/src/util/Timer.h:36:5: error: ‘LARGE_INTEGER’ was not declared in this scope
     LARGE_INTEGER frequency;
     ^
/home/ubuntu/git/DeepCL/src/util/Timer.h:37:32: error: ‘frequency’ was not declared in this scope
     QueryPerformanceFrequency(&frequency);
                                ^
/home/ubuntu/git/DeepCL/src/util/Timer.h:37:41: error: ‘QueryPerformanceFrequency’ was not declared in this scope
     QueryPerformanceFrequency(&frequency);
                                         ^
/home/ubuntu/git/DeepCL/src/util/Timer.h:38:5: error: ‘invFrequency’ was not declared in this scope
     invFrequency = 1.0 / frequency.QuadPart;
     ^
/home/ubuntu/git/DeepCL/src/util/Timer.h: In member function ‘double Timer::ellaspedMicroseconds()’:
/home/ubuntu/git/DeepCL/src/util/Timer.h:101:10: error: ‘timemicroseconds’ was not declared in this scope
   return timemicroseconds;

I will take a look at this point

hughperkins commented 8 years ago

(Fixed Timer on linux; you'd better check it still works ok on windows though)

hughperkins commented 8 years ago

(you'll need to do something like:

git fetch origin
git merge origin/fsword73-kernel1

... to see my Timer.h changes for Linux)

hughperkins commented 8 years ago

Oh, epoch time is down to 8.5 seconds, on 940M. Nice! Seems there is some issue with convergence currently?

after epoch 2 8512 ms
 training loss: 2.43354e+06
 train accuracy: 6265/60000 10.4417%
test accuracy: 1028/10000 10.28%
after tests 638 ms
record epoch=2
wrote weights to file, filesize 173KB

after epoch 3 8522 ms
 training loss: inf
 train accuracy: 6265/60000 10.4417%
test accuracy: 1028/10000 10.28%
after tests 634 ms
record epoch=3
wrote weights to file, filesize 173KB

fsword73 commented 8 years ago

I found the cause of the convergence problem (train accuracy 10.4417%). The "cog" change used the wrong file for https://github.com/hughperkins/DeepCL/blob/fsword73-kernel1/src/conv/BackpropWeightsFsword73_BatchSize.cpp

The shader should be BackpropWeightsFsword73_BatchSize.cl. However, the file uses BackpropWeightsFsword73.cl to produce the cog section. The kernel build line also uses the wrong file name.

I do not know how to produce the cog section.

hughperkins commented 8 years ago

Ah. Interesting. That's strange.

Bear in mind, not the same GPU either. I'm using NVIDIA (but I also have an Intel HD available too)

fsword73 commented 8 years ago

A little change to BackpropWeightsFsword73_batchsize.cl; it should be a little faster. The cpp is not changed since I don't know what cog is. Is cog a command line tool? Or a sed or awk script? The memory allocation issue has been filed as a bug with the driver team. It looks more like a driver issue.

hughperkins commented 8 years ago

A little change to BackpropWeightsFsword73_batchsize.cl; it should be a little faster.

Nice! Sounds good :-)

The cpp is not changed since I don't know what cog is. Is cog a command line tool? Or a sed or awk script?

It's a python script: http://nedbatchelder.com/code/cog/

I will update BackpropWeightsFsword73_BatchSize.cpp so that cog writes the contents of cl/fsword73_backpropweights_fast_batchSize.cl into BackpropWeightsFsword73_BatchSize.cpp, and rerun cog.

hughperkins commented 8 years ago

2f86260 contains that change.

I'm still getting NaNs when I run on the NVIDIA 940M, for now:

after epoch 1 32761 ms
 training loss: nan
 train accuracy: 6253/60000 10.4217%
test accuracy: 980/10000 9.8%
after tests 667 ms
record epoch=1
wrote weights to file, filesize 173KB

after epoch 2 11662 ms
 training loss: nan
 train accuracy: 5923/60000 9.87167%
test accuracy: 980/10000 9.8%
after tests 683 ms
record epoch=2
wrote weights to file, filesize 173KB

fsword73 commented 8 years ago

Let me double check tomorrow. It should have 99% test accuracy.

hughperkins commented 8 years ago

Cool. Sounds good :-)

fsword73 commented 8 years ago

Fixed the issue and completed the fastest version so far. All layers select the fastest one: fsword73_backpropweights_fast_batchSize.cl. I believe there is still room for a 1x performance improvement by reusing inputPlane or outPlane. I expect to do more work on it after my 2 interns are hired.

after epoch 2 20203 ms training loss: 524697 train accuracy: 6100/60000 10.1667% test accuracy: 9883/10000 98.83% after tests 916 ms record epoch=2 wrote weights to file, filesize 173KB

calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 4262 microsecond calcGradWeights kernel 2 time: 467 microsecond calcGradWeights kernel 3 time: 1132 microsecond calcGradWeights kernel 4 time: 5740 microsecond calcGradWeights kernel 5 time: 61 microsecond calcGradWeights kernel 6 time: 355 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 11923 microsecond calcGradWeights kernel 2 time: 898 microsecond calcGradWeights kernel 3 time: 1142 microsecond calcGradWeights kernel 4 time: 98168 microsecond calcGradWeights kernel 5 time: 575 microsecond calcGradWeights kernel 6 time: 4301 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 11455 microsecond calcGradWeights kernel 2 time: 5523 microsecond calcGradWeights kernel 3 time: 1116 microsecond calcGradWeights kernel 4 time: 11208 microsecond calcGradWeights kernel 5 time: 812 microsecond calcGradWeights kernel 6 time: 64490 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 34596 microsecond calcGradWeights kernel 2 time: 34603 microsecond calcGradWeights kernel 3 time: 1008 microsecond calcGradWeights kernel 4 time: 1001 microsecond calcGradWeights kernel 5 time: 592 microsecond calcGradWeights kernel 6 time: 21866 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 3971 microsecond calcGradWeights kernel 2 time: 468 microsecond calcGradWeights kernel 3 time: 1121 microsecond calcGradWeights kernel 4 time: 5755 microsecond calcGradWeights kernel 5 time: 61 microsecond calcGradWeights kernel 6 time: 365 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 11939 microsecond calcGradWeights kernel 2 time: 899 microsecond calcGradWeights kernel 3 time: 1121 microsecond calcGradWeights kernel 4 time: 103035 microsecond calcGradWeights kernel 5 time: 576 microsecond calcGradWeights kernel 6 time: 4352 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 10262 microsecond calcGradWeights kernel 2 time: 5519 microsecond calcGradWeights kernel 3 time: 1107 microsecond calcGradWeights kernel 4 time: 11560 microsecond calcGradWeights kernel 5 time: 815 microsecond calcGradWeights kernel 6 time: 64618 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 34611 microsecond calcGradWeights kernel 2 time: 34948 microsecond calcGradWeights kernel 3 time: 1018 microsecond calcGradWeights kernel 4 time: 1000 microsecond calcGradWeights kernel 5 time: 594 microsecond calcGradWeights kernel 6 time: 21971 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 3976 microsecond calcGradWeights kernel 2 time: 470 microsecond calcGradWeights kernel 3 time: 1133 microsecond calcGradWeights kernel 4 time: 5753 microsecond calcGradWeights kernel 5 time: 61 microsecond calcGradWeights kernel 6 time: 356 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 11955 microsecond calcGradWeights kernel 2 time: 901 microsecond calcGradWeights kernel 3 time: 1149 microsecond calcGradWeights kernel 4 time: 68970 microsecond 
calcGradWeights kernel 5 time: 575 microsecond calcGradWeights kernel 6 time: 4318 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 10243 microsecond calcGradWeights kernel 2 time: 5523 microsecond calcGradWeights kernel 3 time: 1115 microsecond calcGradWeights kernel 4 time: 9591 microsecond calcGradWeights kernel 5 time: 797 microsecond calcGradWeights kernel 6 time: 64483 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 33933 microsecond calcGradWeights kernel 2 time: 34645 microsecond calcGradWeights kernel 3 time: 1023 microsecond calcGradWeights kernel 4 time: 990 microsecond calcGradWeights kernel 5 time: 592 microsecond calcGradWeights kernel 6 time: 21797 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 4106 microsecond calcGradWeights kernel 2 time: 468 microsecond calcGradWeights kernel 3 time: 1126 microsecond calcGradWeights kernel 4 time: 5756 microsecond calcGradWeights kernel 5 time: 61 microsecond calcGradWeights kernel 6 time: 359 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 11944 microsecond calcGradWeights kernel 2 time: 894 microsecond calcGradWeights kernel 3 time: 1121 microsecond calcGradWeights kernel 4 time: 72467 microsecond calcGradWeights kernel 5 time: 575 microsecond calcGradWeights kernel 6 time: 4314 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 10258 microsecond calcGradWeights kernel 2 time: 5523 microsecond calcGradWeights kernel 3 time: 1112 microsecond calcGradWeights kernel 4 time: 9922 microsecond calcGradWeights kernel 5 time: 807 microsecond calcGradWeights kernel 6 time: 64541 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 33917 microsecond calcGradWeights kernel 2 time: 34265 microsecond calcGradWeights kernel 3 time: 1018 microsecond calcGradWeights kernel 4 time: 988 microsecond calcGradWeights kernel 5 time: 594 microsecond calcGradWeights kernel 6 time: 21924 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 4101 microsecond calcGradWeights kernel 2 time: 472 microsecond calcGradWeights kernel 3 time: 1125 microsecond calcGradWeights kernel 4 time: 5753 microsecond calcGradWeights kernel 5 time: 61 microsecond calcGradWeights kernel 6 time: 361 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 11943 microsecond calcGradWeights kernel 2 time: 897 microsecond calcGradWeights kernel 3 time: 1144 microsecond calcGradWeights kernel 4 time: 85074 microsecond calcGradWeights kernel 5 time: 573 microsecond calcGradWeights kernel 6 time: 4327 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 10235 microsecond calcGradWeights kernel 2 time: 5525 microsecond calcGradWeights kernel 3 time: 1129 microsecond calcGradWeights kernel 4 time: 11140 microsecond calcGradWeights kernel 5 time: 806 microsecond calcGradWeights kernel 6 time: 64577 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 34372 microsecond 
calcGradWeights kernel 2 time: 34459 microsecond calcGradWeights kernel 3 time: 1017 microsecond calcGradWeights kernel 4 time: 992 microsecond calcGradWeights kernel 5 time: 592 microsecond calcGradWeights kernel 6 time: 21866 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 3999 microsecond calcGradWeights kernel 2 time: 467 microsecond calcGradWeights kernel 3 time: 1125 microsecond calcGradWeights kernel 4 time: 5763 microsecond calcGradWeights kernel 5 time: 61 microsecond calcGradWeights kernel 6 time: 356 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 11963 microsecond calcGradWeights kernel 2 time: 894 microsecond calcGradWeights kernel 3 time: 1130 microsecond calcGradWeights kernel 4 time: 82749 microsecond calcGradWeights kernel 5 time: 575 microsecond calcGradWeights kernel 6 time: 4333 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 10244 microsecond calcGradWeights kernel 2 time: 5519 microsecond calcGradWeights kernel 3 time: 1101 microsecond calcGradWeights kernel 4 time: 10906 microsecond calcGradWeights kernel 5 time: 803 microsecond calcGradWeights kernel 6 time: 64583 microsecond calcGradWeights layer selected kernel 5 calcGradWeights kernel 0: cannot be used calcGradWeights kernel 1 time: 34641 microsecond calcGradWeights kernel 2 time: 34468 microsecond calcGradWeights kernel 3 time: 1001 microsecond calcGradWeights kernel 4 time: 1001 microsecond calcGradWeights kernel 5 time: 590 microsecond calcGradWeights kernel 6 time: 21839 microsecond calcGradWeights layer selected kernel 5

hughperkins commented 8 years ago

after epoch 2 20203 ms training loss: 524697 train accuracy: 6100/60000 10.1667% test accuracy: 9883/10000 98.83%

Ok. Generally speaking, the test accuracy will tend to be slightly worse than the training accuracy. There are occasional exceptions, especially in the earlier epochs, but they tend to be relatively small, and arise just because at the start of the epoch there are tons of mistakes, while by the end of the epoch the network has learned quite well.

But in mnist there are only 10 possible answers: 1, 2, 3 ... 9, or 0. So, if one chose randomly, one would get the answers right about 10% of the time. And you can see that the training accuracy is about 10%, i.e. random chance.

Thoughts?

fsword73 commented 8 years ago

It is really strange. Look at my 1st run with the downloaded win7 64-bit dist binary: the train accuracy is only 10%. I used the command line deepcl_train and the arguments from the first page of DeepCL. I can try deepcl.exe without any arguments. Are these 2 sets of arguments different?

fsword73 commented 8 years ago

Paste my 1st run with downloaded binary

after epoch 20 149900 ms training loss: 742270 train accuracy: 6100/60000 10.1667% test accuracy: 9949/10000 99.49%

hughperkins commented 8 years ago

That's pretty odd... Can you provide:

fsword73 commented 8 years ago

If I use the recommended command line from https://github.com/hughperkins/DeepCL/blob/master/README.md

deepcl_train.exe netdef=rt2-8c5z-relu-mp2-16c5z-relu-mp3-150n-tanh-10n numepochs=2 multinet=6 learningrate=0.002 dumptimings=1

then the train accuracy is 10%.

If I change to deepcl_train.exe dumptimings=1, it is very good.

So it is either a communication issue or an issue in README.md.

after epoch 1 9604 ms training loss: 18447.4 train accuracy: 54200/60000 90.3333% test accuracy: 9668/10000 96.68% after tests 154 ms record epoch=1 wrote weights to file, filesize 173KB dump enabled=1

after epoch 2 3316 ms training loss: 5560.06 train accuracy: 58265/60000 97.1083% test accuracy: 9808/10000 98.08% after tests 156 ms record epoch=2 wrote weights to file, filesize 173KB

hughperkins commented 8 years ago

Ok. Seems there is some issue with the command line in my readme. Maybe the multinet. I should check this point sometime.

With the simpler commandline, the results look much more normal :-)

Note that on linux the results do not look very normal:

after epoch 1 14451 ms
 training loss: nan
 train accuracy: 6274/60000 10.4567%
test accuracy: 980/10000 9.8%
after tests 637 ms
record epoch=1
wrote weights to file, filesize 173KB

I suspect this could be a difference in gpu, rather than os. Unfortunately I don't have time to debug this right now. Do you mind spinning up an EC2 or a Nimbix instance and taking a look? Personally I think the Nimbix option is the easier one by the way, since they've pre-installed all the drivers, and their billing is per-second.

hughperkins commented 8 years ago

(Hmmm, alternatively, what we could do is switch kernels according to which gpu is being used, ie:

hughperkins commented 8 years ago

I think I'd rather you create AMD-specific kernels on the whole, and make those super amazing fast :-) than spend time digging around with other gpus and stuff. So maybe we go for the solution of using your kernels on AMD, and the existing kernels on other gpus?

fsword73 commented 8 years ago

My basic idea does not depend on the GPU. It just considers how to reduce texture loads and increase data locality.

Basically I can continue to optimize forward and backprop weights on Windows + AMD GPU.

I can set up both AMD and NVIDIA GPUs under Linux 1 month later.

fsword73 commented 8 years ago

I got it. Amazon EC2 and Nimbix are cloud servers.

I have a GTX 750, 980, 980 Ti, and Titan X now. Hope these cards work well with OpenCL.

fsword73 commented 8 years ago

Today I checked in the latest version, which should have good test and train accuracy.

hughperkins commented 8 years ago

Today I checked in the latest version, which should have good test and train accuracy.

You did! I somehow didn't notice that. Yes, so the latest version gives reasonable accuracy on my computer now.

Here is a comparison between the old kernels, in master, and the new kernels, on an NVIDIA 940M:

master branch:

after epoch 2 11755 ms
 training loss: 5693.41
 train accuracy: 58236/60000 97.06%
test accuracy: 9868/10000 98.68%
after tests 641 ms
record epoch=2
wrote weights to file, filesize 173KB

fsword73-kernel1 branch:

after epoch 2 12256 ms
 training loss: 5543.3
 train accuracy: 58261/60000 97.1017%
test accuracy: 9827/10000 98.27%
after tests 731 ms
record epoch=2
wrote weights to file, filesize 173KB

The accuracies are the same. Thoughts on the batch time?

fsword73 commented 8 years ago

There is a 500 ms loss on a small chip such as the 940M. Shall we consider only big chips with > 2000 CUDA cores?

I can set up my NVIDIA Titan X and 750 Ti next week if possible.