m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net

Performance #189

Closed: Mattias-NCVIB closed this issue 4 years ago

Mattias-NCVIB commented 4 years ago

Hi, I'll admit that I'm really new to ILGPU and I haven't done much GPU programming, but I'm finding it difficult to improve the performance of my system. My system (Hagrid) is an auto-differentiating machine learning/NN library that performs automatic gradient descent.

I'm performing a complete rewrite of an older system that's hand-coded for a multi-threaded CPU. That hand-coded CPU version is roughly 1000x faster than the GPU version, and I'm unsure what to do. I suspect I'm stalling the accelerator stream and should be using several streams, but I'm not sure how: how do you decide which stream to send which kernel to? Or do you create one stream per kernel?
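
For reference, the stream pattern I've pieced together from the docs looks roughly like the sketch below (a minimal example, not my actual code; the Scale kernel and buffer size are just placeholders, and the API names are from the current ILGPU release):

```csharp
using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

public static class StreamSketch
{
    // Placeholder kernel: scales every element by a factor.
    static void Scale(Index1D i, ArrayView<float> data, float factor) =>
        data[i] *= factor;

    public static void Run()
    {
        using var context = Context.CreateDefault();
        using var accelerator = context.CreateCudaAccelerator(0);

        // LoadAutoGroupedKernel (as opposed to LoadAutoGroupedStreamKernel)
        // returns a delegate that takes an explicit AcceleratorStream.
        var scale = accelerator
            .LoadAutoGroupedKernel<Index1D, ArrayView<float>, float>(Scale);

        using var stream = accelerator.CreateStream();
        using var buffer = accelerator.Allocate1D<float>(1024);

        // Launches are asynchronous; work queued on the same stream runs in order.
        scale(stream, (int)buffer.Length, buffer.View, 2.0f);
        scale(stream, (int)buffer.Length, buffer.View, 0.5f);

        // Block only when the result is actually needed on the host.
        stream.Synchronize();
    }
}
```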

(Marcel Köster, I'm just about to read your paper https://umtl.cs.uni-saarland.de/paper_preprints/paper_koester_ptars_19.pdf to see if there's something I can learn from it!)

Note that the NN basically multiplies a 150x4 matrix with a 3x4 weight matrix and then backpropagates the results. The matrices are small, to be sure, but I'm doing a single large batch each epoch.
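
For context, there's nothing exotic in the matrix multiply itself; it's roughly the naive 2D kernel below (a simplified sketch, not my exact MatMul-22 implementation; the dense Stride2D views are an assumption):

```csharp
using ILGPU;
using ILGPU.Runtime;

public static class MatMulSketch
{
    // Naive dense multiply: c (m x n) = a (m x k) * b (k x n).
    public static void MatMulKernel(
        Index2D index,
        ArrayView2D<float, Stride2D.DenseX> a,
        ArrayView2D<float, Stride2D.DenseX> b,
        ArrayView2D<float, Stride2D.DenseX> c)
    {
        var x = index.X;  // row in a / c
        var y = index.Y;  // column in b / c
        var sum = 0.0f;
        for (var k = 0; k < a.IntExtent.Y; ++k)
            sum += a[new Index2D(x, k)] * b[new Index2D(k, y)];
        c[index] = sum;
    }
}
```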

Performance is horrible when running on the GPU. When I run 100 generations of a trivial NN, my old version takes 0.060s.

This is the old CPU version:

  0: t=0.048s, loss=0.69
 10: t=0.050s, loss=0.60
 20: t=0.051s, loss=0.57
 30: t=0.052s, loss=0.55
 40: t=0.054s, loss=0.53
 50: t=0.055s, loss=0.51
 60: t=0.056s, loss=0.49
 70: t=0.057s, loss=0.48
 80: t=0.058s, loss=0.47
 90: t=0.059s, loss=0.45
100: t=0.060s, loss=0.44

This is the ILGPU CPUAccelerator version (run in release mode from an NUnit test):

Testing using CPUAccelerator [WarpSize: 1, MaxNumThreadsPerGroup: 32, MemorySize: 9223372036854775807]
  0: t=0.526s, loss=0.69
 10: t=0.766s, loss=0.60
 20: t=1.001s, loss=0.57
 30: t=1.305s, loss=0.55
 40: t=1.541s, loss=0.53
 50: t=1.771s, loss=0.51
 60: t=1.998s, loss=0.49
 70: t=2.287s, loss=0.48
 80: t=2.516s, loss=0.47
 90: t=2.805s, loss=0.45
100: t=3.074s, loss=0.44

And this, horror of horrors, is the GPU version:

Testing using GeForce GTX 1080 [WarpSize: 32, MaxNumThreadsPerGroup: 1024, MemorySize: 8589934592]
  0: t=0.928s, loss=0.69
 10: t=4.996s, loss=0.60
 20: t=7.871s, loss=0.57
 30: t=14.697s, loss=0.55
 40: t=21.338s, loss=0.53
 50: t=28.246s, loss=0.51
 60: t=34.979s, loss=0.49
 70: t=41.55s, loss=0.48
 80: t=49.227s, loss=0.47
 90: t=56.088s, loss=0.45
100: t=63.312s, loss=0.44

Yes, 63 seconds - that's 1000 times slower. Clearly I'm doing something wrong. I have been able to run small, tight kernels at full speed, and the first couple of epochs can run fairly fast, but then it chokes.

I added very detailed logging to see which of my kernels were slow - but they're all slow. I'm thinking it's because each kernel gets queued up waiting for the previous one?
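
The per-kernel numbers below come from a wrapper roughly like this (a simplified sketch, not my exact code); since launches are asynchronous, I realize each measurement may also include draining whatever was already queued:

```csharp
using System;
using System.Diagnostics;
using ILGPU.Runtime;

public static class KernelTimer
{
    // Illustrative only: because kernel launches are asynchronous, the
    // Synchronize() call makes each measurement include *all* work that was
    // already queued on the accelerator, not just the kernel launched here.
    public static void Time(Accelerator accelerator, string name, Action launch)
    {
        var sw = Stopwatch.StartNew();
        launch();                   // enqueue the kernel (returns almost immediately)
        accelerator.Synchronize();  // wait for the whole queue to drain
        sw.Stop();
        Console.WriteLine($"{name}: {sw.Elapsed.TotalMilliseconds:F2}ms");
    }
}
```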

Here are the timings from my kernels:

** ILGPU Cuda TIMINGS **
                                   Add2d : count=    101, min=    0ms, max=   40ms, avg=  27.3663ms, total=  2764.00ms
                      BinaryCrossentropy : count=    101, min=    0ms, max=   48ms, avg=  32.4059ms, total=  3273.00ms
                                   Clear : count=   1313, min=    0ms, max=   57ms, avg=   4.3595ms, total=  5724.00ms
                               MatMul-22 : count=    202, min=    0ms, max=  169ms, avg=  15.5990ms, total=  3151.00ms
                    PointMul-ElementWise : count=    303, min=    0ms, max=   45ms, avg=  16.0990ms, total=  4878.00ms
                                   Scale : count=    404, min=    0ms, max=   47ms, avg=  13.9678ms, total=  5643.00ms
                        ScaledAddInPlace : count=   1111, min=    0ms, max=   45ms, avg=   5.9874ms, total=  6652.00ms
                            ScaleInPlace : count=    101, min=    0ms, max=   44ms, avg=  28.6931ms, total=  2898.00ms
                                 Sigmoid : count=    101, min=    0ms, max=   42ms, avg=  36.9010ms, total=  3727.00ms
                         Sub-ElementWise : count=    101, min=    0ms, max=   43ms, avg=  32.6436ms, total=  3297.00ms
                                   Sum01 : count=    101, min=    0ms, max=   43ms, avg=  32.4753ms, total=  3280.00ms
                                   Sum10 : count=    101, min=    0ms, max=   42ms, avg=  28.6337ms, total=  2892.00ms
                              Sum-Reduce : count=    202, min=    0ms, max=  118ms, avg=  25.0545ms, total=  5061.00ms
                           Unsum-Unsum1d : count=    101, min=    0ms, max=   42ms, avg=  33.2772ms, total=  3361.00ms
                           Unsum-Unsum2d : count=    101, min=    0ms, max=   56ms, avg=  29.9505ms, total=  3025.00ms
Crude Sum (may contain duplicated counts): total=59626.00ms

Here are the timings when running on the CPU:

** TIMINGS **
                                   Add2d : count=    101, min=    0ms, max=   11ms, avg=   0.2772ms, total=    28.00ms
                      BinaryCrossentropy : count=    101, min=    0ms, max=    9ms, avg=   0.2079ms, total=    21.00ms
                                   Clear : count=   1313, min=    0ms, max=    4ms, avg=   0.0076ms, total=    10.00ms
                               MatMul-22 : count=    202, min=    0ms, max=  312ms, avg=   1.6337ms, total=   330.00ms
                    PointMul-ElementWise : count=    303, min=    0ms, max=    9ms, avg=   0.0660ms, total=    20.00ms
                                   Scale : count=    404, min=    0ms, max=    4ms, avg=   0.0619ms, total=    25.00ms
                        ScaledAddInPlace : count=   1111, min=    0ms, max=    3ms, avg=   0.0153ms, total=    17.00ms
                            ScaleInPlace : count=    101, min=    0ms, max=    3ms, avg=   0.1287ms, total=    13.00ms
                                 Sigmoid : count=    101, min=    0ms, max=    4ms, avg=   0.1683ms, total=    17.00ms
                         Sub-ElementWise : count=    101, min=    0ms, max=    9ms, avg=   0.1881ms, total=    19.00ms
                                   Sum01 : count=    101, min=    0ms, max=    9ms, avg=   0.1980ms, total=    20.00ms
                                   Sum10 : count=    101, min=    0ms, max=   15ms, avg=   0.3267ms, total=    33.00ms
                              Sum-Reduce : count=    202, min=    2ms, max=   63ms, avg=   5.5297ms, total=  1117.00ms
                           Unsum-Unsum1d : count=    101, min=    0ms, max=    3ms, avg=   0.1287ms, total=    13.00ms
                           Unsum-Unsum2d : count=    101, min=    0ms, max=    9ms, avg=   0.2772ms, total=    28.00ms
Crude Sum (may contain duplicated counts): total=1711.00ms

As you can see, even the element-wise Sigmoid, which takes 17ms for 101 runs on the CPU, takes 3727ms on the GPU. Ouch.
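
For what it's worth, the sigmoid is about as simple as an element-wise kernel gets - roughly the sketch below (not my exact code; XMath comes from the ILGPU.Algorithms package) - so the time presumably isn't being spent in the kernel body itself:

```csharp
using ILGPU;
using ILGPU.Algorithms;  // XMath; requires the ILGPU.Algorithms package

public static class ElementWiseKernels
{
    // Element-wise sigmoid: output[i] = 1 / (1 + exp(-input[i])).
    public static void Sigmoid(Index1D i, ArrayView<float> input, ArrayView<float> output) =>
        output[i] = 1.0f / (1.0f + XMath.Exp(-input[i]));
}
```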

Any insights would be more than welcome - where should I start looking? If anyone wants to have a look at the code, I can upload it to GitHub.

cheers, /mattias

Mattias-NCVIB commented 4 years ago

Posted from wrong account, reposting