m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net

Performance Issue (Reposted) #190

Closed mfagerlund closed 4 years ago

mfagerlund commented 4 years ago

Hi, I'll admit that I'm really new to ILGPU and I haven't done much GPU programming, but I'm finding it difficult to improve the performance of my system. My system (Hagrid) is an auto-differentiating (machine learning/NN) system that can perform automatic gradient descent.

I'm performing a complete rewrite of an older system that's hand-coded for multi-threaded CPU. My hand-coded CPU version is way (1000x) faster than the GPU version and I'm unsure what to do. I think I'm stalling the accelerator stream and that I should use several streams? But I'm not sure how - how do you know which stream to send which kernel to? Or do you create one stream per kernel?

(Marcel Köster, I'm just about to read your paper https://umtl.cs.uni-saarland.de/paper_preprints/paper_koester_ptars_19.pdf to see if there's something that I can learn from there!)

Note that the NN is basically MatMuling a 150x4 matrix with a 3x4 weight matrix, then backpropagating the results. The matrices are small, to be sure, but I'm doing a single large batch each epoch.

Performance is horrible when running on the GPU. When I run 100 generations of a trivial NN, my old version takes 0.060s.

This is the old CPU version:

  0: t=0.048s, loss=0.69
 10: t=0.050s, loss=0.60
 20: t=0.051s, loss=0.57
 30: t=0.052s, loss=0.55
 40: t=0.054s, loss=0.53
 50: t=0.055s, loss=0.51
 60: t=0.056s, loss=0.49
 70: t=0.057s, loss=0.48
 80: t=0.058s, loss=0.47
 90: t=0.059s, loss=0.45
100: t=0.060s, loss=0.44

This is the ILGPU CPU version (in release mode from an NUnit test):

Testing using CPUAccelerator [WarpSize: 1, MaxNumThreadsPerGroup: 32, MemorySize: 9223372036854775807]
  0: t=0.526s, loss=0.69
 10: t=0.766s, loss=0.60
 20: t=1.001s, loss=0.57
 30: t=1.305s, loss=0.55
 40: t=1.541s, loss=0.53
 50: t=1.771s, loss=0.51
 60: t=1.998s, loss=0.49
 70: t=2.287s, loss=0.48
 80: t=2.516s, loss=0.47
 90: t=2.805s, loss=0.45
100: t=3.074s, loss=0.44

And this, horror of horrors, is the GPU version:

Testing using GeForce GTX 1080 [WarpSize: 32, MaxNumThreadsPerGroup: 1024, MemorySize: 8589934592]
  0: t=0.928s, loss=0.69
 10: t=4.996s, loss=0.60
 20: t=7.871s, loss=0.57
 30: t=14.697s, loss=0.55
 40: t=21.338s, loss=0.53
 50: t=28.246s, loss=0.51
 60: t=34.979s, loss=0.49
 70: t=41.55s, loss=0.48
 80: t=49.227s, loss=0.47
 90: t=56.088s, loss=0.45
100: t=63.312s, loss=0.44

Yes, 63 seconds - that's 1000 times slower. Clearly I'm doing something wrong - I have been able to run small, tight kernels at full speed, and the first couple of epochs run fairly fast, but then it chokes.

I added very detailed logging to see which of my kernels were slow - but they're all slow. I'm thinking it's because they get queued up waiting for the previous one?

Here are the timings from my kernels:

** ILGPU Cuda TIMINGS **
                                   Add2d : count=    101, min=    0ms, max=   40ms, avg=  27.3663ms, total=  2764.00ms
                      BinaryCrossentropy : count=    101, min=    0ms, max=   48ms, avg=  32.4059ms, total=  3273.00ms
                                   Clear : count=   1313, min=    0ms, max=   57ms, avg=   4.3595ms, total=  5724.00ms
                               MatMul-22 : count=    202, min=    0ms, max=  169ms, avg=  15.5990ms, total=  3151.00ms
                    PointMul-ElementWise : count=    303, min=    0ms, max=   45ms, avg=  16.0990ms, total=  4878.00ms
                                   Scale : count=    404, min=    0ms, max=   47ms, avg=  13.9678ms, total=  5643.00ms
                        ScaledAddInPlace : count=   1111, min=    0ms, max=   45ms, avg=   5.9874ms, total=  6652.00ms
                            ScaleInPlace : count=    101, min=    0ms, max=   44ms, avg=  28.6931ms, total=  2898.00ms
                                 Sigmoid : count=    101, min=    0ms, max=   42ms, avg=  36.9010ms, total=  3727.00ms
                         Sub-ElementWise : count=    101, min=    0ms, max=   43ms, avg=  32.6436ms, total=  3297.00ms
                                   Sum01 : count=    101, min=    0ms, max=   43ms, avg=  32.4753ms, total=  3280.00ms
                                   Sum10 : count=    101, min=    0ms, max=   42ms, avg=  28.6337ms, total=  2892.00ms
                              Sum-Reduce : count=    202, min=    0ms, max=  118ms, avg=  25.0545ms, total=  5061.00ms
                           Unsum-Unsum1d : count=    101, min=    0ms, max=   42ms, avg=  33.2772ms, total=  3361.00ms
                           Unsum-Unsum2d : count=    101, min=    0ms, max=   56ms, avg=  29.9505ms, total=  3025.00ms
Crude Sum (may contain duplicated counts): total=59626.00ms

Here are the timings when running on the CPU:

** TIMINGS **
                                   Add2d : count=    101, min=    0ms, max=   11ms, avg=   0.2772ms, total=    28.00ms
                      BinaryCrossentropy : count=    101, min=    0ms, max=    9ms, avg=   0.2079ms, total=    21.00ms
                                   Clear : count=   1313, min=    0ms, max=    4ms, avg=   0.0076ms, total=    10.00ms
                               MatMul-22 : count=    202, min=    0ms, max=  312ms, avg=   1.6337ms, total=   330.00ms
                    PointMul-ElementWise : count=    303, min=    0ms, max=    9ms, avg=   0.0660ms, total=    20.00ms
                                   Scale : count=    404, min=    0ms, max=    4ms, avg=   0.0619ms, total=    25.00ms
                        ScaledAddInPlace : count=   1111, min=    0ms, max=    3ms, avg=   0.0153ms, total=    17.00ms
                            ScaleInPlace : count=    101, min=    0ms, max=    3ms, avg=   0.1287ms, total=    13.00ms
                                 Sigmoid : count=    101, min=    0ms, max=    4ms, avg=   0.1683ms, total=    17.00ms
                         Sub-ElementWise : count=    101, min=    0ms, max=    9ms, avg=   0.1881ms, total=    19.00ms
                                   Sum01 : count=    101, min=    0ms, max=    9ms, avg=   0.1980ms, total=    20.00ms
                                   Sum10 : count=    101, min=    0ms, max=   15ms, avg=   0.3267ms, total=    33.00ms
                              Sum-Reduce : count=    202, min=    2ms, max=   63ms, avg=   5.5297ms, total=  1117.00ms
                           Unsum-Unsum1d : count=    101, min=    0ms, max=    3ms, avg=   0.1287ms, total=    13.00ms
                           Unsum-Unsum2d : count=    101, min=    0ms, max=    9ms, avg=   0.2772ms, total=    28.00ms
Crude Sum (may contain duplicated counts): total=1711.00ms

As you can see, even the element-wise Sigmoid, which takes 17ms for 101 runs on the CPU, takes 3727ms on the GPU. Ouch.

Any insights would be more than welcome, where should I start looking? If anyone wants to have a look at the code, I can upload it to GitHub.

cheers, /mattias

m4rs-mt commented 4 years ago

@mfagerlund Thank you for your question. This reads like you are trying to use a CPU-driven algorithm on a GPU. Usually, I would expect a speedup of around 25X to 30X using matrix-matrix multiplications on a GPU.

I think I'm stalling the accelerator stream and that I should use several streams?

Hmm, this may be the case. However, you can simply attach all kernels (if they should be executed one after another) to the same accelerator stream and synchronize the stream at the end, before fetching the results.
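For example (an untested sketch; the kernels, buffers and the accelerator instance are placeholders):

    // Load kernels with an explicit stream parameter and enqueue them on one stream.
    using (var stream = accelerator.CreateStream())
    {
        var kernelA = accelerator.LoadAutoGroupedKernel<Index1, ArrayView<float>>(KernelA);
        var kernelB = accelerator.LoadAutoGroupedKernel<Index1, ArrayView<float>>(KernelB);

        kernelA(stream, buffer.Length, buffer.View); // enqueued, returns immediately
        kernelB(stream, buffer.Length, buffer.View); // runs after KernelA on this stream
        stream.Synchronize(); // wait once, right before fetching the results
    }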

I added very detailed logging to see which of my kernels were slow - but they're all slow. I'm thinking it's because they get queued up waiting for the previous one?

I personally don't think that this is the problem in this case. Are the kernels accidentally executed with DebugInformation turned on? Are you accidentally using math operations that involve many double-precision calculations on your consumer GPU? Maybe your scheduled GPU grids are inadvertently too large, and you are calculating much more than you would like to? Do you call various methods in your kernel that may not be inlined?

As you can see, even element wise Sigmoid, which takes 17ms for 101 runs on CPU, takes 3727ms on GPU. Ouch.

Hmm, this is really strange and points to a major problem 😟. Maybe the kernel is recompiled every time? That would explain the dramatic slowdowns in each iteration...
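For instance, a pattern like this would load (and potentially recompile) the kernel on every iteration (a hypothetical sketch; the names are placeholders):

    for (int epoch = 0; epoch < 100; epoch++)
    {
        // Suspect pattern: loading the kernel inside the loop instead of once up front.
        var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1, ArrayView<float>>(MyKernel);
        kernel(buffer.Length, buffer.View);
    }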

Regarding the paper, I would recommend these two: FANG and FENNS.

mfagerlund commented 4 years ago

Hi, thanks for your reply!

I've done further investigation and I don't think I'm running with any DebugInformation turned on; I made sure to do a clean rebuild in release mode. I'm running these experiments from the ReSharper test runner inside VS, which turned out to be part of the problem.

In an attempt to isolate the issue I created a console application, and for the first batch of epochs, performance is exactly where I was hoping it would be. In the previous examples, t is the accumulated time, so the values are supposed to increase. In the report below, t is the time (in milliseconds) per epoch. The first few are all over the place, but that's fine - it's probably some kind of warm-up issue.

Test run inside console application:

  0: t=722.4534ms, loss=0.6890
  1: t=0.8668ms, loss=0.6624
  2: t=435.8197ms, loss=0.6452
  3: t=121.3614ms, loss=0.6336
  4: t=0.7804ms, loss=0.6253
  5: t=0.7157ms, loss=0.6190
  6: t=0.6829ms, loss=0.6140
  7: t=0.6800ms, loss=0.6096
...
 43: t=0.7336ms, loss=0.5208
 44: t=0.8864ms, loss=0.5189
 45: t=652.0837ms, loss=0.5171
 46: t=707.7748ms, loss=0.5153
 47: t=869.9794ms, loss=0.5135

But as you can see, from epoch/iteration 4 to iteration 44, performance is right where I assumed it would be. It's still slower than my hand-optimized CPU code, but for such small problems, I always assumed it would be. I'm targeting matrices that are hundreds of times larger than this.

At epoch 45, things go off the rails and a major slowdown hits, and it stays slow from there on in. I have no clue what happens there; I'm executing the same code for the entire process. I will try to boil it down to something that can be tested without digging into my code, though I'm not sure how that will work out.

Another HUGE issue I noticed is that when I run the EXACT same code inside a test, using the test runner, every epoch is slow: the epochs that take less than one ms take 30ms-50ms in the test runner, so maybe DebugInformation is added? But then it hits some kind of barrier and epochs start taking >200ms.

Amusingly, and confusingly, there are a handful of epochs that take less than 1ms in this run as well. That makes no sense to me.

Tests run inside the ReSharper test runner:

Testing using GeForce GTX 1080 [WarpSize: 32, MaxNumThreadsPerGroup: 1024, MemorySize: 8589934592]
  0: t=581.5327ms, loss=0.6890
  1: t=55.8659ms, loss=0.6624
  2: t=43.4641ms, loss=0.6452
  3: t=46.7345ms, loss=0.6336
  4: t=476.2855ms, loss=0.6253
  5: t=500.3292ms, loss=0.6190
  6: t=396.7584ms, loss=0.6140
  7: t=428.5810ms, loss=0.6096
  8: t=382.8505ms, loss=0.6058
  9: t=194.9591ms, loss=0.6023
 10: t=391.9976ms, loss=0.5990
 11: t=258.7813ms, loss=0.5959
 12: t=381.8248ms, loss=0.5929
 13: t=306.4313ms, loss=0.5900
 14: t=226.3837ms, loss=0.5871
 15: t=117.4725ms, loss=0.5843
 16: t=39.4972ms, loss=0.5816
 17: t=192.9727ms, loss=0.5789
 18: t=192.3695ms, loss=0.5763
 19: t=153.6137ms, loss=0.5737
 20: t=190.6053ms, loss=0.5711
 21: t=189.9700ms, loss=0.5686
 22: t=0.7525ms, loss=0.5661
 23: t=186.6649ms, loss=0.5636
 24: t=189.1113ms, loss=0.5612
 25: t=186.5434ms, loss=0.5588
 26: t=186.8467ms, loss=0.5564
 27: t=0.8146ms, loss=0.5541
 28: t=97.0543ms, loss=0.5518
 29: t=152.9802ms, loss=0.5495
 30: t=84.0762ms, loss=0.5473
 31: t=187.4957ms, loss=0.5451
 32: t=189.4157ms, loss=0.5429
 33: t=187.9267ms, loss=0.5407
 34: t=149.4873ms, loss=0.5386
 35: t=158.3347ms, loss=0.5365
 36: t=0.7756ms, loss=0.5345
 37: t=149.4425ms, loss=0.5324
 38: t=74.6124ms, loss=0.5304
 39: t=39.9844ms, loss=0.5284
 40: t=151.0965ms, loss=0.5265
 41: t=148.4733ms, loss=0.5246
 42: t=1.5060ms, loss=0.5226
 43: t=149.3188ms, loss=0.5208
 44: t=0.8417ms, loss=0.5189
 45: t=150.6350ms, loss=0.5171
 46: t=147.9778ms, loss=0.5153
 47: t=43.2622ms, loss=0.5135
 48: t=110.7232ms, loss=0.5117
 49: t=43.6789ms, loss=0.5100
 50: t=148.5639ms, loss=0.5083
 51: t=0.7957ms, loss=0.5066
 52: t=148.9574ms, loss=0.5049
 53: t=146.0436ms, loss=0.5032
 54: t=116.3142ms, loss=0.5016
 55: t=0.7833ms, loss=0.5000
 56: t=148.0141ms, loss=0.4984
 57: t=150.2592ms, loss=0.4968
 58: t=149.0304ms, loss=0.4953
 59: t=74.1664ms, loss=0.4938
 60: t=74.5905ms, loss=0.4922

It seems I have two problems:

Firstly, performance drops after a set number of epochs, and as far as I can tell, I'm doing the same thing in each epoch. I'll dig deeper, but if you have any suggestions they'd be more than welcome.

Secondly, I can't run my tests inside the ReSharper test runner - I'm leaving that for now, unless you can suggest a clever way around it, and maybe a way to detect whether DebugInformation is turned on?

/mattias

mfagerlund commented 4 years ago

I just tested something else: if I run the process twice in a row in the same console application, it starts out working fine before it starts slowing down. So I tested disposing the context when the slowdown happens, and that helps. Only disposing the Accelerator also works, but for a shorter time. Very confusing.

Disposing entire context:

  0: t=  784.4998ms, loss=0.6890
  1: t=    0.8189ms, loss=0.6624
  2: t=   40.9501ms, loss=0.6452
  3: t=   15.3721ms, loss=0.6336
  4: t=    0.7803ms, loss=0.6253
  5: t=    0.9254ms, loss=0.6190
  6: t=    0.8010ms, loss=0.6140
  7: t=    0.6744ms, loss=0.6096
  8: t=    0.6631ms, loss=0.6058
  9: t=    0.6978ms, loss=0.6023
 10: t=    0.6755ms, loss=0.5990
 11: t=    0.6813ms, loss=0.5959
 12: t=    0.6852ms, loss=0.5929
 13: t=    0.6853ms, loss=0.5900
 14: t=    0.7563ms, loss=0.5871
 15: t=    0.7738ms, loss=0.5843
 16: t=    0.6777ms, loss=0.5816
 17: t=    0.7230ms, loss=0.5789
 18: t=    0.7107ms, loss=0.5763
 19: t=    0.6894ms, loss=0.5737
 20: t=    0.6964ms, loss=0.5711
 21: t=    0.8294ms, loss=0.5686
 22: t=    0.7790ms, loss=0.5661
 23: t=    0.7042ms, loss=0.5636
 24: t=    0.7726ms, loss=0.5612
 25: t=    0.7396ms, loss=0.5588
 26: t=    0.6626ms, loss=0.5564
 27: t=    0.6623ms, loss=0.5541
 28: t=    1.0733ms, loss=0.5518
 29: t=    0.7085ms, loss=0.5495
 30: t=    0.9126ms, loss=0.5473
 31: t=    0.7926ms, loss=0.5451
 32: t=    0.9395ms, loss=0.5429
 33: t=    0.9089ms, loss=0.5407
 34: t=    0.8446ms, loss=0.5386
 35: t=    0.7910ms, loss=0.5365
 36: t=    0.7414ms, loss=0.5345
 37: t=    0.7891ms, loss=0.5324
 38: t=    0.7689ms, loss=0.5304
 39: t=    0.6968ms, loss=0.5284
 40: t=    0.7296ms, loss=0.5265
 41: t=    0.7116ms, loss=0.5246
 42: t=    0.7107ms, loss=0.5226
 43: t=    0.7023ms, loss=0.5208
 44: t=   11.8334ms, loss=0.5189
 45: t=   70.8817ms, loss=0.5171
 46: t=  193.4629ms, loss=0.5153
 47: t=  191.4616ms, loss=0.5135
 48: t=  775.7186ms, loss=0.5117
RESETTING CONTEXT AND ACCELERATOR
 49: t=  828.9610ms, loss=0.5100
 50: t=  165.3058ms, loss=0.5083
 51: t=    0.7535ms, loss=0.5066
 52: t=    0.7391ms, loss=0.5049
 53: t=    0.6522ms, loss=0.5032
 54: t=    0.6454ms, loss=0.5016
 55: t=    0.7446ms, loss=0.5000
 56: t=    0.6636ms, loss=0.4984
 57: t=    0.6769ms, loss=0.4968
 58: t=    0.6772ms, loss=0.4953
 59: t=    0.7167ms, loss=0.4938
 60: t=    0.6726ms, loss=0.4922
 61: t=    0.6422ms, loss=0.4908
 62: t=    0.7113ms, loss=0.4893
 63: t=    0.6815ms, loss=0.4878
 64: t=    0.6680ms, loss=0.4864
 65: t=    0.7749ms, loss=0.4850
 66: t=    0.7714ms, loss=0.4836
 67: t=    0.6869ms, loss=0.4822
 68: t=    0.9391ms, loss=0.4808
 69: t=  202.9290ms, loss=0.4794
RESETTING CONTEXT AND ACCELERATOR
 70: t=  774.1387ms, loss=0.4781
 71: t=   82.7102ms, loss=0.4768
 72: t=    0.7197ms, loss=0.4755
 73: t=    0.6749ms, loss=0.4742
 74: t=    0.8573ms, loss=0.4729
 75: t=    0.7813ms, loss=0.4716
 76: t=    0.6956ms, loss=0.4704
 77: t=    0.7414ms, loss=0.4692
 78: t=    0.7890ms, loss=0.4679
 79: t=    0.6916ms, loss=0.4667
 80: t=    0.6519ms, loss=0.4655
 81: t=    0.8536ms, loss=0.4644
 82: t=    0.7963ms, loss=0.4632
 83: t=    0.6917ms, loss=0.4621
 84: t=    0.8135ms, loss=0.4609
 85: t=    0.7928ms, loss=0.4598
 86: t=    0.7853ms, loss=0.4587
 87: t=    0.7441ms, loss=0.4576
 88: t=    0.7036ms, loss=0.4565
 89: t=    0.7252ms, loss=0.4554
 90: t=    0.7062ms, loss=0.4544
 91: t=    0.7865ms, loss=0.4533
 92: t=    0.7337ms, loss=0.4523
 93: t=    0.7984ms, loss=0.4512
 94: t=    0.9112ms, loss=0.4502
 95: t=    0.7973ms, loss=0.4492
 96: t=    0.7870ms, loss=0.4482
 97: t=    0.7075ms, loss=0.4472
 98: t=    0.7280ms, loss=0.4463
 99: t=    0.8056ms, loss=0.4453
100: t=    0.7081ms, loss=0.4443

Disposing accelerator:

  0: t=  845.4114ms, loss=0.6890
  1: t=    0.8786ms, loss=0.6624
  2: t=  193.0794ms, loss=0.6452
  3: t=   15.8713ms, loss=0.6336
  4: t=    1.0068ms, loss=0.6253
  5: t=    0.7737ms, loss=0.6190
  6: t=    0.7586ms, loss=0.6140
  7: t=    0.6905ms, loss=0.6096
  8: t=    0.6863ms, loss=0.6058
  9: t=    0.6805ms, loss=0.6023
 10: t=    0.8283ms, loss=0.5990
 11: t=    0.7074ms, loss=0.5959
 12: t=    1.0284ms, loss=0.5929
 13: t=    0.7750ms, loss=0.5900
 14: t=    0.6848ms, loss=0.5871
 15: t=    0.7586ms, loss=0.5843
 16: t=    0.7201ms, loss=0.5816
 17: t=    0.6827ms, loss=0.5789
 18: t=    0.7001ms, loss=0.5763
 19: t=    0.6965ms, loss=0.5737
 20: t=    0.6713ms, loss=0.5711
 21: t=    0.6734ms, loss=0.5686
 22: t=    0.6824ms, loss=0.5661
 23: t=    0.7040ms, loss=0.5636
 24: t=    0.7307ms, loss=0.5612
 25: t=    0.7882ms, loss=0.5588
 26: t=    0.6902ms, loss=0.5564
 27: t=    1.1721ms, loss=0.5541
 28: t=    0.8835ms, loss=0.5518
 29: t=    0.7690ms, loss=0.5495
 30: t=    0.7465ms, loss=0.5473
 31: t=    0.7083ms, loss=0.5451
 32: t=    1.0223ms, loss=0.5429
 33: t=    0.7342ms, loss=0.5407
 34: t=    0.8352ms, loss=0.5386
 35: t=    0.7248ms, loss=0.5365
 36: t=    1.0034ms, loss=0.5345
 37: t=    0.9484ms, loss=0.5324
 38: t=    0.7576ms, loss=0.5304
 39: t=    1.0458ms, loss=0.5284
 40: t=    0.7601ms, loss=0.5265
 41: t=    0.7100ms, loss=0.5246
 42: t=    0.9251ms, loss=0.5226
 43: t=    0.7217ms, loss=0.5208
 44: t=   91.9540ms, loss=0.5189
 45: t=  786.0746ms, loss=0.5171
RESETTING ACCELERATOR
 46: t=  788.1626ms, loss=0.5153
 47: t=  123.3696ms, loss=0.5135
 48: t=    0.8424ms, loss=0.5117
 49: t=    1.1369ms, loss=0.5100
 50: t=    0.7085ms, loss=0.5083
 51: t=    0.6685ms, loss=0.5066
 52: t=    0.6529ms, loss=0.5049
 53: t=    0.6533ms, loss=0.5032
 54: t=    0.9769ms, loss=0.5016
 55: t=    0.8116ms, loss=0.5000
 56: t=    0.7136ms, loss=0.4984
 57: t=    0.7007ms, loss=0.4968
 58: t=    0.7042ms, loss=0.4953
 59: t=    0.6730ms, loss=0.4938
 60: t=    0.7461ms, loss=0.4922
 61: t=    0.8517ms, loss=0.4908
 62: t=    0.6955ms, loss=0.4893
 63: t=    0.6702ms, loss=0.4878
 64: t=    0.8668ms, loss=0.4864
 65: t=    0.6917ms, loss=0.4850
 66: t=    0.7539ms, loss=0.4836
 67: t=  382.5853ms, loss=0.4822
RESETTING ACCELERATOR
 68: t=  771.6581ms, loss=0.4808
 69: t=  121.2338ms, loss=0.4794
 70: t=    0.7461ms, loss=0.4781
 71: t=    0.6759ms, loss=0.4768
 72: t=    0.7345ms, loss=0.4755
 73: t=    0.7103ms, loss=0.4742
 74: t=    0.6846ms, loss=0.4729
 75: t=    0.6466ms, loss=0.4716
 76: t=    0.6911ms, loss=0.4704
 77: t=    0.7243ms, loss=0.4692
 78: t=    0.7102ms, loss=0.4679
 79: t=    0.6444ms, loss=0.4667
 80: t=  517.9366ms, loss=0.4655
RESETTING ACCELERATOR
 81: t=  766.7547ms, loss=0.4644
 82: t=  122.3494ms, loss=0.4632
 83: t=    0.7614ms, loss=0.4621
 84: t=    0.7113ms, loss=0.4609
 85: t=    0.8432ms, loss=0.4598
 86: t=    0.6827ms, loss=0.4587
 87: t=    0.7170ms, loss=0.4576
 88: t=    0.6558ms, loss=0.4565
 89: t=  160.6680ms, loss=0.4554
 90: t=  514.8999ms, loss=0.4544
RESETTING ACCELERATOR
 91: t=  834.2871ms, loss=0.4533
 92: t=  391.6402ms, loss=0.4523
 93: t=    0.7541ms, loss=0.4512
 94: t=    0.7084ms, loss=0.4502
 95: t=    0.6987ms, loss=0.4492
 96: t=    0.6704ms, loss=0.4482
 97: t=    0.6735ms, loss=0.4472
 98: t=    0.6698ms, loss=0.4463
 99: t=    0.6463ms, loss=0.4453
100: t=    0.7614ms, loss=0.4443

/mattias

mfagerlund commented 4 years ago

More curious discoveries: I tried changing the flags for the context as I create it, and none seem to make any significant difference (some slow things down, but that's to be expected given what they supposedly do). However, EnableVerifier makes things worse (tested over multiple runs):

  0: t=  857.6716ms, loss=0.6890
  1: t=   12.0856ms, loss=0.6624
  2: t=    0.8345ms, loss=0.6452
  3: t=    0.6674ms, loss=0.6336
  4: t=    0.6780ms, loss=0.6253
  5: t=    0.6437ms, loss=0.6190
  6: t=    0.7144ms, loss=0.6140
  7: t=    0.7020ms, loss=0.6096
  8: t=    0.6499ms, loss=0.6058
  9: t=    0.7062ms, loss=0.6023
 10: t=    0.6627ms, loss=0.5990
 11: t=    0.6788ms, loss=0.5959
 12: t=    0.6532ms, loss=0.5929
 13: t=    0.6959ms, loss=0.5900
 14: t=    0.6830ms, loss=0.5871
 15: t=    0.7105ms, loss=0.5843
 16: t=    0.6819ms, loss=0.5816
 17: t=    0.6774ms, loss=0.5789
 18: t=    0.7537ms, loss=0.5763
 19: t=    0.7380ms, loss=0.5737
 20: t=    0.8696ms, loss=0.5711
 21: t=    0.7810ms, loss=0.5686
 22: t=  289.1939ms, loss=0.5661
RESETTING CONTEXT AND ACCELERATOR
 23: t=  815.1470ms, loss=0.5636
 24: t=  554.8228ms, loss=0.5612
 25: t=  503.5454ms, loss=0.5588
 26: t=  516.0331ms, loss=0.5564
 27: t=  350.6719ms, loss=0.5541
RESETTING CONTEXT AND ACCELERATOR
 28: t=  757.6347ms, loss=0.5518
 29: t=  619.3011ms, loss=0.5495
 30: t=  612.0122ms, loss=0.5473
 31: t=  600.3412ms, loss=0.5451
 32: t=  583.7251ms, loss=0.5429
RESETTING CONTEXT AND ACCELERATOR
 33: t=  853.0368ms, loss=0.5407
 34: t=  520.6782ms, loss=0.5386
 35: t=  593.3150ms, loss=0.5365
 36: t=  655.8258ms, loss=0.5345
 37: t=  481.2002ms, loss=0.5324
RESETTING CONTEXT AND ACCELERATOR

Here the app crashes with this exception:
ILGPU.Runtime.Cuda.CudaException
  HResult=0x80131500
  Message=invalid argument
  Source=ILGPU
  StackTrace:
   at ILGPU.Runtime.Cuda.CudaMemoryBuffer`2.Dispose(Boolean disposing)
   at ILGPU.Util.DisposeBase.Finalize()

mfagerlund commented 4 years ago

I think I solved it - I would argue it looks like a bug in ILGPU; more details to follow. When I started caching the kernels instead of calling Accelerator.Current.LoadAutoGroupedStreamKernel for each use, BOTH problems went away. So YAY!

/m

mfagerlund commented 4 years ago

So, after having routed all my kernel creations through a static cache, timings are looking great - both when running in the console app and when running in the test runner.

The problem with me caching the kernels, instead of the Accelerator doing it, is that I may destroy the accelerator and create a new one, and then I must also make sure to destroy all the cached kernels. If I could ask the accelerator for a kernel and have it cache the kernel, that would be awesome, because the Accelerator can perform that housekeeping in a way that I can't.

The reason I didn't cache the kernels to begin with is that 1) it's hard to know when they should be destroyed, and 2) the method for loading the kernels from the accelerator is actually very fast - I assumed it was caching to begin with.
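In essence, the fix boils down to something like this (a simplified sketch, not my actual code):

    // Load the kernel once and reuse the delegate, instead of calling
    // LoadAutoGroupedStreamKernel on every invocation.
    private static Action<Index1, ArrayView<float>, ArrayView<float>> cachedSigmoid;

    public static void RunSigmoid(MemoryBuffer<float> input, MemoryBuffer<float> output)
    {
        if (cachedSigmoid == null)
            cachedSigmoid = Accelerator.Current
                .LoadAutoGroupedStreamKernel<Index1, ArrayView<float>, ArrayView<float>>(SigmoidKernel);
        cachedSigmoid(input.Length, input.View, output.View);
    }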

Anyway, thanks for an awesome tool, I hope I helped find an issue that might have troubled someone else in the future.

cheers, m

Console result

Testing using GeForce GTX 1080 [WarpSize: 32, MaxNumThreadsPerGroup: 1024, MemorySize: 8589934592]
Loaded 150 samples
  0: t=  827.8274ms, loss=0.6890
  1: t=    0.6130ms, loss=0.6624
  2: t=    2.9909ms, loss=0.6452
  3: t=    7.2784ms, loss=0.6336
  4: t=    0.6378ms, loss=0.6253
  5: t=    0.4987ms, loss=0.6190
  6: t=    0.5512ms, loss=0.6140
  7: t=    0.4629ms, loss=0.6096
  8: t=    0.4746ms, loss=0.6058
  9: t=    0.4781ms, loss=0.6023
 10: t=    0.5904ms, loss=0.5990
 11: t=    0.6889ms, loss=0.5959
 12: t=    0.5587ms, loss=0.5929
 13: t=    0.4982ms, loss=0.5900
 14: t=    0.5879ms, loss=0.5871
 15: t=    0.5047ms, loss=0.5843
 16: t=    0.5102ms, loss=0.5816
 17: t=    0.5384ms, loss=0.5789
 18: t=    0.5250ms, loss=0.5763
 19: t=    0.4770ms, loss=0.5737
 20: t=    0.5387ms, loss=0.5711
 21: t=    0.6824ms, loss=0.5686
 22: t=    0.6580ms, loss=0.5661
 23: t=    0.4911ms, loss=0.5636
 24: t=    0.5646ms, loss=0.5612
 25: t=    0.5870ms, loss=0.5588
 26: t=    0.5276ms, loss=0.5564
 27: t=    0.5433ms, loss=0.5541
 28: t=    0.5559ms, loss=0.5518
 29: t=    0.5305ms, loss=0.5495
 30: t=    0.5893ms, loss=0.5473
 31: t=    0.5936ms, loss=0.5451
 32: t=    0.5491ms, loss=0.5429
 33: t=    0.5837ms, loss=0.5407
 34: t=    0.6952ms, loss=0.5386
 35: t=    0.6050ms, loss=0.5365
 36: t=    0.5137ms, loss=0.5345
 37: t=    0.5186ms, loss=0.5324
 38: t=    0.4974ms, loss=0.5304
 39: t=    0.5727ms, loss=0.5284
 40: t=    0.5132ms, loss=0.5265
 41: t=    0.5191ms, loss=0.5246
 42: t=    0.5167ms, loss=0.5226
 43: t=    0.4948ms, loss=0.5208
 44: t=    0.5811ms, loss=0.5189
 45: t=    0.7471ms, loss=0.5171
 46: t=    0.5102ms, loss=0.5153
 47: t=    0.5286ms, loss=0.5135
 48: t=    0.5055ms, loss=0.5117
 49: t=    0.4976ms, loss=0.5100
 50: t=    0.5602ms, loss=0.5083
 51: t=    0.5025ms, loss=0.5066
 52: t=    0.5137ms, loss=0.5049
 53: t=    0.5069ms, loss=0.5032
 54: t=    0.6034ms, loss=0.5016
 55: t=    1.3317ms, loss=0.5000
 56: t=    0.5341ms, loss=0.4984
 57: t=    0.5827ms, loss=0.4968
 58: t=    0.7681ms, loss=0.4953
 59: t=    0.5611ms, loss=0.4938
 60: t=    0.5152ms, loss=0.4922
 61: t=    0.5349ms, loss=0.4908
 62: t=    0.5911ms, loss=0.4893
 63: t=    1.4779ms, loss=0.4878
 64: t=    0.5332ms, loss=0.4864
 65: t=    0.5370ms, loss=0.4850
 66: t=    0.6962ms, loss=0.4836
 67: t=    0.5503ms, loss=0.4822
 68: t=    0.5547ms, loss=0.4808
 69: t=    0.5227ms, loss=0.4794
 70: t=    0.5303ms, loss=0.4781
 71: t=    0.8966ms, loss=0.4768
 72: t=    0.6373ms, loss=0.4755
 73: t=    0.5794ms, loss=0.4742
 74: t=    0.6610ms, loss=0.4729
 75: t=    0.5640ms, loss=0.4716
 76: t=    0.5283ms, loss=0.4704
 77: t=    0.6675ms, loss=0.4692
 78: t=    0.5675ms, loss=0.4679
 79: t=    0.6680ms, loss=0.4667
 80: t=    0.5469ms, loss=0.4655
 81: t=    0.5457ms, loss=0.4644
 82: t=    5.4809ms, loss=0.4632
 83: t=    9.9123ms, loss=0.4621
 84: t=    3.2119ms, loss=0.4609
 85: t=    3.0283ms, loss=0.4598
 86: t=    3.1048ms, loss=0.4587
 87: t=    3.1099ms, loss=0.4576
 88: t=    3.5918ms, loss=0.4565
 89: t=    3.3611ms, loss=0.4554
 90: t=    2.8604ms, loss=0.4544
 91: t=    3.4398ms, loss=0.4533
 92: t=    3.4455ms, loss=0.4523
 93: t=    3.8213ms, loss=0.4512
 94: t=    2.9244ms, loss=0.4502
 95: t=    2.8418ms, loss=0.4492
 96: t=    3.0769ms, loss=0.4482
 97: t=    4.9128ms, loss=0.4472
 98: t=    0.6052ms, loss=0.4463
 99: t=    0.5811ms, loss=0.4453
100: t=    0.4995ms, loss=0.4443

** TIMINGS **
SkipFirstEntry=True
               (Allocate MemoryBuffer 1) : count=    908, min=    0ms, max=    1ms, avg=   0.0148ms, total=    13.47ms
              (Allocate MemoryBuffer 12) : count=    202, min=    0ms, max=    0ms, avg=   0.0143ms, total=     2.88ms
             (Allocate MemoryBuffer 150) : count=    807, min=    0ms, max=    1ms, avg=   0.0151ms, total=    12.20ms
               (Allocate MemoryBuffer 3) : count=    101, min=    0ms, max=    0ms, avg=   0.0148ms, total=     1.50ms
             (Allocate MemoryBuffer 450) : count=   1111, min=    0ms, max=    1ms, avg=   0.0150ms, total=    16.72ms
             (Allocate MemoryBuffer 600) : count=    101, min=    0ms, max=    0ms, avg=   0.0143ms, total=     1.44ms
                   (Tensor.GetAsArray 1) : count=    100, min=    0ms, max=    0ms, avg=   0.0782ms, total=     7.82ms
                                   Add2d : count=    100, min=    0ms, max=    0ms, avg=   0.0154ms, total=     1.54ms
                   Allocate MemoryBuffer : count=   3235, min=    0ms, max=    1ms, avg=   0.0149ms, total=    48.32ms
                      BinaryCrossentropy : count=    100, min=    0ms, max=    0ms, avg=   0.0225ms, total=     2.25ms
                                   Clear : count=   1312, min=    0ms, max=    0ms, avg=   0.0041ms, total=     5.44ms
                               MatMul-22 : count=    201, min=    0ms, max=    0ms, avg=   0.0061ms, total=     1.22ms
                    PointMul-ElementWise : count=    302, min=    0ms, max=    0ms, avg=   0.0049ms, total=     1.47ms
                                   Scale : count=    403, min=    0ms, max=    1ms, avg=   0.0232ms, total=     9.34ms
                        ScaledAddInPlace : count=   1110, min=    0ms, max=    0ms, avg=   0.0074ms, total=     8.18ms
                            ScaleInPlace : count=    100, min=    0ms, max=    0ms, avg=   0.0088ms, total=     0.88ms
                                 Sigmoid : count=    100, min=    0ms, max=    0ms, avg=   0.0271ms, total=     2.71ms
                         Sub-ElementWise : count=    100, min=    0ms, max=    1ms, avg=   0.0102ms, total=     1.02ms
                                   Sum01 : count=    100, min=    0ms, max=    0ms, avg=   0.0041ms, total=     0.41ms
                                   Sum10 : count=    100, min=    0ms, max=    0ms, avg=   0.0047ms, total=     0.47ms
                              Sum-Reduce : count=    201, min=    0ms, max=    7ms, avg=   0.0949ms, total=    19.07ms
                       Tensor.GetAsArray : count=    100, min=    0ms, max=    0ms, avg=   0.0782ms, total=     7.82ms
                          Tensor.SetData : count=      3, min=    0ms, max=    0ms, avg=   0.0419ms, total=     0.13ms
                           Unsum-Unsum1d : count=    100, min=    0ms, max=    0ms, avg=   0.0045ms, total=     0.45ms
                           Unsum-Unsum2d : count=    100, min=    0ms, max=    0ms, avg=   0.0054ms, total=     0.54ms
Crude Sum (may contain duplicated counts): total=167.28ms

DONE! Press any key to continue.

Test Runner Result

  0: t=  606.1201ms, loss=0.6890
 10: t=    0.5291ms, loss=0.5990
 20: t=    0.4463ms, loss=0.5711
 30: t=    0.5376ms, loss=0.5473
 40: t=    0.5908ms, loss=0.5265
 50: t=    3.1788ms, loss=0.5083
 60: t=    0.4491ms, loss=0.4922
 70: t=    0.4456ms, loss=0.4781
 80: t=    0.4600ms, loss=0.4655
 90: t=    0.5271ms, loss=0.4544
100: t=    0.4618ms, loss=0.4443

MoFtZ commented 4 years ago

@mfagerlund Great work with tracking down the issue :+1:

I'm not sure if this will work for you, but in my project, I define a singleton class called GpuContext that holds my Context and Accelerator, as well as a cache of my compiled kernels. This GpuContext class then gets registered using Dependency Injection so that the rest of my application has access to it. That way, I only need to compile each kernel once for the lifetime of the application.
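Roughly like this (a trimmed-down sketch, not my actual class; error handling omitted):

    using System;
    using System.Collections.Concurrent;
    using ILGPU;
    using ILGPU.Runtime;
    using ILGPU.Runtime.Cuda;

    public sealed class GpuContext : IDisposable
    {
        public Context Context { get; } = new Context();
        public Accelerator Accelerator { get; }

        // Compiled kernels, keyed by the kernel method delegate.
        private readonly ConcurrentDictionary<object, Delegate> kernelCache =
            new ConcurrentDictionary<object, Delegate>();

        public GpuContext()
        {
            Accelerator = new CudaAccelerator(Context);
        }

        // Compile on first request; afterwards the cached delegate is returned.
        public TDelegate GetOrLoadKernel<TDelegate>(object key, Func<TDelegate> load)
            where TDelegate : Delegate =>
            (TDelegate)kernelCache.GetOrAdd(key, _ => load());

        public void Dispose()
        {
            Accelerator.Dispose();
            Context.Dispose();
        }
    }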

mfagerlund commented 4 years ago

Yeah, I would have done that too, but I'm building a framework, so which kernels will be present is unknown. For now I run them through a cache, which has solved the problem. Also, I'm up to... how do I count my kernels....

I'm up to 58 kernels - and I expect more, so placing them all in one class would be cumbersome ;)

MoFtZ commented 4 years ago

Ah, I see.

I used a generic cache in GpuContext rather than hand-coding all the kernels.

The class provides a RegisterKernel method that stores kernels in a ConcurrentDictionary<KernelToken, Delegate>. NB: KernelToken is just a struct that holds a single int.

Some class, e.g. SigmoidCalculator, will register the kernel on startup (or first use) and get back a token. Later, when it needs to use the kernel, it passes the token to GetKernel to get back the kernel delegate.
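In sketch form it looks something like this (the real code differs in the details):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    // A tiny token: just an int wrapped in a struct, usable as a dictionary key.
    public readonly struct KernelToken
    {
        public KernelToken(int id) => Id = id;
        public int Id { get; }
    }

    public sealed class KernelRegistry
    {
        private readonly ConcurrentDictionary<KernelToken, Delegate> kernels =
            new ConcurrentDictionary<KernelToken, Delegate>();
        private int nextId;

        // Called at startup (or on first use); the caller holds on to the token.
        public KernelToken RegisterKernel(Delegate kernel)
        {
            var token = new KernelToken(Interlocked.Increment(ref nextId));
            kernels[token] = kernel;
            return token;
        }

        // Later, the token is exchanged for the compiled kernel delegate.
        public TDelegate GetKernel<TDelegate>(KernelToken token)
            where TDelegate : Delegate => (TDelegate)kernels[token];
    }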

mfagerlund commented 4 years ago

What I want to accomplish is to minimize the amount of boilerplate I need to write for each method, and to avoid creating some kind of separate storage for each method. So my code basically looks like this:

        public static Tensor Sigmoid(Tensor input) => KernelExecutor.Execute(MethodBase.GetCurrentMethod().Name, SigmoidKernel, input);
        public static Tensor Tanh(Tensor input) => KernelExecutor.Execute(MethodBase.GetCurrentMethod().Name, TanhKernel, input);

The name of the kernel is passed to the executor for logging/performance-measurement reasons. KernelExecutor is responsible for caching and executing the method:

        public static Dictionary<object, object> KernelCache { get; private set; } = new Dictionary<object, object>();

        public static TDelegate GetOrLoadKernel<TDelegate>(string kernelName, TDelegate action, Func<TDelegate> loadFunc)
        {
            return (TDelegate)KernelCache.GetOrCreate(action, () => loadFunc());
        }

        internal static Tensor Execute(string kernelName, Action<Index1, ArrayView<float>, ArrayView<float>> action, Tensor indata)
        {
            var sw = Stopwatch.StartNew();
            var kernel = GetOrLoadKernel(kernelName, action, () => Accelerator.Current.LoadAutoGroupedStreamKernel(action));
            var result = new Tensor(indata.Shape, false).SetName(kernelName);
            kernel(indata.MemoryBuffer.Extent, indata.MemoryBuffer, result.MemoryBuffer);
            TensorTimings.AddTiming(kernelName, sw);
            return result;
        }

Note how the cache is a Dictionary<object, object>, which is kinda brutal, but I didn't want to create one cache per kernel call type.

To be honest, my code doesn't look like that, because I need to store gradients in the process, so Sigmoid actually looks like this;

        private static void SigmoidKernel(Index1 index, ArrayView<float> indata, ArrayView<float> result) { result[index] = 1 / (1 + XMath.Exp(-indata[index])); }
        private static void SigmoidGradientKernel(Index1 index, ArrayView<float> grad, ArrayView<float> res, ArrayView<float> result) { result[index] = grad[index] * res[index] * (1 - res[index]); }
        public static Tensor Sigmoid(Tensor input)
        {
            var result = KernelExecutor.Execute(MethodBase.GetCurrentMethod().Name, SigmoidKernel, input);
            GradientTape.Tape(input, result,
                (GradientTape.TensorGradient ainput, GradientTape.TensorGradient aresult) =>
                {
                    if (ainput.SourceWantsGradients)
                    {
                        ainput.AddGradient(aresult, KernelExecutor.Execute("SigmoidGradient", SigmoidGradientKernel, aresult.Gradient, result));
                        //ainput.AddGradient(aresult, aresult.Gradient * result * (1 - result));
                    }
                });

            return result;
        }

GetOrCreate is a helper extension method that looks like this:

        public static TValue GetOrCreate<TKey, TValue>(this Dictionary<TKey, TValue> dictionary, TKey key, Func<TValue> constructor)
        {
            if (!dictionary.TryGetValue(key, out TValue value))
            {
                value = constructor();
                dictionary.Add(key, value);
            }
            return value;
        }

MoFtZ commented 4 years ago

OK, I have an idea that might work, but I'll need some time to try it out.

In the meantime, you should have a look at these attributes. In particular, CallerMemberNameAttribute should remove the need to pass in the method name each time.

mfagerlund commented 4 years ago

Actually, perhaps I should try to create a better/tighter piece of code that demonstrates the error. The error is more important than ILGPU caching the methods - if the error (and the performance degradation) didn't happen, it would probably be just fine for 99.5% of all users.

mfagerlund commented 4 years ago

I opened a new issue that's more specific to the actual problem: https://github.com/m4rs-mt/ILGPU/issues/192

MoFtZ, thanks for your help!

MoFtZ commented 4 years ago

@mfagerlund OK, here is the KernelExecutor class that I was thinking about.

It uses a dictionary to store kernels, using the method itself as the dictionary key.

The Execute method takes a static method as input and returns a "simplified kernel". This simplified kernel is of the form Func<Input, Output> or Func<Input1, Input2, Output>.

KernelExecutor.zip

Mattias-NCVIB commented 4 years ago

Nice - what's the advantage of passing out funcs from the method instead of executing the code directly? As opposed to this:

        public static MemoryBuffer<float> Execute2(
            Action<Index1, ArrayView<float>, ArrayView<float>> action,
            MemoryBuffer<float> input1, [CallerMemberName] string callerMethodName = "")
        {
            Console.WriteLine($"{action.Method.Name} called from {callerMethodName}");
            var kernel = GetOrAddKernel(action);
            var results = Accelerator.Allocate<float>(input1.Length);
            kernel(input1.Length, input1.View, results.View);
            Accelerator.Synchronize();
            return results;
        }

It looks like you're preparing to cache the result of the call by passing back a func - but it's already cached, so a second (third?) level of caching won't really help you?

Also, y'all love to synchronize - I've found that I don't need to synchronize at all. I think that as long as you're working on a single stream, any data read you perform will be synchronized anyway, so calling Synchronize will just stall the pipeline. Just keep stuffing the pipeline until you're ready to read, and then the framework will synchronize anyway. Or am I missing something? If you're able to use multiple pipelines, then things change a lot.

When adding new kernels, I use this method, which provides type safety by making sure that the MethodInfo matches the Func that's used if the kernel isn't found:

        public static TDelegate GetOrLoadKernel<TDelegate>(string kernelName, TDelegate action, Func<TDelegate> loadFunc)
        {
            return (TDelegate)KernelCache.GetOrCreate(action, () => loadFunc());
        }

(KernelCache is a Dictionary<object,object>)

In this (allowed) call to your method, things will fail because the load action doesn't return the type of action that is expected - and that won't be detected until run time:

        public static Func<MemoryBuffer<float>, MemoryBuffer<float>> Execute(
            Action<Index1, ArrayView<float>, ArrayView<float>, float> action,
            [CallerMemberName] string callerMethodName = "")
        {
            Console.WriteLine($"{action.Method.Name} called from {callerMethodName}");

            var kernel = GetOrAddKernel(action);
            return input1 =>
            {
                var results = Accelerator.Allocate<float>(input1.Length);
                kernel(input1.Length, input1.View, results.View, -5);
                Accelerator.Synchronize();
                return results;
            };
        }

Interesting stuff, thanks!

/m

m4rs-mt commented 4 years ago

@Mattias-NCVIB @MoFtZ I also often use the [CallerMemberName] attribute since it will be resolved at compile time and can conveniently be mapped to the actual caller 🤓

MoFtZ commented 4 years ago

@mfagerlund Your example works too. I was using the Func so that the caller could treat it like a method call, without having to supply the kernel.

And good pickup on using TDelegate - it makes things less error-prone.

In CUDA, streams are a FIFO queue - if you don't specify a stream, you'll use the default stream. The reason for calling Synchronize is so that when the function returns the MemoryBuffer<float>, the contents are guaranteed to be filled with the result. Otherwise, if the GPU is busy, the buffer might be uninitialised. Note that calling MemoryBuffer.GetAsArray() internally calls Synchronize, so I often rely on that behaviour rather than explicitly calling Synchronize.
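For example (a sketch; the kernel and buffer are placeholders):

    kernel(buffer.Length, buffer.View);   // enqueued on the default stream
    Accelerator.Synchronize();            // explicit: wait for the queue to drain
    var data1 = buffer.GetAsArray();

    // ...or rely on the implicit synchronization inside GetAsArray:
    kernel(buffer.Length, buffer.View);
    var data2 = buffer.GetAsArray();      // synchronizes internally before copying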

The reason for using multiple streams is so that the GPU can parallelize operations. A quick search found me this slideshow: https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
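A minimal two-stream sketch (again, all names are placeholders):

    using (var stream1 = Accelerator.CreateStream())
    using (var stream2 = Accelerator.CreateStream())
    {
        // LoadAutoGroupedKernel yields a delegate with an explicit stream parameter.
        var kernel = Accelerator.LoadAutoGroupedKernel<Index1, ArrayView<float>>(MyKernel);
        kernel(stream1, bufferA.Length, bufferA.View); // these two launches may overlap
        kernel(stream2, bufferB.Length, bufferB.View);
        stream1.Synchronize();
        stream2.Synchronize();
    }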