@mfagerlund Thank you for your question. This reads like you are trying to use a CPU-driven algorithm on a GPU. Usually, I would expect a speedup of around 25X to 30X using matrix-matrix multiplications on a GPU.
> I think I'm stalling the accelerator stream and that I should use several streams?
Hmm, this may be the case. However, you can simply attach all kernels (if they should be executed one after another) to the same accelerator stream and synchronize the stream at the end, before fetching the results.
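Something along these lines (a minimal sketch, assuming the ILGPU 0.9-era API used elsewhere in this thread; the kernel and sizes are made up):

using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

static void Twice(Index1 index, ArrayView<float> data) => data[index] *= 2f;

using (var context = new Context())
using (var accelerator = new CudaAccelerator(context))
using (var buffer = accelerator.Allocate<float>(1024))
{
    var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1, ArrayView<float>>(Twice);

    // All launches below are queued, in order, on the accelerator's default stream.
    for (int i = 0; i < 10; i++)
        kernel(buffer.Length, buffer.View);

    // Synchronize once, right before fetching the results.
    accelerator.Synchronize();
    var results = buffer.GetAsArray();
}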
> I added very detailed logging to see which of my kernels were slow - but they're all slow. I'm thinking it's because they get queued up waiting for the previous one?
I personally don't think that this is the problem in this case. Are the kernels accidentally executed with DebugInformation turned on? Are you accidentally using math operations that involve many double calculations on your consumer GPU? Maybe your scheduled GPU grids are inadvertently too large, and you are calculating much more than you intend to? Do you call various methods in your kernel that may not be inlined?
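For instance, two of those checks in code (a sketch; SigmoidF is a made-up helper, XMath is the same class used later in this thread):

using System.Runtime.CompilerServices;
using ILGPU;
using ILGPU.Algorithms;

// Ask the JIT to inline small helpers that are called from kernels, and stick
// to float overloads so no double arithmetic sneaks in.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static float SigmoidF(float x) => 1f / (1f + XMath.Exp(-x));

static void SigmoidKernel(Index1 index, ArrayView<float> data) =>
    data[index] = SigmoidF(data[index]);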
> As you can see, even element wise Sigmoid, which takes 17ms for 101 runs on CPU, takes 3727ms on GPU. Ouch.
Hmm, this is really strange and points to a major problem 😟. Maybe the kernel is recompiled every time? That would explain the dramatic slowdowns in each iteration...
Regarding the paper, I would recommend these two: FANG and FENNS.
Hi, thanks for your reply!
I've done further investigations and I don't think I'm running with any DebugInformation turned on, I made sure to do a clean rebuild in release mode. I'm running these experiments from the Resharper test runner inside VS, which turned out to be part of the problem.
In an attempt to isolate the issue I created a console application, and for the first several epochs, performance is exactly where I was hoping it would be. In the previous examples, t is the accumulated time, so those values are supposed to increase. In the report below, t is the time (in milliseconds) per epoch. The first few are all over the place, but that's fine; it's probably some kind of warm-up issue.
Test run inside console application:
0: t=722.4534ms, loss=0.6890
1: t=0.8668ms, loss=0.6624
2: t=435.8197ms, loss=0.6452
3: t=121.3614ms, loss=0.6336
4: t=0.7804ms, loss=0.6253
5: t=0.7157ms, loss=0.6190
6: t=0.6829ms, loss=0.6140
7: t=0.6800ms, loss=0.6096
...
43: t=0.7336ms, loss=0.5208
44: t=0.8864ms, loss=0.5189
45: t=652.0837ms, loss=0.5171
46: t=707.7748ms, loss=0.5153
47: t=869.9794ms, loss=0.5135
But as you can see, from epoch/iteration 4 to iteration 44, performance is right where I assumed it would be. It's still slower than my hand-optimized CPU code, but for such small problems, I always assumed it would be. I'm targeting matrices that are hundreds of times larger than this.
At epoch 45, things go off the rails: a major slowdown hits, and it stays slow from there on. I have no clue what happens there; I'm executing the same code for the entire process. I will try to boil it down to something that can be tested without digging into my code, though I'm not sure how that will work out.
Another HUGE issue I noticed is that when I run the EXACT same code inside a test, using the test runner, every epoch is slow: the epochs that take less than one ms take 30ms-50ms in the test runner, so maybe DebugInformation is added? But then it hits some kind of barrier and epochs start taking >200ms.
Amusingly, and confusingly, there are a handful of epochs that take less than 1ms in this run as well. That makes no sense to me.
Tests run inside the Resharper Test Runner:
Testing using GeForce GTX 1080 [WarpSize: 32, MaxNumThreadsPerGroup: 1024, MemorySize: 8589934592]
0: t=581.5327ms, loss=0.6890
1: t=55.8659ms, loss=0.6624
2: t=43.4641ms, loss=0.6452
3: t=46.7345ms, loss=0.6336
4: t=476.2855ms, loss=0.6253
5: t=500.3292ms, loss=0.6190
6: t=396.7584ms, loss=0.6140
7: t=428.5810ms, loss=0.6096
8: t=382.8505ms, loss=0.6058
9: t=194.9591ms, loss=0.6023
10: t=391.9976ms, loss=0.5990
11: t=258.7813ms, loss=0.5959
12: t=381.8248ms, loss=0.5929
13: t=306.4313ms, loss=0.5900
14: t=226.3837ms, loss=0.5871
15: t=117.4725ms, loss=0.5843
16: t=39.4972ms, loss=0.5816
17: t=192.9727ms, loss=0.5789
18: t=192.3695ms, loss=0.5763
19: t=153.6137ms, loss=0.5737
20: t=190.6053ms, loss=0.5711
21: t=189.9700ms, loss=0.5686
22: t=0.7525ms, loss=0.5661
23: t=186.6649ms, loss=0.5636
24: t=189.1113ms, loss=0.5612
25: t=186.5434ms, loss=0.5588
26: t=186.8467ms, loss=0.5564
27: t=0.8146ms, loss=0.5541
28: t=97.0543ms, loss=0.5518
29: t=152.9802ms, loss=0.5495
30: t=84.0762ms, loss=0.5473
31: t=187.4957ms, loss=0.5451
32: t=189.4157ms, loss=0.5429
33: t=187.9267ms, loss=0.5407
34: t=149.4873ms, loss=0.5386
35: t=158.3347ms, loss=0.5365
36: t=0.7756ms, loss=0.5345
37: t=149.4425ms, loss=0.5324
38: t=74.6124ms, loss=0.5304
39: t=39.9844ms, loss=0.5284
40: t=151.0965ms, loss=0.5265
41: t=148.4733ms, loss=0.5246
42: t=1.5060ms, loss=0.5226
43: t=149.3188ms, loss=0.5208
44: t=0.8417ms, loss=0.5189
45: t=150.6350ms, loss=0.5171
46: t=147.9778ms, loss=0.5153
47: t=43.2622ms, loss=0.5135
48: t=110.7232ms, loss=0.5117
49: t=43.6789ms, loss=0.5100
50: t=148.5639ms, loss=0.5083
51: t=0.7957ms, loss=0.5066
52: t=148.9574ms, loss=0.5049
53: t=146.0436ms, loss=0.5032
54: t=116.3142ms, loss=0.5016
55: t=0.7833ms, loss=0.5000
56: t=148.0141ms, loss=0.4984
57: t=150.2592ms, loss=0.4968
58: t=149.0304ms, loss=0.4953
59: t=74.1664ms, loss=0.4938
60: t=74.5905ms, loss=0.4922
It seems I have two problems:
Firstly, performance drops after a set number of epochs and as far as I can tell, I'm doing the same thing for each epoch. I'll dig deeper but if you have any suggestions that'd be more than welcome.
Secondly, I can't run my tests inside the Resharper Test Runner - I'm leaving that for now, unless you can suggest a clever way around it, and maybe a way to detect if DebugInformation is turned on?
/mattias
I just tested something else: if I run the process twice in a row in the same console application, it starts out working fine before it starts slowing down. So I tested disposing the context when the slowdown happens, and that helps. Only disposing the Accelerator also works, but for a shorter time. Very confusing.
Disposing entire context:
0: t= 784.4998ms, loss=0.6890
1: t= 0.8189ms, loss=0.6624
2: t= 40.9501ms, loss=0.6452
3: t= 15.3721ms, loss=0.6336
4: t= 0.7803ms, loss=0.6253
5: t= 0.9254ms, loss=0.6190
6: t= 0.8010ms, loss=0.6140
7: t= 0.6744ms, loss=0.6096
8: t= 0.6631ms, loss=0.6058
9: t= 0.6978ms, loss=0.6023
10: t= 0.6755ms, loss=0.5990
11: t= 0.6813ms, loss=0.5959
12: t= 0.6852ms, loss=0.5929
13: t= 0.6853ms, loss=0.5900
14: t= 0.7563ms, loss=0.5871
15: t= 0.7738ms, loss=0.5843
16: t= 0.6777ms, loss=0.5816
17: t= 0.7230ms, loss=0.5789
18: t= 0.7107ms, loss=0.5763
19: t= 0.6894ms, loss=0.5737
20: t= 0.6964ms, loss=0.5711
21: t= 0.8294ms, loss=0.5686
22: t= 0.7790ms, loss=0.5661
23: t= 0.7042ms, loss=0.5636
24: t= 0.7726ms, loss=0.5612
25: t= 0.7396ms, loss=0.5588
26: t= 0.6626ms, loss=0.5564
27: t= 0.6623ms, loss=0.5541
28: t= 1.0733ms, loss=0.5518
29: t= 0.7085ms, loss=0.5495
30: t= 0.9126ms, loss=0.5473
31: t= 0.7926ms, loss=0.5451
32: t= 0.9395ms, loss=0.5429
33: t= 0.9089ms, loss=0.5407
34: t= 0.8446ms, loss=0.5386
35: t= 0.7910ms, loss=0.5365
36: t= 0.7414ms, loss=0.5345
37: t= 0.7891ms, loss=0.5324
38: t= 0.7689ms, loss=0.5304
39: t= 0.6968ms, loss=0.5284
40: t= 0.7296ms, loss=0.5265
41: t= 0.7116ms, loss=0.5246
42: t= 0.7107ms, loss=0.5226
43: t= 0.7023ms, loss=0.5208
44: t= 11.8334ms, loss=0.5189
45: t= 70.8817ms, loss=0.5171
46: t= 193.4629ms, loss=0.5153
47: t= 191.4616ms, loss=0.5135
48: t= 775.7186ms, loss=0.5117
RESETTING CONTEXT AND ACCELERATOR
49: t= 828.9610ms, loss=0.5100
50: t= 165.3058ms, loss=0.5083
51: t= 0.7535ms, loss=0.5066
52: t= 0.7391ms, loss=0.5049
53: t= 0.6522ms, loss=0.5032
54: t= 0.6454ms, loss=0.5016
55: t= 0.7446ms, loss=0.5000
56: t= 0.6636ms, loss=0.4984
57: t= 0.6769ms, loss=0.4968
58: t= 0.6772ms, loss=0.4953
59: t= 0.7167ms, loss=0.4938
60: t= 0.6726ms, loss=0.4922
61: t= 0.6422ms, loss=0.4908
62: t= 0.7113ms, loss=0.4893
63: t= 0.6815ms, loss=0.4878
64: t= 0.6680ms, loss=0.4864
65: t= 0.7749ms, loss=0.4850
66: t= 0.7714ms, loss=0.4836
67: t= 0.6869ms, loss=0.4822
68: t= 0.9391ms, loss=0.4808
69: t= 202.9290ms, loss=0.4794
RESETTING CONTEXT AND ACCELERATOR
70: t= 774.1387ms, loss=0.4781
71: t= 82.7102ms, loss=0.4768
72: t= 0.7197ms, loss=0.4755
73: t= 0.6749ms, loss=0.4742
74: t= 0.8573ms, loss=0.4729
75: t= 0.7813ms, loss=0.4716
76: t= 0.6956ms, loss=0.4704
77: t= 0.7414ms, loss=0.4692
78: t= 0.7890ms, loss=0.4679
79: t= 0.6916ms, loss=0.4667
80: t= 0.6519ms, loss=0.4655
81: t= 0.8536ms, loss=0.4644
82: t= 0.7963ms, loss=0.4632
83: t= 0.6917ms, loss=0.4621
84: t= 0.8135ms, loss=0.4609
85: t= 0.7928ms, loss=0.4598
86: t= 0.7853ms, loss=0.4587
87: t= 0.7441ms, loss=0.4576
88: t= 0.7036ms, loss=0.4565
89: t= 0.7252ms, loss=0.4554
90: t= 0.7062ms, loss=0.4544
91: t= 0.7865ms, loss=0.4533
92: t= 0.7337ms, loss=0.4523
93: t= 0.7984ms, loss=0.4512
94: t= 0.9112ms, loss=0.4502
95: t= 0.7973ms, loss=0.4492
96: t= 0.7870ms, loss=0.4482
97: t= 0.7075ms, loss=0.4472
98: t= 0.7280ms, loss=0.4463
99: t= 0.8056ms, loss=0.4453
100: t= 0.7081ms, loss=0.4443
Disposing accelerator:
0: t= 845.4114ms, loss=0.6890
1: t= 0.8786ms, loss=0.6624
2: t= 193.0794ms, loss=0.6452
3: t= 15.8713ms, loss=0.6336
4: t= 1.0068ms, loss=0.6253
5: t= 0.7737ms, loss=0.6190
6: t= 0.7586ms, loss=0.6140
7: t= 0.6905ms, loss=0.6096
8: t= 0.6863ms, loss=0.6058
9: t= 0.6805ms, loss=0.6023
10: t= 0.8283ms, loss=0.5990
11: t= 0.7074ms, loss=0.5959
12: t= 1.0284ms, loss=0.5929
13: t= 0.7750ms, loss=0.5900
14: t= 0.6848ms, loss=0.5871
15: t= 0.7586ms, loss=0.5843
16: t= 0.7201ms, loss=0.5816
17: t= 0.6827ms, loss=0.5789
18: t= 0.7001ms, loss=0.5763
19: t= 0.6965ms, loss=0.5737
20: t= 0.6713ms, loss=0.5711
21: t= 0.6734ms, loss=0.5686
22: t= 0.6824ms, loss=0.5661
23: t= 0.7040ms, loss=0.5636
24: t= 0.7307ms, loss=0.5612
25: t= 0.7882ms, loss=0.5588
26: t= 0.6902ms, loss=0.5564
27: t= 1.1721ms, loss=0.5541
28: t= 0.8835ms, loss=0.5518
29: t= 0.7690ms, loss=0.5495
30: t= 0.7465ms, loss=0.5473
31: t= 0.7083ms, loss=0.5451
32: t= 1.0223ms, loss=0.5429
33: t= 0.7342ms, loss=0.5407
34: t= 0.8352ms, loss=0.5386
35: t= 0.7248ms, loss=0.5365
36: t= 1.0034ms, loss=0.5345
37: t= 0.9484ms, loss=0.5324
38: t= 0.7576ms, loss=0.5304
39: t= 1.0458ms, loss=0.5284
40: t= 0.7601ms, loss=0.5265
41: t= 0.7100ms, loss=0.5246
42: t= 0.9251ms, loss=0.5226
43: t= 0.7217ms, loss=0.5208
44: t= 91.9540ms, loss=0.5189
45: t= 786.0746ms, loss=0.5171
RESETTING ACCELERATOR
46: t= 788.1626ms, loss=0.5153
47: t= 123.3696ms, loss=0.5135
48: t= 0.8424ms, loss=0.5117
49: t= 1.1369ms, loss=0.5100
50: t= 0.7085ms, loss=0.5083
51: t= 0.6685ms, loss=0.5066
52: t= 0.6529ms, loss=0.5049
53: t= 0.6533ms, loss=0.5032
54: t= 0.9769ms, loss=0.5016
55: t= 0.8116ms, loss=0.5000
56: t= 0.7136ms, loss=0.4984
57: t= 0.7007ms, loss=0.4968
58: t= 0.7042ms, loss=0.4953
59: t= 0.6730ms, loss=0.4938
60: t= 0.7461ms, loss=0.4922
61: t= 0.8517ms, loss=0.4908
62: t= 0.6955ms, loss=0.4893
63: t= 0.6702ms, loss=0.4878
64: t= 0.8668ms, loss=0.4864
65: t= 0.6917ms, loss=0.4850
66: t= 0.7539ms, loss=0.4836
67: t= 382.5853ms, loss=0.4822
RESETTING ACCELERATOR
68: t= 771.6581ms, loss=0.4808
69: t= 121.2338ms, loss=0.4794
70: t= 0.7461ms, loss=0.4781
71: t= 0.6759ms, loss=0.4768
72: t= 0.7345ms, loss=0.4755
73: t= 0.7103ms, loss=0.4742
74: t= 0.6846ms, loss=0.4729
75: t= 0.6466ms, loss=0.4716
76: t= 0.6911ms, loss=0.4704
77: t= 0.7243ms, loss=0.4692
78: t= 0.7102ms, loss=0.4679
79: t= 0.6444ms, loss=0.4667
80: t= 517.9366ms, loss=0.4655
RESETTING ACCELERATOR
81: t= 766.7547ms, loss=0.4644
82: t= 122.3494ms, loss=0.4632
83: t= 0.7614ms, loss=0.4621
84: t= 0.7113ms, loss=0.4609
85: t= 0.8432ms, loss=0.4598
86: t= 0.6827ms, loss=0.4587
87: t= 0.7170ms, loss=0.4576
88: t= 0.6558ms, loss=0.4565
89: t= 160.6680ms, loss=0.4554
90: t= 514.8999ms, loss=0.4544
RESETTING ACCELERATOR
91: t= 834.2871ms, loss=0.4533
92: t= 391.6402ms, loss=0.4523
93: t= 0.7541ms, loss=0.4512
94: t= 0.7084ms, loss=0.4502
95: t= 0.6987ms, loss=0.4492
96: t= 0.6704ms, loss=0.4482
97: t= 0.6735ms, loss=0.4472
98: t= 0.6698ms, loss=0.4463
99: t= 0.6463ms, loss=0.4453
100: t= 0.7614ms, loss=0.4443
/mattias
More curious discoveries: I tried changing the flags for the context as I create it, and none seem to make any significant difference (some slow things down, but that's to be expected given what they supposedly do), except that EnableVerifier makes things worse (tested over multiple runs):
0: t= 857.6716ms, loss=0.6890
1: t= 12.0856ms, loss=0.6624
2: t= 0.8345ms, loss=0.6452
3: t= 0.6674ms, loss=0.6336
4: t= 0.6780ms, loss=0.6253
5: t= 0.6437ms, loss=0.6190
6: t= 0.7144ms, loss=0.6140
7: t= 0.7020ms, loss=0.6096
8: t= 0.6499ms, loss=0.6058
9: t= 0.7062ms, loss=0.6023
10: t= 0.6627ms, loss=0.5990
11: t= 0.6788ms, loss=0.5959
12: t= 0.6532ms, loss=0.5929
13: t= 0.6959ms, loss=0.5900
14: t= 0.6830ms, loss=0.5871
15: t= 0.7105ms, loss=0.5843
16: t= 0.6819ms, loss=0.5816
17: t= 0.6774ms, loss=0.5789
18: t= 0.7537ms, loss=0.5763
19: t= 0.7380ms, loss=0.5737
20: t= 0.8696ms, loss=0.5711
21: t= 0.7810ms, loss=0.5686
22: t= 289.1939ms, loss=0.5661
RESETTING CONTEXT AND ACCELERATOR
23: t= 815.1470ms, loss=0.5636
24: t= 554.8228ms, loss=0.5612
25: t= 503.5454ms, loss=0.5588
26: t= 516.0331ms, loss=0.5564
27: t= 350.6719ms, loss=0.5541
RESETTING CONTEXT AND ACCELERATOR
28: t= 757.6347ms, loss=0.5518
29: t= 619.3011ms, loss=0.5495
30: t= 612.0122ms, loss=0.5473
31: t= 600.3412ms, loss=0.5451
32: t= 583.7251ms, loss=0.5429
RESETTING CONTEXT AND ACCELERATOR
33: t= 853.0368ms, loss=0.5407
34: t= 520.6782ms, loss=0.5386
35: t= 593.3150ms, loss=0.5365
36: t= 655.8258ms, loss=0.5345
37: t= 481.2002ms, loss=0.5324
RESETTING CONTEXT AND ACCELERATOR
Here the app crashes with this exception:
ILGPU.Runtime.Cuda.CudaException
HResult=0x80131500
Message=invalid argument
Source=ILGPU
StackTrace:
at ILGPU.Runtime.Cuda.CudaMemoryBuffer`2.Dispose(Boolean disposing)
at ILGPU.Util.DisposeBase.Finalize()
I think I solved it, and I would argue it looks like a bug in ILGPU (more details to follow): when I started caching the kernels, instead of calling Accelerator.Current.LoadAutoGroupedStreamKernel for each use, BOTH problems went away. So YAY!
/m
So, after having routed all my kernel creations through a static cache, timings are looking great - both when running in the console app and when running in the test runner.
The problem with me caching the kernels, instead of the Accelerator doing it, is that I may destroy the accelerator and create a new one, and then I must also make sure to destroy all the cached kernels. If I could ask the accelerator for a kernel and have it cache the kernel, then that would be awesome, because the Accelerator is able to perform the housekeeping in a way that I can't.
The reason I didn't cache the Kernels to begin with is that 1) it's hard to know when they should be destroyed, and 2) the method for loading the kernels from the accelerator is actually very fast, so I assumed that it was caching to begin with.
Anyway, thanks for an awesome tool. I hope I helped find an issue that might otherwise have troubled someone else in the future.
cheers, m
Console result
Testing using GeForce GTX 1080 [WarpSize: 32, MaxNumThreadsPerGroup: 1024, MemorySize: 8589934592]
Loaded 150 samples
0: t= 827.8274ms, loss=0.6890
1: t= 0.6130ms, loss=0.6624
2: t= 2.9909ms, loss=0.6452
3: t= 7.2784ms, loss=0.6336
4: t= 0.6378ms, loss=0.6253
5: t= 0.4987ms, loss=0.6190
6: t= 0.5512ms, loss=0.6140
7: t= 0.4629ms, loss=0.6096
8: t= 0.4746ms, loss=0.6058
9: t= 0.4781ms, loss=0.6023
10: t= 0.5904ms, loss=0.5990
11: t= 0.6889ms, loss=0.5959
12: t= 0.5587ms, loss=0.5929
13: t= 0.4982ms, loss=0.5900
14: t= 0.5879ms, loss=0.5871
15: t= 0.5047ms, loss=0.5843
16: t= 0.5102ms, loss=0.5816
17: t= 0.5384ms, loss=0.5789
18: t= 0.5250ms, loss=0.5763
19: t= 0.4770ms, loss=0.5737
20: t= 0.5387ms, loss=0.5711
21: t= 0.6824ms, loss=0.5686
22: t= 0.6580ms, loss=0.5661
23: t= 0.4911ms, loss=0.5636
24: t= 0.5646ms, loss=0.5612
25: t= 0.5870ms, loss=0.5588
26: t= 0.5276ms, loss=0.5564
27: t= 0.5433ms, loss=0.5541
28: t= 0.5559ms, loss=0.5518
29: t= 0.5305ms, loss=0.5495
30: t= 0.5893ms, loss=0.5473
31: t= 0.5936ms, loss=0.5451
32: t= 0.5491ms, loss=0.5429
33: t= 0.5837ms, loss=0.5407
34: t= 0.6952ms, loss=0.5386
35: t= 0.6050ms, loss=0.5365
36: t= 0.5137ms, loss=0.5345
37: t= 0.5186ms, loss=0.5324
38: t= 0.4974ms, loss=0.5304
39: t= 0.5727ms, loss=0.5284
40: t= 0.5132ms, loss=0.5265
41: t= 0.5191ms, loss=0.5246
42: t= 0.5167ms, loss=0.5226
43: t= 0.4948ms, loss=0.5208
44: t= 0.5811ms, loss=0.5189
45: t= 0.7471ms, loss=0.5171
46: t= 0.5102ms, loss=0.5153
47: t= 0.5286ms, loss=0.5135
48: t= 0.5055ms, loss=0.5117
49: t= 0.4976ms, loss=0.5100
50: t= 0.5602ms, loss=0.5083
51: t= 0.5025ms, loss=0.5066
52: t= 0.5137ms, loss=0.5049
53: t= 0.5069ms, loss=0.5032
54: t= 0.6034ms, loss=0.5016
55: t= 1.3317ms, loss=0.5000
56: t= 0.5341ms, loss=0.4984
57: t= 0.5827ms, loss=0.4968
58: t= 0.7681ms, loss=0.4953
59: t= 0.5611ms, loss=0.4938
60: t= 0.5152ms, loss=0.4922
61: t= 0.5349ms, loss=0.4908
62: t= 0.5911ms, loss=0.4893
63: t= 1.4779ms, loss=0.4878
64: t= 0.5332ms, loss=0.4864
65: t= 0.5370ms, loss=0.4850
66: t= 0.6962ms, loss=0.4836
67: t= 0.5503ms, loss=0.4822
68: t= 0.5547ms, loss=0.4808
69: t= 0.5227ms, loss=0.4794
70: t= 0.5303ms, loss=0.4781
71: t= 0.8966ms, loss=0.4768
72: t= 0.6373ms, loss=0.4755
73: t= 0.5794ms, loss=0.4742
74: t= 0.6610ms, loss=0.4729
75: t= 0.5640ms, loss=0.4716
76: t= 0.5283ms, loss=0.4704
77: t= 0.6675ms, loss=0.4692
78: t= 0.5675ms, loss=0.4679
79: t= 0.6680ms, loss=0.4667
80: t= 0.5469ms, loss=0.4655
81: t= 0.5457ms, loss=0.4644
82: t= 5.4809ms, loss=0.4632
83: t= 9.9123ms, loss=0.4621
84: t= 3.2119ms, loss=0.4609
85: t= 3.0283ms, loss=0.4598
86: t= 3.1048ms, loss=0.4587
87: t= 3.1099ms, loss=0.4576
88: t= 3.5918ms, loss=0.4565
89: t= 3.3611ms, loss=0.4554
90: t= 2.8604ms, loss=0.4544
91: t= 3.4398ms, loss=0.4533
92: t= 3.4455ms, loss=0.4523
93: t= 3.8213ms, loss=0.4512
94: t= 2.9244ms, loss=0.4502
95: t= 2.8418ms, loss=0.4492
96: t= 3.0769ms, loss=0.4482
97: t= 4.9128ms, loss=0.4472
98: t= 0.6052ms, loss=0.4463
99: t= 0.5811ms, loss=0.4453
100: t= 0.4995ms, loss=0.4443
** TIMINGS **
SkipFirstEntry=True
(Allocate MemoryBuffer 1) : count= 908, min= 0ms, max= 1ms, avg= 0.0148ms, total= 13.47ms
(Allocate MemoryBuffer 12) : count= 202, min= 0ms, max= 0ms, avg= 0.0143ms, total= 2.88ms
(Allocate MemoryBuffer 150) : count= 807, min= 0ms, max= 1ms, avg= 0.0151ms, total= 12.20ms
(Allocate MemoryBuffer 3) : count= 101, min= 0ms, max= 0ms, avg= 0.0148ms, total= 1.50ms
(Allocate MemoryBuffer 450) : count= 1111, min= 0ms, max= 1ms, avg= 0.0150ms, total= 16.72ms
(Allocate MemoryBuffer 600) : count= 101, min= 0ms, max= 0ms, avg= 0.0143ms, total= 1.44ms
(Tensor.GetAsArray 1) : count= 100, min= 0ms, max= 0ms, avg= 0.0782ms, total= 7.82ms
Add2d : count= 100, min= 0ms, max= 0ms, avg= 0.0154ms, total= 1.54ms
Allocate MemoryBuffer : count= 3235, min= 0ms, max= 1ms, avg= 0.0149ms, total= 48.32ms
BinaryCrossentropy : count= 100, min= 0ms, max= 0ms, avg= 0.0225ms, total= 2.25ms
Clear : count= 1312, min= 0ms, max= 0ms, avg= 0.0041ms, total= 5.44ms
MatMul-22 : count= 201, min= 0ms, max= 0ms, avg= 0.0061ms, total= 1.22ms
PointMul-ElementWise : count= 302, min= 0ms, max= 0ms, avg= 0.0049ms, total= 1.47ms
Scale : count= 403, min= 0ms, max= 1ms, avg= 0.0232ms, total= 9.34ms
ScaledAddInPlace : count= 1110, min= 0ms, max= 0ms, avg= 0.0074ms, total= 8.18ms
ScaleInPlace : count= 100, min= 0ms, max= 0ms, avg= 0.0088ms, total= 0.88ms
Sigmoid : count= 100, min= 0ms, max= 0ms, avg= 0.0271ms, total= 2.71ms
Sub-ElementWise : count= 100, min= 0ms, max= 1ms, avg= 0.0102ms, total= 1.02ms
Sum01 : count= 100, min= 0ms, max= 0ms, avg= 0.0041ms, total= 0.41ms
Sum10 : count= 100, min= 0ms, max= 0ms, avg= 0.0047ms, total= 0.47ms
Sum-Reduce : count= 201, min= 0ms, max= 7ms, avg= 0.0949ms, total= 19.07ms
Tensor.GetAsArray : count= 100, min= 0ms, max= 0ms, avg= 0.0782ms, total= 7.82ms
Tensor.SetData : count= 3, min= 0ms, max= 0ms, avg= 0.0419ms, total= 0.13ms
Unsum-Unsum1d : count= 100, min= 0ms, max= 0ms, avg= 0.0045ms, total= 0.45ms
Unsum-Unsum2d : count= 100, min= 0ms, max= 0ms, avg= 0.0054ms, total= 0.54ms
Crude Sum (may contain duplicated counts): total=167.28ms
DONE! Press any key to continue.
Test Runner Result
0: t= 606.1201ms, loss=0.6890
10: t= 0.5291ms, loss=0.5990
20: t= 0.4463ms, loss=0.5711
30: t= 0.5376ms, loss=0.5473
40: t= 0.5908ms, loss=0.5265
50: t= 3.1788ms, loss=0.5083
60: t= 0.4491ms, loss=0.4922
70: t= 0.4456ms, loss=0.4781
80: t= 0.4600ms, loss=0.4655
90: t= 0.5271ms, loss=0.4544
100: t= 0.4618ms, loss=0.4443
@mfagerlund Great work with tracking down the issue :+1:
I'm not sure if this will work for you, but in my project, I define a singleton class called GpuContext that holds my Context and Accelerator, and also a cache of my compiled kernels. This GpuContext class then gets registered using Dependency Injection so that the rest of my application can have access. That way, I only need to compile each kernel once for the lifetime of the application.
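For example (a sketch; I happen to use Microsoft.Extensions.DependencyInjection here, but any container would do):

// Registered once at startup: the Context, Accelerator and kernel cache
// then live for the whole lifetime of the application.
services.AddSingleton<GpuContext>();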
Yeah, I would have done that too, but I'm building a framework so what kernels will be present is unknown. Now I run them through a cache so I've been able to solve the problem for now. Also, I'm up to... how do I count my kernels....
I'm up to 58 kernels - and I expect more, so placing them all in one class would be cumbersome ;)
Ah, I see.
I used a generic cache in GpuContext rather than coding all the kernels. The class provides a RegisterKernel method that stores the kernel in a ConcurrentDictionary<KernelToken, Delegate>. NB: KernelToken is just a struct that holds a single int.
Some class, e.g. SigmoidCalculator, will register the kernel on startup (or first use) and get back a token. Later, when it needs to use the kernel, it will pass the token to GetKernel to get back the kernel delegate.
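In outline, something like this (a sketch based on the description above; the member bodies are my assumptions):

using System;
using System.Collections.Concurrent;
using System.Threading;

public readonly struct KernelToken
{
    public KernelToken(int id) => Id = id;
    public int Id { get; }
}

public sealed class GpuContext
{
    // Compiled kernels, keyed by the token handed out at registration time.
    private readonly ConcurrentDictionary<KernelToken, Delegate> kernels =
        new ConcurrentDictionary<KernelToken, Delegate>();

    private int nextId;

    public KernelToken RegisterKernel(Delegate kernel)
    {
        var token = new KernelToken(Interlocked.Increment(ref nextId));
        kernels[token] = kernel;
        return token;
    }

    public TDelegate GetKernel<TDelegate>(KernelToken token)
        where TDelegate : Delegate =>
        (TDelegate)kernels[token];
}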
What I want to accomplish is to minimize the amount of boilerplate I need to write for each method, and to avoid creating some kind of separate storage for each method. So my code basically looks like this:
public static Tensor Sigmoid(Tensor input) => KernelExecutor.Execute(MethodBase.GetCurrentMethod().Name, SigmoidKernel, input);
public static Tensor Tanh(Tensor input) => KernelExecutor.Execute(MethodBase.GetCurrentMethod().Name, TanhKernel, input);
The name of the kernel is passed to the executor for logging/performance-measurement reasons. KernelExecutor is responsible for caching and executing the method.
public static Dictionary<object, object> KernelCache { get; private set; } = new Dictionary<object, object>();
public static TDelegate GetOrLoadKernel<TDelegate>(string kernelName, TDelegate action, Func<TDelegate> loadFunc)
{
return (TDelegate)KernelCache.GetOrCreate(action, () => loadFunc());
}
internal static Tensor Execute(string kernelName, Action<Index1, ArrayView<float>, ArrayView<float>> action, Tensor indata)
{
var sw = Stopwatch.StartNew();
var kernel = GetOrLoadKernel(kernelName, action, () => Accelerator.Current.LoadAutoGroupedStreamKernel(action));
var result = new Tensor(indata.Shape, false).SetName(kernelName);
kernel(indata.MemoryBuffer.Extent, indata.MemoryBuffer, result.MemoryBuffer);
TensorTimings.AddTiming(kernelName, sw);
return result;
}
Note how the cache is a Dictionary<object,object>, which is kinda brutal, but I didn't want to create one cache per kernel call type.
To be honest, my code doesn't look quite like that, because I need to store gradients in the process, so Sigmoid actually looks like this:
private static void SigmoidKernel(Index1 index, ArrayView<float> indata, ArrayView<float> result) { result[index] = 1 / (1 + XMath.Exp(-indata[index])); }
private static void SigmoidGradientKernel(Index1 index, ArrayView<float> grad, ArrayView<float> res, ArrayView<float> result) { result[index] = grad[index] * res[index] * (1 - res[index]); }
public static Tensor Sigmoid(Tensor input)
{
var result = KernelExecutor.Execute(MethodBase.GetCurrentMethod().Name, SigmoidKernel, input);
GradientTape.Tape(input, result,
(GradientTape.TensorGradient ainput, GradientTape.TensorGradient aresult) =>
{
if (ainput.SourceWantsGradients)
{
ainput.AddGradient(aresult, KernelExecutor.Execute("SigmoidGradient", SigmoidGradientKernel, aresult.Gradient, result));
//ainput.AddGradient(aresult, aresult.Gradient * result * (1 - result));
}
});
return result;
}
GetOrCreate is a helper extension method that looks like this:
public static TValue GetOrCreate<TKey, TValue>(this Dictionary<TKey, TValue> dictionary, TKey key, Func<TValue> constructor)
{
if (!dictionary.TryGetValue(key, out TValue value))
{
value = constructor();
dictionary.Add(key, value);
}
return value;
}
OK, I have an idea that might work, but I'll need some time to try it out.
In the meantime, you should have a look at these attributes. In particular, CallerMemberNameAttribute should simplify the need to pass in the method name each time.
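For example, applied to your Execute method (a sketch; apart from the [CallerMemberName] parameter, everything is copied from the code above):

using System.Runtime.CompilerServices;

// The compiler substitutes the caller's member name at compile time, so call
// sites no longer need to pass MethodBase.GetCurrentMethod().Name explicitly.
internal static Tensor Execute(
    Action<Index1, ArrayView<float>, ArrayView<float>> action,
    Tensor indata,
    [CallerMemberName] string kernelName = "")
{
    var sw = Stopwatch.StartNew();
    var kernel = GetOrLoadKernel(kernelName, action,
        () => Accelerator.Current.LoadAutoGroupedStreamKernel(action));
    var result = new Tensor(indata.Shape, false).SetName(kernelName);
    kernel(indata.MemoryBuffer.Extent, indata.MemoryBuffer, result.MemoryBuffer);
    TensorTimings.AddTiming(kernelName, sw);
    return result;
}

// Call sites then shrink to:
public static Tensor Sigmoid(Tensor input) => KernelExecutor.Execute(SigmoidKernel, input);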
Actually, perhaps I should try to create a better/tighter piece of code that demonstrates the error. The error is more important than ILGPU caching the methods - if the error (and the performance degradation) didn't happen, it would probably be just fine for 99.5% of all users.
I opened a new issue that's more specific to the actual problem: https://github.com/m4rs-mt/ILGPU/issues/192
MoFtZ, thanks for your help!
@mfagerlund OK, here is the KernelExecutor class that I was thinking about.
It uses a dictionary to store kernels, using the method itself as the dictionary key.
The Execute method takes a static method as input and returns a "simplified kernel". This simplified kernel is of the form Func<Input, Output> or Func<Input1, Input2, Output>.
Nice - what's the advantage of passing out funcs from the method instead of executing the code? As opposed to this:
public static MemoryBuffer<float> Execute2(
Action<Index1, ArrayView<float>, ArrayView<float>> action,
MemoryBuffer<float> input1, [CallerMemberName] string callerMethodName = "")
{
Console.WriteLine($"{action.Method.Name} called from {callerMethodName}");
var kernel = GetOrAddKernel(action);
var results = Accelerator.Allocate<float>(input1.Length);
kernel(input1.Length, input1.View, results.View);
Accelerator.Synchronize();
return results;
}
It looks like you're preparing to cache the result of the call by passing back a func - but it's already cached, and a second (third?) level of caching won't really help you?
Also, y'all love to synchronize - I've found that I don't need to synchronize at all. I think that as long as you're working on a single stream, any data read you perform will be synchronized anyway, so calling Synchronize will just stall the pipeline. Just keep stuffing the pipeline until you're ready to read, and then the framework will synchronize anyway. Or am I missing something? If you're able to use multiple pipelines, then things change a lot.
When adding new kernels, I use this method, which provides type safety, making sure that the MethodInfo matches the Func that's used if the kernel isn't found:
public static TDelegate GetOrLoadKernel<TDelegate>(string kernelName, TDelegate action, Func<TDelegate> loadFunc)
{
return (TDelegate)KernelCache.GetOrCreate(action, () => loadFunc());
}
(KernelCache is a Dictionary<object,object>)
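As an aside, a possible refinement (just a sketch): since C# 7.3 the type parameter can also be constrained to Delegate, which rules out some misuse at compile time:

public static TDelegate GetOrLoadKernel<TDelegate>(string kernelName, TDelegate action, Func<TDelegate> loadFunc)
    where TDelegate : Delegate
{
    return (TDelegate)KernelCache.GetOrCreate(action, () => loadFunc());
}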
In this (allowed) call to your method, things will fail because the load action doesn't return the type of action that is expected - and that won't be detected until run time?
public static Func<MemoryBuffer<float>, MemoryBuffer<float>> Execute(
Action<Index1, ArrayView<float>, ArrayView<float>, float> action,
[CallerMemberName] string callerMethodName = "")
{
Console.WriteLine($"{action.Method.Name} called from {callerMethodName}");
var kernel = GetOrAddKernel(action);
return input1 =>
{
var results = Accelerator.Allocate<float>(input1.Length);
kernel(input1.Length, input1.View, results.View, -5);
Accelerator.Synchronize();
return results;
};
}
Interesting stuff, thanks!
/m
@Mattias-NCVIB @MoFtZ I also often use the [CallerMemberName] attribute since it will be resolved at compile time and can conveniently be mapped to the actual caller 🤓
@mfagerlund Your example works too. I was using the Func so that the caller could treat it like a method call, without having to supply the kernel. And good pick up with using TDelegate, makes it less error prone.
In Cuda, streams are a FIFO queue - if you don't specify a stream, you'll use the default stream. The reason for calling Synchronize is so that when the function returns the MemoryBuffer<float>, the contents are guaranteed to be filled with the result. Otherwise, if the GPU is busy, the buffer might be uninitialised. Note that calling MemoryBuffer.GetAsArray() internally calls Synchronize, so I often rely on that behaviour rather than explicitly calling Synchronize.
The reason for using multiple streams is so that the GPU can parallelize operations. A quick search found me this slideshow: https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
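For example (a sketch; Step, bufA and bufB are placeholders):

// LoadAutoGroupedKernel (without "Stream") returns a kernel whose first
// argument is the AcceleratorStream to launch on, so independent work can
// be queued on different streams and may overlap on the GPU.
var kernel = accelerator.LoadAutoGroupedKernel<Index1, ArrayView<float>>(Step);

using (var stream1 = accelerator.CreateStream())
using (var stream2 = accelerator.CreateStream())
{
    kernel(stream1, bufA.Length, bufA.View);
    kernel(stream2, bufB.Length, bufB.View);

    // Each stream is synchronized independently.
    stream1.Synchronize();
    stream2.Synchronize();
}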
Hi, I'll admit that I'm really new to ILGPU and I haven't done much GPU programming, but I'm finding it difficult to work on the performance of my system. My system (Hagrid) is an auto-differentiating (machine learning/NN) framework that can perform automatic gradient descent.
I'm performing a complete rewrite of an older system that's hand-coded for multi-threaded CPU. My hand-coded CPU version is way (1000x) faster than the GPU version, and I'm unsure what to do. I think I'm stalling the accelerator stream and that I should use several streams? But I'm not sure how - how do you know what stream to send what kernel to? Or do you create one stream per kernel?
(Marcel Köster, I'm just about to read your paper https://umtl.cs.uni-saarland.de/paper_preprints/paper_koester_ptars_19.pdf to see if there's something that I can learn from there!)
Note that the NN basically MatMuls a 150x4 matrix with a 3x4 weight matrix, then backpropagates the results. The matrices are small, to be sure, but I'm doing a single large batch each epoch.
Performance is horrible when running on the GPU. When I run 100 generations of a trivial NN, my old version takes 0.060s.
This is the old CPU version
This is the ILGPU CPU version (in release mode from a nunit test)
And this, horror of horrors, is the GPU version;
Yes, 63 seconds - that's 1000 times slower. Clearly I'm doing something wrong - I have been able to run small tight kernels at full speed and the first couple of epoch can run fairly fast, then it chokes.
I added very detailed logging to see which of my kernels were slow - but they're all slow. I'm thinking it's because they get queued up waiting for the previous one?
Here are the timings from my kernels:
Here are the timings when running on CPU:
As you can see, even element wise Sigmoid, which takes 17ms for 101 runs on CPU, takes 3727ms on GPU. Ouch.
Any insights would be more than welcome, where should I start looking? If anyone wants to have a look at the code, I can upload it to GitHub.
cheers, /mattias