Hi, I'll admit that I'm really new to ILGPU and I haven't done much GPU programming, and I'm finding it difficult to improve the performance of my system. My system (Hagrid) is an auto-differentiating machine learning/NN library that performs automatic gradient descent.
I'm doing a complete rewrite of an older system that was hand-coded for multi-threaded CPU. My hand-coded CPU version is way (1000x) faster than the GPU version and I'm unsure what to do. I think I'm stalling the accelerator stream and that I should use several streams? But I'm not sure how - how do you know which stream to send which kernel to? Or do you create one stream per kernel?
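For reference, here's my (possibly wrong) understanding of what explicit streams would look like in ILGPU - just a sketch of the pattern I'm imagining, not actual code from Hagrid:

```csharp
// Sketch only: my understanding of ILGPU's explicit-stream API.
// LoadAutoGroupedKernel (as opposed to LoadAutoGroupedStreamKernel) should
// return a launcher taking an AcceleratorStream as its first argument.
using ILGPU;
using ILGPU.Runtime;

class StreamSketch
{
    static void AddOne(Index1D i, ArrayView<float> data) => data[i] += 1f;

    static void Main()
    {
        using var context = Context.CreateDefault();
        using var accelerator = context.GetPreferredDevice(preferCPU: false)
                                       .CreateAccelerator(context);

        var kernel = accelerator
            .LoadAutoGroupedKernel<Index1D, ArrayView<float>>(AddOne);

        using var stream1 = accelerator.CreateStream();
        using var stream2 = accelerator.CreateStream();
        using var bufA = accelerator.Allocate1D<float>(1024);
        using var bufB = accelerator.Allocate1D<float>(1024);

        // Independent kernels launched on different streams may overlap;
        // kernels on the same stream run in launch order.
        kernel(stream1, (int)bufA.Length, bufA.View);
        kernel(stream2, (int)bufB.Length, bufB.View);

        stream1.Synchronize();
        stream2.Synchronize();
    }
}
```

Is that roughly the idea - one stream per chain of dependent kernels, rather than one per kernel?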
(Marcel Köster, I'm just about to read your paper https://umtl.cs.uni-saarland.de/paper_preprints/paper_koester_ptars_19.pdf to see if there's something that I can learn from there!)
Note that the NN is basically matrix-multiplying a 150x4 matrix with a 3x4 weight matrix, then backpropagating the results. The matrices are small, to be sure, but I'm doing a single large batch each epoch.
Performance is horrible when running on the GPU. When I run 100 generations of a trivial NN, my old version takes 0.060s.
This is the old CPU version:
This is the ILGPU CPU version (in release mode from an NUnit test):
And this, horror of horrors, is the GPU version:
Yes, 63 seconds - that's 1000 times slower. Clearly I'm doing something wrong - I have been able to run small tight kernels at full speed, and the first couple of epochs can run fairly fast, but then it chokes.
I added very detailed logging to see which of my kernels were slow - but they're all slow. I'm thinking it's because they get queued up waiting for the previous one?
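One thing I'm starting to suspect: since kernel launches are asynchronous, maybe my per-kernel timings are misleading unless I synchronize before stopping the stopwatch, so the cost of one kernel shows up in whatever call happens to block next. A sketch of what I mean (hypothetical names, not my actual logging code):

```csharp
// Sketch: why naive per-kernel timing can mislead with async launches.
// Without Synchronize(), the stopwatch measures only the cheap enqueue;
// the real GPU work completes later, inside some other timed region.
using System;
using System.Diagnostics;
using ILGPU;
using ILGPU.Runtime;

class TimingSketch
{
    static void Sigmoid(Index1D i, ArrayView<float> data) =>
        data[i] = 1f / (1f + XMath.Exp(-data[i]));

    static void Main()
    {
        using var context = Context.CreateDefault();
        using var accelerator = context.GetPreferredDevice(preferCPU: false)
                                       .CreateAccelerator(context);
        var kernel = accelerator
            .LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(Sigmoid);
        using var buffer = accelerator.Allocate1D<float>(150 * 4);

        var sw = Stopwatch.StartNew();
        kernel((int)buffer.Length, buffer.View); // async: returns immediately
        accelerator.Synchronize();               // wait for the GPU to finish
        sw.Stop();
        Console.WriteLine($"Sigmoid took {sw.ElapsedMilliseconds} ms");
    }
}
```

Should I be bracketing every timed launch with a Synchronize() like this to find the real culprit?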
Here are the timings from my kernels:
Here are the timings when running on the CPU:
As you can see, even element-wise Sigmoid, which takes 17ms for 101 runs on the CPU, takes 3727ms on the GPU. Ouch.
Any insights would be more than welcome - where should I start looking? If anyone wants to have a look at the code, I can upload it to GitHub.
cheers, /mattias