haroonsyed / Rust-Machine-Learning

ML/DL Algorithms from scratch...in rust & cuda 😎

Cuda Graphs! #34

Open haroonsyed opened 11 months ago

haroonsyed commented 11 months ago

So I have been implementing "packed" operations in order to hide the overhead of launching a bunch of small kernels. I just learned there is actually an existing solution for this: CUDA graphs!

I believe packed operations will still be faster if each operation is so small that several could run in parallel on the GPU. (I have not looked into graphs much yet; they may allow nodes at the same level to launch in parallel, which would be awesome.) But that advantage would only be a constant factor, bounded by the number of warps that could be launched at once (probably not more than 30x). However, I was noticing much worse slowdowns as the number of kernel launches increased, because more and more launch overhead accumulated.

In other words, scaling, for example, the number of layers in a CNN was not resulting in constant-time slowdowns.
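
To make the accumulated overhead concrete, here is a minimal Rust sketch of a toy cost model. Everything in it is an assumption for illustration: the `Cost` struct, the function names, and the microsecond values are hypothetical and not measured from this library. It also simplifies a packed or graph-based submission to paying the launch overhead once, which real hardware will only approximate.

```rust
/// Hypothetical per-kernel costs in microseconds (made-up values, not measurements).
#[derive(Clone, Copy)]
struct Cost {
    launch_overhead_us: f64,
    kernel_compute_us: f64,
}

/// Every small kernel is launched separately, so the overhead is paid `n` times.
fn separate_launches_us(c: Cost, n: usize) -> f64 {
    n as f64 * (c.launch_overhead_us + c.kernel_compute_us)
}

/// Simplifying assumption: one packed/graph submission pays the overhead once.
fn single_submission_us(c: Cost, n: usize) -> f64 {
    c.launch_overhead_us + n as f64 * c.kernel_compute_us
}

/// Fraction of each separately launched kernel's time that is pure overhead.
fn overhead_fraction(c: Cost) -> f64 {
    c.launch_overhead_us / (c.launch_overhead_us + c.kernel_compute_us)
}

fn main() {
    let c = Cost { launch_overhead_us: 5.0, kernel_compute_us: 2.0 };
    for n in [10, 100, 1000] {
        println!(
            "n = {:4}: separate = {:7.1} us, single submission = {:7.1} us",
            n,
            separate_launches_us(c, n),
            single_submission_us(c, n)
        );
    }
    println!("overhead share per small kernel: {:.2}", overhead_fraction(c));
}
```

Under these assumed numbers the overhead dominates whenever the kernels are small, which is the regime described above.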

I am going to continue along the path of using packed operations, since I believe it will be a valuable experience. But afterwards I do want to take a look at graphs.

https://developer.nvidia.com/blog/cuda-graphs/

haroonsyed commented 10 months ago

That being said, adding a packed version of every function is ballooning the amount of code in the matrix library...

haroonsyed commented 10 months ago

https://forums.developer.nvidia.com/t/using-multi-streams-in-cuda-graph-the-execution-order-is-uncontrolled/214672/3

Okay, so according to this post, an NVIDIA engineer states that a CUDA graph is analyzed so that nodes with no dependencies on one another can be launched in parallel across streams. That's awesome.
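
As a rough illustration of what "nodes with no dependencies on one another" means, the sketch below groups the nodes of a small dependency graph into levels, where every node in a level only depends on nodes from earlier levels. This is not how the CUDA driver schedules graphs internally (the post does not describe that); it is just a level-by-level topological sort, and `parallel_levels` and the example graph are hypothetical names invented here.

```rust
/// Groups the nodes of a DAG into levels. `deps[i]` lists the nodes that
/// node `i` depends on (indices must be < deps.len(), otherwise this panics).
/// Nodes in the same level have no dependencies on each other.
/// Returns `None` if the dependencies contain a cycle.
fn parallel_levels(deps: &[Vec<usize>]) -> Option<Vec<Vec<usize>>> {
    let n = deps.len();
    // Number of unfinished dependencies for each node.
    let mut remaining: Vec<usize> = deps.iter().map(|d| d.len()).collect();
    // Reverse edges: which nodes are waiting on a given node.
    let mut dependents: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (node, ds) in deps.iter().enumerate() {
        for &d in ds {
            dependents[d].push(node);
        }
    }

    // The first level is every node with no dependencies at all.
    let mut current: Vec<usize> = (0..n).filter(|&i| remaining[i] == 0).collect();
    let mut levels = Vec::new();
    let mut visited = 0;

    while !current.is_empty() {
        visited += current.len();
        let mut next = Vec::new();
        for &node in &current {
            for &child in &dependents[node] {
                remaining[child] -= 1;
                if remaining[child] == 0 {
                    next.push(child);
                }
            }
        }
        levels.push(current);
        current = next;
    }

    // If some node was never released, the graph has a cycle.
    if visited == n { Some(levels) } else { None }
}

fn main() {
    // Hypothetical graph: two independent matmuls (0, 1), an add that
    // needs both (2), and an activation that needs the add (3).
    let deps = vec![vec![], vec![], vec![0, 1], vec![2]];
    println!("{:?}", parallel_levels(&deps));
}
```

In this example nodes 0 and 1 land in the same level, so they are the ones that could in principle run concurrently.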

I will definitely check out graphs; they seem much nicer and cleaner for scenarios like mine, where launch overhead is overtaking compute.