Open seunghwak opened 2 years ago
@seunghwak have you prototyped this with raw CUDA streams to verify that you get the benefits you expect?
Yes, this was used to produce the PageRank demo results for GTC 2020. I'm working on bringing this optimization to cuGraph.
Can you share some specifics (speedups) here to help motivate?
Thanks; I will collect numbers in a few days (after fixing another issue I am currently working on) and post them here.
So, this varies with the input graph and the target system (relative speed of computation vs. network bandwidth), but for Graph500-style input graphs on 16 GPUs, using priorities cut 15% of the execution time (relative to multi-stream execution at a single priority) for the most expensive kernel in PageRank, and 12% of the total PageRank execution time.
Currently I have tested this in only one part of the cuGraph code, but there are many other places to apply the same optimization, so it could have a similar impact on a range of graph algorithms.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
@seunghwak not working on this yet, but thinking about it. How many priorities do you use in your example?
Is your feature request related to a problem? Please describe. In multi-node multi-GPU execution, running a communication-bound kernel at higher priority helps overlap communication with computation.
Say kernel A does little computation but heavy inter-GPU communication, while kernel B does far more computation but requires only a little inter-GPU communication.
If we run kernel A on a higher-priority stream, kernel A finishes its computation as soon as possible and starts communicating while kernel B is still busy with local computation; this effectively overlaps communication with computation.
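This strategy can already be sketched with the raw CUDA runtime API; the kernel names and launch parameters below are placeholders, and the snippet only sets up the two streams (it needs a CUDA-capable machine to do anything useful):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Query the device's supported priority range; numerically lower
  // values mean higher priority.
  int least = 0, greatest = 0;
  if (cudaDeviceGetStreamPriorityRange(&least, &greatest) != cudaSuccess) {
    std::printf("no CUDA device available\n");
    return 0;
  }

  cudaStream_t high_prio, low_prio;
  // Kernel A (communication-bound) goes on the highest-priority stream.
  cudaStreamCreateWithPriority(&high_prio, cudaStreamNonBlocking, greatest);
  // Kernel B (compute-bound) runs at the lowest priority.
  cudaStreamCreateWithPriority(&low_prio, cudaStreamNonBlocking, least);

  // Placeholder launches (kernel_a/kernel_b are hypothetical):
  // kernel_a<<<grid, block, 0, high_prio>>>(...);  // finishes early, starts comm
  // kernel_b<<<grid, block, 0, low_prio>>>(...);   // overlaps with A's comm

  std::printf("priority range: %d (least) .. %d (greatest)\n", least, greatest);
  cudaStreamDestroy(high_prio);
  cudaStreamDestroy(low_prio);
  return 0;
}
```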
rmm's cuda_stream_pool currently does not support stream priorities, so I cannot implement this strategy using rmm::cuda_stream_pool.
Describe the solution you'd like I should be able to get a higher-priority stream using (a variant of)
get_stream
, or be able to create a stream pool with a given priority. A single pool could manage streams with different priorities (partitioned by index range), or the application could maintain multiple pools (one per priority).
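The one-pool-per-priority variant might look like the following; the class name and interface are hypothetical (this is not the RMM API), but it mirrors rmm::cuda_stream_pool's round-robin get_stream():

```cpp
#include <cuda_runtime.h>
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a pool whose streams all share one CUDA priority.
// An application would keep one instance per priority level.
class priority_stream_pool {
 public:
  priority_stream_pool(std::size_t size, int priority) : streams_(size) {
    for (auto& s : streams_) {
      cudaStreamCreateWithPriority(&s, cudaStreamNonBlocking, priority);
    }
  }
  ~priority_stream_pool() {
    for (auto& s : streams_) { cudaStreamDestroy(s); }
  }
  // Round-robin handout, like rmm::cuda_stream_pool::get_stream().
  cudaStream_t get_stream() {
    return streams_[next_.fetch_add(1) % streams_.size()];
  }

 private:
  std::vector<cudaStream_t> streams_;
  std::atomic<std::size_t> next_{0};
};

// Usage sketch (priority values assume the common -1..0 device range):
// priority_stream_pool high_pool{4, /*priority=*/-1};  // communication-bound work
// priority_stream_pool low_pool{4, /*priority=*/0};    // compute-bound work
```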