rapidsai / rmm

RAPIDS Memory Manager
https://docs.rapids.ai/api/rmm/stable/
Apache License 2.0

[FEA] Priority support in cuda_stream_pool #939

Open seunghwak opened 2 years ago

seunghwak commented 2 years ago

Is your feature request related to a problem? Please describe. In multi-node multi-GPU execution, running a communication-bound kernel at higher priority helps overlap communication with computation.

Say kernel A does little computation but heavy inter-GPU communication, while kernel B does much more computation but requires only a little inter-GPU communication.

If we run kernel A on a higher-priority stream, kernel A finishes its computation as soon as possible and starts communicating while kernel B is still busy with local computation; this effectively overlaps communication with computation.
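For illustration, here is a minimal sketch of the strategy using raw CUDA streams (`kernel_a`, `kernel_b`, and the launch configuration are placeholders, not actual cuGraph kernels):

```cuda
#include <cuda_runtime.h>

// Placeholder kernels: kernel_a has little compute (its output feeds inter-GPU
// communication), kernel_b has heavy local compute.
__global__ void kernel_a() {}
__global__ void kernel_b() {}

int main() {
  // In CUDA, a numerically lower priority value means higher priority.
  int least_priority{};
  int greatest_priority{};
  cudaDeviceGetStreamPriorityRange(&least_priority, &greatest_priority);

  cudaStream_t high_priority_stream{};
  cudaStream_t low_priority_stream{};
  cudaStreamCreateWithPriority(&high_priority_stream, cudaStreamNonBlocking, greatest_priority);
  cudaStreamCreateWithPriority(&low_priority_stream, cudaStreamNonBlocking, least_priority);

  // Kernel A runs at higher priority so its small compute finishes early and the
  // subsequent communication can overlap with kernel B's local compute.
  kernel_a<<<256, 256, 0, high_priority_stream>>>();
  kernel_b<<<256, 256, 0, low_priority_stream>>>();
  // ... enqueue inter-GPU communication (e.g. NCCL) on high_priority_stream ...

  cudaStreamSynchronize(high_priority_stream);
  cudaStreamSynchronize(low_priority_stream);
  cudaStreamDestroy(high_priority_stream);
  cudaStreamDestroy(low_priority_stream);
  return 0;
}
```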

rmm's cuda_stream_pool currently does not support stream priorities, so I cannot implement this strategy using rmm::cuda_stream_pool.

Describe the solution you'd like I should be able to get a higher-priority stream via (a variant of) get_stream, or be able to create a stream pool with a given priority. A single pool could manage streams of different priorities (partitioned by index ranges), or the application could maintain multiple pools (one pool per priority); a sketch of what such a pool would encapsulate follows.
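For reference, here is a minimal sketch of what a priority-aware pool would encapsulate, built today from raw CUDA streams wrapped in rmm::cuda_stream_view (the class name and interface are placeholders, not a proposed rmm API):

```cpp
#include <rmm/cuda_stream_view.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

// Placeholder helper: a fixed-size pool of streams created at a single priority.
// A priority-aware rmm::cuda_stream_pool (or a get_stream() variant taking a
// priority) could encapsulate roughly this, either as one internal pool per
// priority or as index ranges mapped to priorities.
class priority_stream_pool {
 public:
  priority_stream_pool(std::size_t pool_size, int priority) : streams_(pool_size)
  {
    for (auto& s : streams_) {
      cudaStreamCreateWithPriority(&s, cudaStreamNonBlocking, priority);
    }
  }
  ~priority_stream_pool()
  {
    for (auto& s : streams_) { cudaStreamDestroy(s); }
  }
  priority_stream_pool(priority_stream_pool const&)            = delete;
  priority_stream_pool& operator=(priority_stream_pool const&) = delete;

  // Round-robin access, returning a non-owning rmm::cuda_stream_view.
  rmm::cuda_stream_view get_stream()
  {
    return rmm::cuda_stream_view{streams_[(next_++) % streams_.size()]};
  }

 private:
  std::vector<cudaStream_t> streams_;
  std::size_t next_{0};
};
```

An application could then keep one such pool at the device's greatest (highest) priority for communication-bound kernels and another at the default priority for compute-bound kernels.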

harrism commented 2 years ago

@seunghwak have you prototyped this with raw CUDA streams to verify that you get the benefits you expect?

seunghwak commented 2 years ago

> @seunghwak have you prototyped this with raw CUDA streams to verify that you get the benefits you expect?

Yes, this is what was used to produce the PageRank demo results for GTC 2020. I'm working on bringing this optimization to cuGraph.

harrism commented 2 years ago

Can you share some specifics (speedups) here to help motivate this?

seunghwak commented 2 years ago

Thanks. I will collect numbers in a few days (after fixing another issue I am currently working on) and post them here.

seunghwak commented 2 years ago

This varies with the input graph and the target system (relative speed of computation vs. network bandwidth), but for Graph500-style input graphs on 16 GPUs, using priorities cut the execution time of the most expensive kernel in PageRank by 15% (relative to multi-stream execution with uniform priority) and the total PageRank execution time by 12%.

So far I have tested this on only one part of the cuGraph code, but there are many other places where the same optimization applies, so it could have a similar impact across a range of graph algorithms.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

harrism commented 1 year ago

@seunghwak I'm not working on this yet, but I'm thinking about it. How many priorities do you use in your example?