@rlratzel should have a better answer for your question. Alex Fender has moved on to our cuopt effort and doesn't work on this software anymore.
I'm fuzzy on the performance overheads of the python API - where they exist and if/how you can avoid them. I know at one time we had (and perhaps still have) some lazy computations that occur on the first call to an algorithm. I believe there is a way to avoid those. @rlratzel should be able to clarify.
Expensive validation steps are enabled in the C/C++ layer by passing a parameter called `do_expensive_check`, which is set to `False` by default. My quick glance at the latest python for Leiden indicates there is no mechanism for you to override this. So the only error checks that occur are fast ones (checking that you passed in an edge weights pointer is, I think, the only validation that occurs for the Leiden algorithm).
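A quick way to confirm what your installed version exposes is to inspect the signature of the Python entry point (a minimal sketch; the output varies by RAPIDS release):

```python
import inspect
import cugraph

# Print the parameters cugraph.leiden() accepts in the installed release.
# As of this writing there is no do_expensive_check parameter here; that
# flag lives in the C/C++ layer (and in the lower-level pylibcugraph API).
print(inspect.signature(cugraph.leiden))
```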
As implemented, memory allocation for the result is done inside of Leiden. That allocation does not include initialization; we copy the result into uninitialized memory, so the performance overhead of allocating the result should be minimal. All other memory allocation inside of Leiden is dynamic, based on the progress of the clustering algorithm. If you configure RMM to use the pool allocator, then memory allocations should be pretty fast. Perhaps @rlratzel can clarify how to do that from python.
Hi @wolfram77, I don't know if this is acceptable, but I think the best way to benchmark only the algorithm implementation and eliminate any additional allocations/conversions/input checks done in the cugraph python library would be to benchmark Leiden from the C++ library in C++. Because the cugraph python library calls the libcugraph C++ implementation, you'd be benchmarking as close to the algorithm implementation as possible (without modifying C++ source code to isolate further beyond the API).
If C++ isn't an option, you could benchmark Leiden from our lower-level python library (`pylibcugraph.leiden`). The cugraph python library wraps pylibcugraph and adds various conveniences and additional checks which you'd want to avoid in the benchmark you're describing, so `pylibcugraph.leiden` might be the next best function to benchmark after C++.
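A minimal sketch of what that could look like follows. The exact `SGGraph` and `leiden` signatures vary across RAPIDS releases; `random_state`, `theta`, and `weight_array` here are assumptions to verify against `help(pylibcugraph.leiden)` for your installed version:

```python
import time
import cupy as cp
import pylibcugraph as plc

handle = plc.ResourceHandle()
props = plc.GraphProperties(is_symmetric=True, is_multigraph=False)

# A tiny triangle graph, stored symmetrically (both edge directions).
srcs = cp.asarray([0, 1, 2, 1, 2, 0], dtype=cp.int32)
dsts = cp.asarray([1, 2, 0, 0, 1, 2], dtype=cp.int32)
wgts = cp.asarray([1.0] * 6, dtype=cp.float32)

# Graph construction (and its validation) stays outside the timed region.
G = plc.SGGraph(handle, props, srcs, dsts, weight_array=wgts,
                store_transposed=False, renumber=False,
                do_expensive_check=False)

t0 = time.perf_counter()
vertices, clusters, modularity = plc.leiden(
    resource_handle=handle, random_state=42, graph=G,
    max_level=100, resolution=1.0, theta=1.0,
    do_expensive_check=False)  # skip the expensive validation path
print(f"leiden took {time.perf_counter() - t0:.4f}s, modularity={modularity}")
```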
Finally, configuring RMM to use pool allocation might also be something to consider, as @ChuckHastings mentioned. You can read about how to do that from python here.
Thanks @ChuckHastings and @rlratzel
As suggested, I configured RMM to use pool allocation (code below). This seems to help a lot.
```python
import rmm

# Create a pool that eagerly reserves 64 GiB (2**36 bytes) up front, so
# subsequent device allocations are served from the pool instead of
# hitting cudaMalloc each time.
pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource(), initial_pool_size=2**36)
rmm.mr.set_current_device_resource(pool)
```
I also discard the runtime of the first call to `cugraph.leiden()`, which also helps.
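A minimal sketch of this measurement pattern (the `time_leiden` helper is hypothetical; it assumes `G` is an already-constructed `cugraph.Graph`):

```python
import time
import cugraph

def time_leiden(G, runs=5):
    """Time cugraph.leiden(), discarding the first (warm-up) call."""
    cugraph.leiden(G)  # warm-up; absorbs one-time lazy initialization
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        parts, modularity = cugraph.leiden(G)  # returns (DataFrame, float)
        times.append(time.perf_counter() - t0)
    return min(times)
```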
Below are the runtimes we observed for cuGraph Leiden (including other comparisons).
cuGraph Leiden fails to run on the arabic-2005, uk-2005, webbase-2001, it-2004, and sk-2005 graphs due to out-of-memory issues. We use an NVIDIA A100 GPU.
What is your question?
Hello @afender, I want to benchmark the runtime of `cugraph.leiden()`. For a benchmark of the algorithm, one should only consider the runtime of the algorithm itself and exclude the runtime of validations and initial memory allocations. A direct measurement around the cugraph call includes all of the above. Is it possible to get an "algorithm runtime" from the call to `cugraph.leiden()`?