rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

cugraph python graph symmetrization is prohibitively inefficient, needs optimization #3500

Open rlratzel opened 1 year ago

rlratzel commented 1 year ago

Who: cugraph users
What: improve memory efficiency when creating undirected graphs, to reduce OOM errors
Why: creating an undirected graph requires a symmetrization step that is currently prohibitively inefficient, causing OOM errors in cases where they seemingly should not occur. As a result, users are often unable to use cugraph even when the input size and GPU memory size suggest they should be able to.
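For readers unfamiliar with the step, here is a minimal pure-Python sketch of the general "append reversed edges, then de-duplicate" approach to symmetrization. It is an illustration only and is not taken from the cugraph source; the function name `symmetrize` and the tuple-based edge list are hypothetical. It returns the intermediate list as well, to show that the intermediate is always twice the input size even when the final result is smaller.

```python
def symmetrize(edges):
    """Illustrative symmetrization of an edge list (NOT cugraph's code).

    edges: iterable of (src, dst, weight) tuples.
    Returns (doubled, symmetric): the intermediate list holding every edge
    plus its reverse, and the final de-duplicated symmetric edge list.
    """
    edges = list(edges)
    # Step 1: append a reversed copy of every edge. The intermediate
    # always has exactly 2x the rows of the input.
    doubled = edges + [(dst, src, w) for src, dst, w in edges]
    # Step 2: drop duplicate (src, dst) pairs, keeping the first weight seen.
    first_seen = {}
    for src, dst, w in doubled:
        first_seen.setdefault((src, dst), w)
    symmetric = [(s, d, w) for (s, d), w in first_seen.items()]
    return doubled, symmetric


if __name__ == "__main__":
    # The 1 -> 2 edge is already present in both directions; 0 -> 1 is not.
    sample = [(0, 1, 1.0), (1, 2, 2.0), (2, 1, 2.0)]
    doubled, symmetric = symmetrize(sample)
    print(len(sample), len(doubled), len(symmetric))
```

In a dataframe-based version of the same idea, the doubled intermediate, and possibly the original alongside it, must fit in GPU memory at once, which is one plausible way for the peak footprint to exceed what the input size alone would suggest.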

kelly-grizzle-sp commented 1 year ago

Hi @rlratzel, do you have any idea when this might be fixed?

Some background: I am using dask cugraph to perform Louvain community detection on large graphs (300M to 1B edges). This was originally implemented using 22.04 and worked well on a multi-GPU system with 64GB of GPU memory. After upgrading to 22.12, we started getting OOM errors for graphs that previously succeeded. The OOM usually happens in compute_renumber_edge_list() when creating the NumberMap.

After some investigation, I believe this is due to the graph symmetrization added in https://github.com/rapidsai/cugraph/pull/2247, which increases the size of the data frames. Shortly after that change was merged, another PR (https://github.com/rapidsai/cugraph/pull/2394) forced renumbering and storing the plc graph in simpleDistributedGraphImpl.__from_edgelist.
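To put rough numbers on that growth, here is a back-of-the-envelope estimate. It rests on assumptions not stated in this thread: 8-byte (int64) vertex IDs, 8-byte (float64) weights, and a symmetrization step that briefly holds a doubled copy of the edge list alongside the original. The helper name `edge_list_bytes` is hypothetical, and real usage also depends on partitioning, renumbering buffers, and allocator overhead.

```python
GIB = 1024 ** 3


def edge_list_bytes(num_edges, bytes_per_vertex_id=8, bytes_per_weight=8):
    """Raw size of a (src, dst, weight) edge list under the ASSUMED dtypes."""
    return num_edges * (2 * bytes_per_vertex_id + bytes_per_weight)


if __name__ == "__main__":
    for num_edges in (300_000_000, 1_000_000_000):
        base = edge_list_bytes(num_edges)
        doubled = 2 * base      # edge list with every reversed edge appended
        peak = base + doubled   # ASSUMPTION: original and doubled copy coexist
        print(
            f"{num_edges:>13,} edges: "
            f"input {base / GIB:.1f} GiB, "
            f"doubled {doubled / GIB:.1f} GiB, "
            f"assumed peak {peak / GIB:.1f} GiB"
        )
```

Under these assumptions the raw input is 24 bytes per edge, so a 1B-edge graph starts at 24 GB before any symmetrization or renumbering copies are made.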

The following traceback shows where the OOM is being produced.

  File "/louvain/louvain.py", line 37, in detect_communities
    graph.from_dask_cudf_edgelist(edges_df, source="from", destination="to", edge_attr="weight", renumber=True)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_classes.py", line 299, in from_dask_cudf_edgelist
    self._Impl._simpleDistributedGraphImpl__from_edgelist(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 236, in __from_edgelist
    self.compute_renumber_edge_list(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 937, in compute_renumber_edge_list
    ) = NumberMap.renumber_and_segment(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/number_map.py", line 595, in renumber_and_segment
    data = get_distributed_data(df)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/dask/common/input_utils.py", line 244, in get_distributed_data
    data = DistributedDataHandler.create(data=ddf)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/dask/common/input_utils.py", line 106, in create
    gpu_futures = client.sync(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 339, in sync
    return sync(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 379, in f
    result = yield future
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
RuntimeError: coroutine raised StopIteration