Open rlratzel opened 1 year ago
Hi @rlratzel, do you have any idea when this might be fixed?
Some background: I am using dask cugraph to perform Louvain community detection on large graphs (300M to 1B edges). This was originally implemented using 22.04 and worked well on a multi-GPU system with 64GB of GPU memory. After upgrading to 22.12 we started getting OOM errors for graphs that previously succeeded. The OOM usually happens in compute_renumber_edge_list() when creating the NumberMap.
After some investigation, I believe that this is due to the graph symmetrization added here - https://github.com/rapidsai/cugraph/pull/2247 - which causes growth in the data frames. Shortly after this change was merged, another PR (https://github.com/rapidsai/cugraph/pull/2394) forced renumbering and storing the plc graph in simpleDistributedGraphImpl.__from_edgelist.
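To illustrate why symmetrization grows the edge list, here is a minimal plain-Python sketch. It is not cugraph's implementation: the `symmetrize` helper and the tuple-based edge list are hypothetical stand-ins for the dask_cudf data frames, chosen only so the example runs anywhere.

```python
def symmetrize(edges):
    """Return edges plus their reverses, with duplicate (src, dst) pairs removed.

    `edges` is a list of (src, dst, weight) tuples. This is a hypothetical
    stand-in for the data-frame operation, not cugraph's actual code.
    """
    seen = set()
    result = []
    for src, dst, weight in edges:
        # Emit the edge in both directions, skipping pairs already present.
        for u, v in ((src, dst), (dst, src)):
            if (u, v) not in seen:
                seen.add((u, v))
                result.append((u, v, weight))
    return result


edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 0, 3.0)]
sym = symmetrize(edges)
print(len(edges), len(sym))  # 3 6
```

When no edge already has its reverse in the input, the result has twice as many rows. A data-frame version may also hold the input, the reversed copy, and the combined result at the same time, so peak usage can be higher than the final size suggests.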
The following traceback shows where the OOM is being produced.
  File "/louvain/louvain.py", line 37, in detect_communities
    graph.from_dask_cudf_edgelist(edges_df, source="from", destination="to", edge_attr="weight", renumber=True)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_classes.py", line 299, in from_dask_cudf_edgelist
    self._Impl._simpleDistributedGraphImpl__from_edgelist(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 236, in __from_edgelist
    self.compute_renumber_edge_list(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 937, in compute_renumber_edge_list
    ) = NumberMap.renumber_and_segment(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/number_map.py", line 595, in renumber_and_segment
    data = get_distributed_data(df)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/dask/common/input_utils.py", line 244, in get_distributed_data
    data = DistributedDataHandler.create(data=ddf)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/dask/common/input_utils.py", line 106, in create
    gpu_futures = client.sync(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 339, in sync
    return sync(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 379, in f
    result = yield future
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
RuntimeError: coroutine raised StopIteration
Who: cugraph users
What: improve memory efficiency when creating undirected graphs to reduce OOM errors
Why: undirected graphs require a symmetrization step that is currently prohibitively memory-inefficient, causing OOM errors on inputs that should fit. As a result, users cannot use cugraph in many cases where the input size and GPU memory size suggest they should be able to.
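As a rough back-of-envelope check of "input size versus GPU memory size", the sketch below estimates the raw edge-list footprint before and after symmetrization. The column widths are assumptions (8-byte vertex IDs and 8-byte weights), not values taken from this issue, and `edge_list_bytes` is a hypothetical helper; real usage also includes renumbering maps and temporaries.

```python
GIB = 1024 ** 3


def edge_list_bytes(num_edges, id_bytes=8, weight_bytes=8):
    """Bytes for a (source, destination, weight) edge list with fixed-width columns.

    The default widths are assumptions (int64 IDs, float64 weights).
    """
    return num_edges * (2 * id_bytes + weight_bytes)


for num_edges in (300_000_000, 1_000_000_000):
    original = edge_list_bytes(num_edges)
    # Worst case: no edge already has its reverse, so the row count doubles.
    symmetrized = edge_list_bytes(2 * num_edges)
    print(f"{num_edges:,} edges: {original / GIB:.1f} GiB -> {symmetrized / GIB:.1f} GiB")
```

Under these assumptions a 1B-edge input is roughly 22 GiB raw and roughly 45 GiB once symmetrized, so any extra intermediate copies leave little headroom in 64GB of total GPU memory.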