rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.63k stars 293 forks source link

[FEA]: nx-cugraph should intercept built-in constructors like `from_pandas_edgelist` if `NETWORKX_BACKEND_PRIORITY=cugraph` #4456

Open beckernick opened 1 month ago

beckernick commented 1 month ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

When we use nx-cugraph, we currently need to create the NetworkX graph on the CPU regardless of whether every algorithm we intend to use is supported by the cuGraph backend. As a result, we pay a non-trivial performance penalty converting between CPU and GPU graphs.

The new caching mechanism configurable via CACHE_CONVERTED_GRAPH=True was designed to address this problem, making it possible to only pay this cost once per graph if you're going to run multiple algorithms.

But it would be great to avoid this cost in the first place by dispatching on the graph construction operators in addition to the algorithms. In the example below, we spend significant time in from_pandas_edgelist and _convert_graph (the latter of which is only a one-time cost if we use caching).

If I've already committed to using the cuGraph backend as the top priority backend, I'd ideally just create the graph on the GPU and only pay the CPU/GPU conversion cost if I need to fallback to the CPU.

# !wget https://data.rapids.ai/cugraph/datasets/cit-Patents.csv

%env NETWORKX_BACKEND_PRIORITY=cugraph

import pandas as pd
import networkx as nx

df = pd.read_csv("cit-Patents.csv", sep=" ", names=["src", "dst"], dtype="int32")
%%snakeviz

G = nx.from_pandas_edgelist(df.head(1000000), source="src", target="dst")
pr = nx.pagerank(G, alpha=0.9)
Screenshot 2024-05-31 at 10 03 40 AM

But cuGraph supports from_pandas_edgelist and it's much faster (100ms vs 8s in this case):

%timeit -n3 -r3 G_gpu = cugraph.from_pandas_edgelist(df.head(1000000), source="src", destination="dst")
71.2 ms ± 11.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)

Describe your ideal solution

The following code should dispatch to the cuGraph backend for from_pandas_edgelist in addition to pagerank.

# !wget https://data.rapids.ai/cugraph/datasets/cit-Patents.csv

%env NETWORKX_BACKEND_PRIORITY=cugraph

import pandas as pd
import networkx as nx

df = pd.read_csv("cit-Patents.csv", sep=" ", names=["src", "dst"], dtype="int32")

G = nx.from_pandas_edgelist(df.head(1000000), source="src", target="dst")
pr = nx.pagerank(G, alpha=0.9)

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

beckernick commented 1 month ago

Seems like this would be an expansion of https://github.com/networkx/networkx/pull/6876

eriknw commented 1 month ago

Thanks for raising this issue! I think this behavior ought to be controllable via a different environment variable and/or config option... and I do think this is important.

Today, you can add backend="cugraph" to e.g. nx.from_pandas_edgelist, but this is not longer "zero code change", so we'd like to make this story better.

eriknw commented 2 weeks ago

I have begun to explore a solution to this in https://github.com/networkx/networkx/pull/7502