Closed siddharthamantrala closed 3 days ago
Hi @siddharthamantrala
Thanks for submitting your question. Just taking a look: what dataset are you using to create your df object? Also, which installation method did you use for cugraph?
Hi @nv-rliu ,
It's a private dataset in h5ad format. The rapids_singlecell library has a function, rsc.get.anndata_to_GPU(h5ad file), to move the data to the GPU. The object passed to rsc.tl.leiden has the adjacency information computed using nearest neighbors. (Even that step takes around 6 hrs to compute, but that is a different issue; for now I am accelerating it using CAGRA, which takes ~15 mins.) I installed cugraph=24.08.00 from https://docs.rapids.ai/install .
rsc.tl.leiden()
def leiden(
    adata: AnnData,
    resolution: float = 1.0,
    *,
    random_state: int | None = 0,
    restrict_to: tuple[str, Sequence[str]] | None = None,
    key_added: str = "leiden",
    adjacency: sparse.spmatrix | None = None,
    n_iterations: int = 100,
    use_weights: bool = True,
    neighbors_key: str | None = None,
    obsp: str | None = None,
    copy: bool = False,
) -> AnnData | None:
    """
    Performs Leiden clustering using cuGraph, which implements the method described in:
    Traag, V.A., Waltman, L., & van Eck, N.J. (2019). From Louvain to
    Leiden: guaranteeing well-connected communities. Sci. Rep., 9(1), 5233.
    DOI: 10.1038/s41598-019-41695-z

    Parameters
    ----------
    adata :
        annData object
    resolution
        A parameter value controlling the coarseness of the clustering.
        (called gamma in the modularity formula). Higher values lead to
        more clusters.
    random_state
        Change the initialization of the optimization. Defaults to 0.
    restrict_to
        Restrict the clustering to the categories within the key for
        sample annotation, tuple needs to contain
        `(obs_key, list_of_categories)`.
    key_added
        `adata.obs` key under which to add the cluster labels.
    adjacency
        Sparse adjacency matrix of the graph, defaults to neighbors
        connectivities.
    n_iterations
        This controls the maximum number of levels/iterations of the
        Leiden algorithm. When specified, the algorithm will terminate
        after no more than the specified number of iterations. No error
        occurs when the algorithm terminates early in this manner.
    use_weights
        If `True`, edge weights from the graph are used in the
        computation (placing more emphasis on stronger edges).
    neighbors_key
        If not specified, `leiden` looks at `.obsp['connectivities']`
        for neighbors connectivities. If specified, `leiden` looks at
        `.obsp[.uns[neighbors_key]['connectivities_key']]` for neighbors
        connectivities.
    obsp
        Use .obsp[obsp] as adjacency. You can't specify both
        `obsp` and `neighbors_key` at the same time.
    copy
        Whether to copy `adata` or modify it in place.
    """
    # Adjacency graph
    from cugraph import leiden as culeiden

    adata = adata.copy() if copy else adata

    if adjacency is None:
        adjacency = _choose_graph(adata, obsp, neighbors_key)
    if restrict_to is not None:
        restrict_key, restrict_categories = restrict_to
        adjacency, restrict_indices = restrict_adjacency(
            adata=adata,
            restrict_key=restrict_key,
            restrict_categories=restrict_categories,
            adjacency=adjacency,
        )

    g = _create_graph(adjacency, use_weights)
    # Cluster
    leiden_parts, _ = culeiden(
        g,
        resolution=resolution,
        random_state=random_state,
        max_iter=n_iterations,
    )

    # Format output
    groups = (
        leiden_parts.to_pandas().sort_values("vertex")[["partition"]].to_numpy().ravel()
    )
    if restrict_to is not None:
        if key_added == "leiden":
            key_added += "_R"
        groups = rename_groups(
            adata,
            key_added=key_added,
            restrict_key=restrict_key,
            restrict_categories=restrict_categories,
            restrict_indices=restrict_indices,
            groups=groups,
        )
    adata.obs[key_added] = pd.Categorical(
        values=groups.astype("U"),
        categories=natsorted(map(str, np.unique(groups))),
    )
    # store information on the clustering parameters
    adata.uns["leiden"] = {}
    adata.uns["leiden"]["params"] = {
        "resolution": resolution,
        "random_state": random_state,
        "n_iterations": n_iterations,
    }
    return adata if copy else None
So if I'm understanding this correctly, you have an edgelist (cudf.DataFrame) created from your data, but when you try to create a cugraph.Graph object by calling g.from_cudf_edgelist(df, source="source", destination="destination", weight="weights"), it takes a very long time. This is before you even get a chance to call cugraph.leiden.
What is the size of the edge-list that you have? Can you share the output from running this in _create_graph():
type(df)
df.info(verbose=True)
Thank you for sharing. Please give me just a moment
Hi @siddharthamantrala ,
Takes forever to execute the below line g.from_cudf_edgelist(df, source="source", destination="destination", weight="weights")
I tried running a simple example to see if I could reproduce the problem, but I saw a reasonable runtime for the above line. I've attached an example which generates an edgelist and uses it to populate a graph. I'm using a smaller GPU than you are, so my example graph is only 7.8M nodes and 164.1M edges, but that only took 10 seconds to populate using G.from_cudf_edgelist().
Can you try this example and let us know if you see unexpectedly slow performance?
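To report comparable numbers when trying the example, the graph-construction call can be wrapped in a wall-clock timer. This is only a minimal sketch: `timed` is a hypothetical helper (not part of cuGraph or of the attached script), and the cuGraph call is left as a comment because it needs a GPU environment.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} s")


# On a GPU machine, the call from this thread would be timed like this:
# with timed("from_cudf_edgelist"):
#     G.from_cudf_edgelist(df, source="source", destination="destination", weight="weights")

# CPU-only placeholder so the sketch runs anywhere.
timings = {}
with timed("placeholder work", timings):
    sum(range(1_000_000))
```

Recording the elapsed time in a dict makes it easy to paste several measurements (for example, one per scale factor) into a reply.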
Hi @rlratzel ,
Thanks a lot for sharing the code and looking into the issue. I tried the code you shared. On the above-mentioned GPU specs, your example graph took 4 seconds to populate using G.from_cudf_edgelist(). When I increase the scale factor to 25, I run into memory issues. If I use the RAPIDS Memory Manager (rmm), I am able to increase the scale factor to 25 and populate the graph in ~1 min. With rmm, your example graph takes ~30 seconds. When I increase the scale to 26, it takes forever to populate the graph. I am attaching the script with the rmm-related changes (commented out). Please take a look and suggest how I can move forward.
Thanks @siddharthamantrala for the updated script. I ran it on my workstation using scale 26 with RMM managed memory enabled and I'm seeing the same behavior you are. I'm going to debug further and I'll update this issue when I find out more.
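For readers unfamiliar with the RMM step discussed here: enabling managed (unified) memory lets allocations exceed physical GPU memory by paging to host memory, usually at some cost in speed. The following is a minimal sketch, not the script attached to this thread; `enable_managed_memory` is a hypothetical helper, and it should run before any cudf or cugraph allocations are made.

```python
def enable_managed_memory() -> bool:
    """Switch RMM to CUDA managed memory; return False if rmm is not installed."""
    try:
        import rmm  # imported lazily so the sketch also loads without RAPIDS
    except ImportError:
        return False
    # Re-create the default memory resource, backed by managed (unified) memory.
    rmm.reinitialize(managed_memory=True)
    return True


managed = enable_managed_memory()
print("RMM managed memory enabled:", managed)
```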
Hi @siddharthamantrala , I think the problem here can be traced back to a cudf issue that has recently been resolved.
I'm able to reproduce your problem with an older version of cudf, but the problem is resolved and I'm able to run your script to completion in a few minutes on scale 26 when I upgrade cudf to the latest version from the rapidsai-nightly channel.
Can you check your cudf version (run "conda list cudf" if you're using conda)? For me, here's what I saw:
The broken version of cudf I was using:
cudf 24.10.00a196 cuda11_py310_240816_ge690d9d25b_196 rapidsai-nightly
The updated version of cudf that works for me:
cudf 24.10.00a292 cuda11_py310_240905_gad1369d2d6_292 rapidsai-nightly
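As a quick way to compare a local "conda list cudf" line against the two builds quoted above, the nightly version string can be parsed into a sortable tuple. This is a sketch under assumptions: the helper names are hypothetical, the parser only handles nightly strings of the form shown here (for example 24.10.00a292), and the thread does not state the exact build in which the fix landed, only that a196 was broken and a292 worked.

```python
import re

# Matches nightly alpha versions such as "24.10.00a292".
_NIGHTLY = re.compile(r"^(\d+)\.(\d+)\.(\d+)a(\d+)$")

BROKEN = "cudf 24.10.00a196 cuda11_py310_240816_ge690d9d25b_196 rapidsai-nightly"
WORKING = "cudf 24.10.00a292 cuda11_py310_240905_gad1369d2d6_292 rapidsai-nightly"


def version_from_conda_line(line: str) -> str:
    """Return the version column (name, version, build, channel) of a conda list line."""
    return line.split()[1]


def parse_nightly(version: str) -> tuple[int, ...]:
    """Turn '24.10.00a292' into (24, 10, 0, 292) so versions compare numerically."""
    match = _NIGHTLY.match(version)
    if match is None:
        raise ValueError(f"not a nightly alpha version: {version!r}")
    return tuple(int(part) for part in match.groups())


def at_least_known_good(line: str, known_good: str = WORKING) -> bool:
    """True if the line's nightly build is at or after the build reported as working."""
    current = parse_nightly(version_from_conda_line(line))
    return current >= parse_nightly(version_from_conda_line(known_good))


print("broken build passes check:", at_least_known_good(BROKEN))
print("working build passes check:", at_least_known_good(WORKING))
```

A stable release string such as 24.08.00 is deliberately rejected by `parse_nightly`, since nothing in the thread says which stable release contains the fix.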
Hi @siddharthamantrala
Just following up here. Were you able to resolve the issue and create the Graph? I'll go ahead and mark this as resolved for now but LMK!
Hi @rlratzel ,
Thanks a lot for letting me know and following up about the updated cudf version. Could you tell me whether the fix is included in the latest cudf nightly available from the https://docs.rapids.ai/install page? I reported this issue with the Stable 24.08 version, which I installed using pip and am still running. Do I have to update the other RAPIDS packages too?
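In a pip environment, where "conda list" is not available, the installed RAPIDS package versions can be listed side by side with the standard library. This is a minimal sketch; the distribution names in `CANDIDATES` are assumptions about how the pip wheels are named (with a CUDA suffix), so adjust them to match the install command that was actually used.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; pip wheels typically carry a CUDA suffix,
# while conda packages use the plain names.
CANDIDATES = [
    "cudf", "cudf-cu11", "cudf-cu12",
    "cugraph", "cugraph-cu11", "cugraph-cu12",
    "rmm", "rmm-cu11", "rmm-cu12",
]


def installed_versions(names):
    """Return {distribution name: version} for the names that are installed."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            continue  # not installed under this name
    return found


for name, ver in installed_versions(CANDIDATES).items():
    print(f"{name}=={ver}")
```

Including this output in a reply makes it easier for maintainers to see whether cudf, cugraph, and rmm come from the same release.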
What is your question?
Hello @benfred @akasper @mattf,
I am trying to run Leiden clustering on ~40 M cells. During the run, the GPU appears idle in terms of power usage, and the Leiden clustering takes forever. The time is spent executing the code below. Could you please tell me how I can sort out the issue?
rsc.tl.leiden(adatafilt)
-> def leiden(args):
-> g = _create_graph(adjacency, use_weights)
-> def _create_graph(adjacency, use_weights=True): from cugraph import Graph
The line below takes forever to execute:
g.from_cudf_edgelist(df, source="source", destination="destination", weight="weights")
Although I also posted this in the rapids_singlecell (rsc) repository, cuGraph seems to be the step that takes too long, so your input would be great.
Best, Sid