rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

[BUG]: Incorrect numbering of partitions in Leiden clustering #4368

Closed mbruhns closed 5 months ago

mbruhns commented 6 months ago

Version

24.06.00a42

Which installation method(s) does this occur on?

Conda

Describe the bug.

When clustering with Leiden, the resulting partition labels are not consecutive integers. This happens independently of the value set for renumber in the graph construction.
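
To make the expectation concrete, here is a minimal sketch of the property being described: the distinct partition labels should be exactly 0, 1, ..., k-1. The helper name `labels_are_consecutive` is hypothetical and not part of cuGraph. It works on a plain Python sequence of labels, for example values copied to the host from the `partition` column.

```python
def labels_are_consecutive(labels):
    """Return True if the distinct labels are exactly 0, 1, ..., k-1.

    Hypothetical helper for illustration only; not a cuGraph function.
    """
    distinct = sorted(set(labels))
    return distinct == list(range(len(distinct)))


# Consecutive labels pass the check; labels with gaps do not.
consecutive_ok = labels_are_consecutive([0, 1, 2, 1, 0])
gapped_ok = labels_are_consecutive([0, 5, 9, 5, 0])
```

A check like this only inspects the label values. It says nothing about whether the clustering itself is correct.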

Minimum reproducible example

import cuml
import cudf
import cugraph
import cupy as cp
import numpy as np

n_samples = 1000
n_features = 20
centers = 5
n_neighbors = 10
metric = "cosine"
resolution = 1.0
cluster_runs = 10

X, _ = cuml.make_blobs(n_samples=n_samples,
                       n_features=n_features,
                       centers=centers)

# Use n_neighbors + 1 to account for self index
model = cuml.NearestNeighbors(n_neighbors=n_neighbors+1, metric=metric, algorithm="brute")
model.fit(X)
knn_dist, knn_indices = model.kneighbors(X)

# Remove self index
knn_dist = knn_dist[:,1:]
knn_indices = knn_indices[:,1:]

source_array = np.repeat(np.arange(knn_indices.shape[0]), knn_indices.shape[1])
destination_array = knn_indices.ravel()
weight_array = knn_dist.ravel()

adj_df = cudf.DataFrame(columns=["source", "destination", "weight"])

adj_df["source"] = source_array
adj_df["destination"] = destination_array
adj_df["weight"] = weight_array

G = cugraph.Graph()
G.from_cudf_edgelist(input_df=adj_df, source="source", destination="destination", weight="weight", renumber=True)

parts, _ = cugraph.leiden(G, resolution=resolution)
print(parts.partition.value_counts().sort_index())

Relevant log output

No response

Environment details

I could not find that script. If you need this information please let me know how to run it.

Other/Misc.

No response

ChuckHastings commented 6 months ago

Please test this again. Yesterday we merged a Leiden-related pull request that should be in version 24.06.00a43. It was meant to solve a different problem, but it included a renumbering of the partition labels, so they should now meet the criteria you expect.

mbruhns commented 6 months ago

Just tested it and got the same behaviour with version 24.06.00a43.

ChuckHastings commented 6 months ago

What is your output from the above? When I tested it on my latest checkout I got what I thought was a reasonable answer.

mbruhns commented 6 months ago

For random_state=42 in cuml.make_blobs I consistently get:

partition
0          240
6          193
14         204
19         191
24         172
Name: count, dtype: int64

ChuckHastings commented 6 months ago

OK. I see this behavior now. Thanks for verifying the problem persists after our recent update. I will investigate further.
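
Until a fix is available, one possible stopgap is to compact the labels after the fact. This is only a sketch in plain Python using the labels and counts reported above. The helper name `compact_labels` is hypothetical, and this is not how cuGraph renumbers partitions internally.

```python
from collections import Counter


def compact_labels(labels):
    """Map arbitrary integer labels onto 0..k-1, preserving their sorted order.

    Hypothetical helper for illustration only; not a cuGraph function.
    Returns the relabeled list and the old -> new mapping.
    """
    mapping = {old: new for new, old in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Labels and cluster sizes as reported earlier in this thread.
reported = {0: 240, 6: 193, 14: 204, 19: 191, 24: 172}
labels = [label for label, n in reported.items() for _ in range(n)]

compact, mapping = compact_labels(labels)
counts = Counter(compact)  # cluster sizes are unchanged, only the labels move
```

Cluster membership is untouched; only the label values change. On a GPU dataframe the same idea would presumably be expressed with a dataframe-level operation rather than a Python loop, but that is an assumption and not something shown in this thread.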

ChuckHastings commented 5 months ago

Should be resolved with the linked PR. Hopefully in conda sometime tonight.