rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.77k stars 304 forks source link

[BUG]: Leiden clustering is not affected by random_state #4367

Closed mbruhns closed 6 months ago

mbruhns commented 7 months ago

Version

24.06.00a42

Which installation method(s) does this occur on?

Conda

Describe the bug.

When changing the parameter random_state for cugraph.leiden the detected clusters are not affected. I am not too sure about the order of magnitude of this effect, but I expected this to somewhat change the results.

Minimum reproducible example

import cuml
import cudf
import cugraph
import cupy as cp
import numpy as np
from itertools import combinations

n_samples = 1000
n_features = 20
centers = 5
n_neighbors = 10
metric = "cosine"
resolution = 1.0
cluster_runs = 10

X, _ = cuml.make_blobs(n_samples=n_samples,
                       n_features=n_features,
                       centers=centers)

# Use n_neighbors + 1 to account for self index
model = cuml.NearestNeighbors(n_neighbors=n_neighbors+1, metric=metric, algorithm="brute")
model.fit(X)
knn_dist, knn_indices = model.kneighbors(X)

# Remove self index
knn_dist = knn_dist[:,1:]
knn_indices = knn_indices[:,1:]

source_array = np.repeat(np.arange(knn_indices.shape[0]), knn_indices.shape[1])
destination_array = knn_indices.ravel()
weight_array = knn_dist.ravel()

adj_df = cudf.DataFrame(columns=["source", "destination", "weight"])

adj_df["source"] = source_array
adj_df["destination"] = destination_array
adj_df["weight"] = weight_array

G = cugraph.Graph()
G.from_cudf_edgelist(input_df=adj_df, source="source", destination="destination", weight="weight")

partition_lst = []
for i in range(cluster_runs):
    parts, _ = cugraph.leiden(G, resolution=resolution, random_state=i)
    partition_lst.append(parts.partition.values)

for a,b in combinations(partition_lst, 2):
    assert (a == b).all()

Relevant log output

No response

Environment details

I could not find that script. If you need this information please let me know how to run it.

Other/Misc.

When I construct the KNN-graph I get the following warning:

/home/ubuntu/miniforge3/envs/rapids/lib/python3.10/site-packages/cugraph/structure/symmetrize.py:93: FutureWarning: Multi is deprecated and the removal of multi edges will no longer be supported from 'symmetrize'. Multi edges will be removed upon creation of graph instance. warnings.warn(

Am I doing something wrong here, or does deprecation warning not effect this?

Code of Conduct

ChuckHastings commented 7 months ago

That FutureWarning is not related to the problem you are experiencing. It is warning of a future change (not sure of the timeline), but doesn't affect future behavior.

We will investigate this issue. We have been exploring several issues related to non-determinism in Leiden, we will need to evaluate what the impact of random_state should be and validate that the implementation is working as expected. I will let you know what we discover, and if something is wrong we will fix it.

ChuckHastings commented 7 months ago

The short answer... this is expected behavior for your test example. Details below.

Looked through the code to evaluate this. Our Leiden implementation was built upon our Louvain implementation. Currently our use of random_state is minimal.

  1. Our Leiden implementation, like our Louvain implementation, executes in parallel. We compute the best Louvain move for all vertices concurrently, and then decide which of the moves best improve modularity. We use a deterministic technique to avoid conflicting moves. The consequence of this is that the first place where randomness is generally inserted (exploring vertices in serial in a random order) isn't really necessary. We do have an option in the C++ code for Louvain (not currently available in Leiden) to introduce randomness in this phase, but it's not currently exposed to the python layer (and not available in Leiden).
  2. The random_state could be used to allow random moves in the refinement phase. That is currently disabled. We plan to run some experiments on whether that adds value with the parallel implementation or not, having things deterministic has been helpful for debugging and validating, and we haven't gotten to those experiments yet.
  3. The random_state is currently only used in the maximal independent set phase of the Leiden algorithm to determine the subsets of vertices we process in each pass. We use 1024 vertices as the base size to get sufficient parallelism in the MIS computation. So your graph is too small to ever notice the randomness that is there. For a larger graph you should expect to see some minimal effects.

If you have a use case for why more randomness would be helpful, please let us know. We're happy to explore exposing options for enabling randomness (1) and (2). It just hasn't been a priority so far.

mbruhns commented 7 months ago

Alright, thank you for clarifying this!

ChuckHastings commented 6 months ago

Going to close this. If you encounter situations where enabling these other random_state uses would be helpful please open a new issue. We do plan on eventually measuring the impact of these and adjusting the implementation accordingly, so also keep an eye on that work.