vtraag / leidenalg

Implementation of the Leiden algorithm for various quality functions to be used with igraph in Python.
GNU General Public License v3.0
575 stars 77 forks source link

Clustering results are not always the same given the same resolution parameter #107

Closed prakritipaul closed 1 year ago

prakritipaul commented 2 years ago

Hello, I am using the leiden algorithm implementation in iGraph, and noticed that when I repeat clustering using the same resolution parameter, I get different results. Here is some small debugging code I wrote to find this. I tracked the number of clusters post-clustering at each step.

# Do the initial clustering
clustering = do_leiden_clustering(G, resolution_parameter=initial_resolution, n_iterations=n_iterations)
best_num_clusters, best_modularity, best_membership = len(clustering), clustering.modularity, clustering.membership
print(f"best_num_clusters = {best_num_clusters}")
num_clusters = best_num_clusters
i = 0
# Continue until there is a difference between current and previous clustering results.
while num_clusters == best_num_clusters:
    clustering = do_leiden_clustering(G, resolution_parameter=initial_resolution, n_iterations=n_iterations)
    num_clusters, best_modularity, best_membership = len(clustering), clustering.modularity, clustering.membership
    print(f"{i} num_clusters = {num_clusters}")

    if num_clusters != best_num_clusters:
        print(f"{i} End because num_clusters = {num_clusters} != {best_num_clusters}!")
        break
    i += 1
# Results
best_num_clusters = 2
0 num_clusters = 2
1 num_clusters = 2
2 num_clusters = 2
3 num_clusters = 2
4 num_clusters = 2
5 num_clusters = 1
5 End because num_clusters = 1 != 2!

Because num_clusters changes at some point, this implies that the clustering results are not deterministic.

Why is this the case?

Thanks! PC

vtraag commented 1 year ago

Sorry for the late reply! I only got back to this after my summer holidays :tent: :sun_with_face: :sunglasses: and other travels :flight_departure:.

The Leiden algorithm is stochastic, meaning that every run, it may return different results. These differences depend on the random numbers being generated while running the algorithm. These differences may be relatively minor if there is a single relatively clear community structure, but the differences will be more substantial if there are multiple possible community structures.

If you want to ensure that you always get identical results, you need to set the seed for the random number generator. In igraph you can set the seed like this:

import random
random.seed(0)

By default, igraph used the random number generator from random, and so setting the seed in the random package leads to identical random numbers being generated also by igraph. See some more information here: https://igraph.org/python/versions/latest/api/index.html#set_random_number_generator

If you use the leidenalg package, you can set the seed directly in the package itself. For example, using the seed argument in the find_partition function, or if you are using an Optimiser object to detect communities you can use the set_rng_seed function.

Note that this issue tracker is only for the leidenalg package. The issue tracker for igraph can be found here, and if your question is really about igraph, not about leidenalg, you are more likely to get a response there (since I'm the only one following the leidenalg issue tracker). Alternatively, you could post your question on the igraph user forum, if it is a question rather than a bug report.

vtraag commented 1 year ago

I'm assuming this explanation clarifies the issue, so I'll close it for now. If you do have other questions, let me know though. If necessary, we can always re-open the issue.

prakritipaul commented 1 year ago

I am sorry for my late response, but yes, thank you! This was very helpful.