sknetwork-team / scikit-network

Graph Algorithms
Other
602 stars 67 forks source link

Confusing results from shuffle_nodes in Louvain algorithm #521

Closed robin-na closed 2 years ago

robin-na commented 2 years ago

Description

Hello Scikit-Network Team!

I've been running some Louvain algorithms to hard cluster my networks (node sizes between 1k and 2k) and received some unexpected results when I tried hierarchical Louvain and the plain Louvain.

(1) I get way too many clusters when I shuffle nodes (both for heirarchical Louvain and plain Louvain) My understanding that the default algorithm produces stable results (i.e., same cluster assignment in every iteration) and the only way to randomize for optimization is to set shuffle_nodes as True.

When I set shuffle_nodes as False with resolution parameter 1, Louvain returns 6 clusters. When I set shuffle_nodes as True, however, it returns more than 35 and sometimes more than 50 clusters with the same resolution parameter!

I would appreciate if you could check if there's an issue with the algorithm. If I am supposed to expect such difference, a brief explanation would help a lot.

My guess is that there's a possibility that when I shuffle nodes, the modularity sets into 'potts'. This is a wild guess based on my observation that that's the only way I can that many clusters.

(2) Cluster assignments from non-shuffled Louvain and Hierarchical Louvain are different

So now I'm dealing with non-shuffled cases. I expected the plain Louvain and the first-level clusters in Hierarchical Louvain to return the same assignment (given they're not randomized) but they don't.

(3) Non-shuffled Louvain methods are different Gephi's modularity clustering

For your Louvain I get 6 clusters. When I run the Gephi algorithm without randomizing, it returns 5 and assignments are obviously different. Sk network returns 6 clusters both in Newman and Dugue methods. I wonder what may be causing this difference?

The most urgent issue is (1). In general, I would like to understand more on how the "shuffle_nodes" argument works.

Many thanks in advance!

What I Did

(1) For the first issue I ran

from sknetwork.clustering import Louvain

LC_noshuffle = clustering.Louvain(shuffle_nodes=False)
LC_shuffle = clustering.Louvain(shuffle_nodes=True)

print("Number of clusters without shuffling Louvain")
print(len(set(LC_noshuffle.fit_transform(G_adj)))) #G_adj is the adjacency matrix

print("-----")

print("Number of clusters while shuffling Louvain (25 iterations)")
n_clusters = list()
for i in range(25):
    n_clusters.append(len(set(LC_shuffle.fit_transform(G_adj))))
print(n_clusters)

The output returns

Number of clusters without shuffling Louvain
6
-----
Number of clusters while shuffling Louvain (25 iterations)
[44, 37, 41, 42, 38, 34, 47, 44, 34, 51, 38, 42, 48, 39, 39, 41, 43, 38, 45, 39, 45, 37, 49, 47, 44]

(2) For the second issue I ran

from sknetwork.hierarchy import LouvainHierarchy
from sknetwork.clustering import Louvain
from scipy.cluster import hierarchy

LC = clustering.Louvain(shuffle_nodes=False)
LH = LouvainHierarchy(shuffle_nodes=False)

LC_assign = LC.fit_transform(G_adj) # 6 cluster assignment from plain Louvain

LH_assign = hierarchy.fcluster(LH.fit_transform(G_adj), t=6 , criterion='maxclust')
# 6 cluster assignment from Hierarchical Louvain when the maximum number of clusters is restricted to 6

print("Do they produce the same number of clusters?")
print(len(set(LC_assign)) == len(set(LH_assign))  ) 

print("-------------")

print("Do they produce the assignment?")
print(list(LC_assign) == list(LH_assign) )

The output is

Do they produce the same number of clusters?
True
-------------
Do they produce the same assignment?
False
tbonald commented 2 years ago

Many thanks for your feedback. The first issue has been fixed (see the develop branch); the new version of Louvain will be available in the next release. Meanwhile, you can use Louvain without the shuffling_nodes option and do the shuffling manually through:

louvain = Louvain()
index = np.arange(adjacency.shape[0])
np.random.shuffle(index)
adjacency = adjacency[index][:, index]
labels = louvain.fit_transform(adajcency)

I'm checking the other issues.

tbonald commented 2 years ago

As for your second point, there is no issue here. The fact that the labels differ simply comes from the fact that the clusterings are not aligned (i.e., the cluster indices do not match). The function fcluster has its own algorithm to assign the labels from the hierarchy, and there is not reason that this is the same as the original assignment done by Louvain.

tbonald commented 2 years ago

Finally, the difference with Gephi modularity might come from the tolerance parameters tol_optimization and tol_aggregation. Setting these parameters to 0 might decrease the number of clusters (but make the algorithm slower for large graphs).

tbonald commented 2 years ago

Unless you have other questions or comments, I will close this issue.