sknetwork-team / scikit-network

Graph Algorithms
Other
607 stars 66 forks source link

LouvainHierarchy returns more clusters than requested with cut_straight #455

Closed gokceneraslan closed 4 years ago

gokceneraslan commented 4 years ago

Description

When I fit a LouvainHierarchy and then try to get a clustering with e.g. 50 clusters (cut_straight(dendrogram, n_clusters=50)), I am getting 81 clusters instead. Paris() works perfectly fine on the same dataset. See reproducible example below.

What I Did

import numpy as np
import scanpy as sc  # required for the dataset, pip install scanpy
from sknetwork.hierarchy import LouvainHierarchy, cut_straight

ad = sc.datasets.paul15()
sc.pp.log1p(ad)
adjacency = np.corrcoef(ad.X, rowvar=False) 
adjacency[adjacency<0] = 0

louvain_hierarchy = LouvainHierarchy()
dendrogram = louvain_hierarchy.fit_transform(adjacency)
n_clusters = 50

labels = cut_straight(dendrogram, n_clusters=n_clusters)
n_clusters_observed = len(np.unique(labels))

assert n_clusters_observed == n_clusters, f'{n_clusters} clusters requested but {n_clusters_observed} clusters returned'

Output:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-12-a465df4cc71e> in <module>
     15 n_clusters_observed = len(np.unique(labels))
     16 
---> 17 assert n_clusters_observed == n_clusters, f'{n_clusters} clusters requested but {n_clusters_observed} clusters returned'

AssertionError: 50 clusters requested but 81 clusters returned
tbonald commented 4 years ago

Thanks for your feedback.

This is due to the fact that LouvainHierarchy returns a dendrogram with equal heights. When you do a straight cut on such a dendrogram, you may get a larger number of clusters than expected. I've updated the documentation in the develop branch.

Example

scikit-network version: 0.20.0 Python version: 3.8.3 Operating System: macOS 10.15

import numpy as np
from sknetwork.data import load_netset
from sknetwork.clustering import Louvain
from sknetwork.hierarchy import LouvainHierarchy, cut_straight

graph = load_netset("openflights")
adjacency = graph.adjacency

Clustering

louvain = Louvain()
labels = louvain.fit_transform(adjacency)
len(set(labels))

Output: 35

Hierarchical clustering

louvain_hierarchy = LouvainHierarchy()
dendrogram = louvain_hierarchy.fit_transform(adjacency)
labels = cut_straight(dendrogram, n_clusters=10)
len(set(labels))

Output: 35 (depth 1)

labels = cut_straight(dendrogram, n_clusters=35) len(set(labels))

Output: 35 (depth 1)

labels = cut_straight(dendrogram, n_clusters=36)
len(set(labels))

Output: 152 (depth 2)

np.unique(dendrogram[:, 2], return_counts=True)

Output: (array([0., 1., 2., 3.]), array([2616, 329, 117, 34]))

gokceneraslan commented 3 years ago

Thanks!