Implementation of size-constrained clustering. Size-constrained clustering can be treated as an optimization problem; details can be found in the reference papers.
Requirements: Python >= 3.6, numpy >= 1.13, Cython >= 0.29
pip install size-constrained-clustering
# setup
from size_constrained_clustering import fcm, equal, da, minmax, shrinkage
# Euclidean distance is used by default, but other distance functions can be selected,
# e.g. haversine distance (haversine_distances expects [lat, lon] coordinates in radians)
from sklearn.metrics.pairwise import haversine_distances
from sklearn.datasets import make_blobs
import numpy as np
Fuzzy C-means
n_samples = 2000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)
model = fcm.FCM(n_clusters)
# use other distance function: e.g. haversine distance
# model = fcm.FCM(n_clusters, distance_func=haversine_distances)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
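A quick sanity check is to count how many points fall in each cluster and, optionally, plot the result. A minimal sketch, assuming labels_ is an array of non-negative integer assignments and that matplotlib is installed:

# number of points assigned to each cluster
print(np.bincount(labels))
# optional visualization: points coloured by cluster, centers marked with crosses
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=labels, s=5)
plt.scatter(np.asarray(centers)[:, 0], np.asarray(centers)[:, 1], c="red", marker="x")
plt.show()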
Equal Size Constraint
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# option 1: solve with a minimum cost flow framework
model = equal.SameSizeKMeansMinCostFlow(n_clusters)
# option 2: solve with a heuristic method (use either one, not both)
model = equal.SameSizeKMeansHeuristics(n_clusters)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
Cluster sizes in the figure above: 667, 667 and 666.
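The balance can be verified directly from the labels. A minimal sketch, assuming labels_ holds non-negative integer assignments:

sizes = np.bincount(labels, minlength=n_clusters)
print(sizes)  # e.g. [667 667 666]; the ordering of clusters may differ
assert sizes.max() - sizes.min() <= 1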
Minimum and Maximum Size Constraint
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
Cluster sizes in the figure above: 753, 645 and 602.
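The size bounds can be checked the same way. A minimal sketch, assuming labels_ holds non-negative integer assignments:

sizes = np.bincount(labels, minlength=n_clusters)
print(sizes)  # e.g. [753 645 602]; cluster ordering may differ
assert sizes.min() >= 400 and sizes.max() <= 800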
Deterministic Annealing
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# distribution gives the target fraction of samples assigned to each cluster
model = da.DeterministicAnnealing(n_clusters, distribution=[0.1, 0.6, 0.3])
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
Cluster sizes in the figure above: 1200, 600 and 200, i.e. proportions [0.6, 0.3, 0.1] of the 2000 samples; the cluster order may differ from the order of the input distribution.
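The realized proportions can be compared with the requested distribution. A minimal sketch, assuming labels_ holds non-negative integer assignments:

proportions = np.bincount(labels, minlength=n_clusters) / n_samples
print(np.sort(proportions))  # roughly [0.1, 0.3, 0.6]; which cluster gets which share may differ from the input order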
Shrinkage Clustering
Note: a result satisfying the size constraint might not be available.
n_samples = 1000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0, centers=centers, shuffle=False, random_state=42)
model = shrinkage.Shrinkage(n_clusters, size_min=100)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
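Because a result honouring size_min may not exist, it is worth inspecting the cluster sizes before using the output. A minimal sketch, assuming labels_ holds non-negative integer assignments:

sizes = np.bincount(labels, minlength=n_clusters)
print(sizes)
if sizes.min() < 100:
    print("size_min constraint not satisfied; treat this clustering with care")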
Copyright (c) 2020 Jing Wang. Released under the MIT License.
Third-party copyright in this distribution (e.g. the SimpleMinCostFlow implementation) is noted where applicable.