milesgranger / gap_statistic

Dynamically get the suggested clusters in the data for unsupervised learning.
The Unlicense

Python implementation of the Gap Statistic



Maintenance mode

I no longer have the interest or time to develop this further; other things have taken priority for some time now. All is not lost, however: I am willing to review and comment on issues and PRs, but I will not complete fixes or feature requests myself.


Purpose

Dynamically identify the suggested number of clusters in a dataset using the gap statistic.


Full example available in a notebook HERE


Install:

Bleeding edge:

pip install git+https://github.com/milesgranger/gap_statistic.git

PyPI:

pip install --upgrade gap-stat

With Rust extension:

pip install --upgrade gap-stat[rust]

Uninstall:

pip uninstall gap-stat

Methodology:

This package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in "Estimating the number of clusters in a data set via the gap statistic" (Tibshirani et al.).
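To make the idea concrete, here is a minimal pure-Python sketch of the gap statistic as described in the paper: Gap(k) is the mean log within-cluster dispersion of uniform reference datasets minus the log dispersion of the real data. This is not this package's implementation. The helper names (`kmeans_dispersion`, `gap_statistic`), the tiny k-means with farthest-first initialisation, and the toy blob data are all assumptions made for illustration.

```python
import math
import random


def kmeans_dispersion(points, k, rng, n_iter=20):
    """Run a tiny Lloyd's k-means; return the sum of squared distances
    from each point to its nearest center."""
    # Farthest-first initialisation (an illustrative choice): random first
    # center, then repeatedly add the point farthest from the chosen centers.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to its cluster mean (keep it if the cluster is empty).
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def gap_statistic(points, k, n_refs=10, seed=0):
    """Gap(k) = mean(log W_k over reference sets) - log W_k of the data."""
    rng = random.Random(seed)
    lows = [min(axis) for axis in zip(*points)]
    highs = [max(axis) for axis in zip(*points)]
    ref_logs = []
    for _ in range(n_refs):
        # Reference set: same size as the data, uniform over its bounding box.
        reference = [
            tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in points
        ]
        ref_logs.append(math.log(kmeans_dispersion(reference, k, rng)))
    return sum(ref_logs) / n_refs - math.log(kmeans_dispersion(points, k, rng))


# Toy data (hypothetical): three well-separated 2-D blobs of 30 points each.
data_rng = random.Random(42)
blobs = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (10, 10), (20, 0)]
    for _ in range(30)
]
gaps = {k: gap_statistic(blobs, k) for k in range(1, 5)}
print(gaps)
```

On data like this, the gap value should rise sharply once k reaches the true number of blobs. For real work, prefer the package's own OptimalK, which handles clustering and parallelism for you.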

The methods implemented can cluster a given dataset using a range of provided k values, and provide you with statistics that can help in choosing the right number of clusters for your dataset. Three possible methods are:

Note that none of the above methods is guaranteed to find an optimal value for k, and that they often contradict one another. Rather, they can provide more information on which to base your choice of k, which should take numerous other factors into account.
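The disagreement between selection rules is easy to reproduce. The sketch below compares two well-known rules: taking the k with the largest gap value, and the rule from Tibshirani et al., which takes the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). The function names and the gap and s(k) values are made up for illustration, and the two rules shown are an assumption that may not match this package's methods exactly.

```python
def k_by_max_gap(ks, gaps):
    """Pick the k with the largest gap value."""
    return max(zip(ks, gaps), key=lambda pair: pair[1])[0]


def k_by_tibshirani_rule(ks, gaps, sks):
    """Pick the smallest k with Gap(k) >= Gap(k+1) - s(k+1).

    Assumes ks holds consecutive cluster counts. Returns None if no k qualifies.
    """
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return ks[i]
    return None


# Illustrative values only; these are not measured results.
ks = [1, 2, 3, 4, 5]
gaps = [0.10, 0.50, 0.55, 0.60, 0.58]
sks = [0.05, 0.05, 0.08, 0.04, 0.05]

print(k_by_max_gap(ks, gaps))               # k chosen by the largest gap
print(k_by_tibshirani_rule(ks, gaps, sks))  # k chosen by the paper's rule
```

With these numbers the two rules pick different values of k, because the gap curve keeps creeping upward after its first large jump. This is why the statistics are best treated as evidence to weigh rather than as a final answer.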


Use:

First, construct an OptimalK object. Optional initialization parameters are:

An example initialization:

from gap_statistic import OptimalK
optimalK = OptimalK(n_jobs=4, parallel_backend='joblib')

After the object is created, it can be called like a function on a dataset; it finds and returns the optimal K for that dataset. Parameters are:

For example:

import numpy as np
# X is your dataset: an array of shape (n_samples, n_features)
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))

After performing the search procedure, a DataFrame of gap values and other useful statistics for each passed cluster count is available as the gap_df attribute of the OptimalK object:

optimalK.gap_df.head()

The columns of the dataframe are:

Additionally, the relation between the above measures and the number of clusters can be plotted by calling the OptimalK.plot_results() method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which prints four plots: