Design and implement comprehensive kmeans clustering function computeKmeans

grigory93 commented 8 years ago

Functional spec

computeKmeans wraps the Aster kmeans SQL/MR function and adds comprehensive support for every phase of the kmeans algorithm:

data prep
initial group centroids
computing aggregates on clusters
iterating over number of clusters (K)
1. Data prep

Standardize each variable before running kmeans. This must be optional, but it should be the default behavior.

Also, k-means is sensitive to the order of observations, so re-partitioning the data set could be offered as an option.
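As a rough illustration of the standardization step, here is a minimal in-memory sketch in Python. It is not toaster code and not the in-database implementation; the function name `standardize`, the use of the sample standard deviation, and the handling of constant columns are all assumptions made for the example.

```python
import math

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column sd.

    Hypothetical helper for illustration; expects at least two rows.
    """
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    sds = []
    for j in range(dims):
        # Sample variance (n - 1 denominator), an assumption for this sketch.
        var = sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        # A constant column has sd 0; use 1.0 so we never divide by zero.
        sds.append(math.sqrt(var) or 1.0)
    scaled = [[(r[j] - means[j]) / sds[j] for j in range(dims)] for r in rows]
    return scaled, means, sds
```

Returning the means and standard deviations along with the scaled data makes it possible to map centroids back to the original units later.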

2. Initialization of centroids

Consider the following algorithms:

manual

Just pass K points as function parameter.

Forgy method (random)

Randomly choose K observations from the data set and use these as the initial means (centroids).
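A minimal sketch of the Forgy method in Python, assuming the data fits in memory as a list of rows (the function name `forgy_init` and the `seed` parameter are hypothetical, not part of toaster):

```python
import random

def forgy_init(rows, k, seed=None):
    """Pick k distinct observations at random and use them as initial centroids."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no row position is picked twice.
    return [list(r) for r in rng.sample(rows, k)]
```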

random partition

First randomly assign a cluster to each observation (row), then compute the centroid of each random cluster. Use these as the initial centroids.
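A minimal Python sketch of random partition initialization. One assumption goes beyond the description above: labels are dealt out evenly and then shuffled, so every cluster is guaranteed at least one row (this needs at least K rows). A purely independent random assignment could leave a cluster empty. The name `random_partition_init` is hypothetical.

```python
import random

def random_partition_init(rows, k, seed=None):
    """Assign every row to a random cluster, then return each cluster's centroid."""
    rng = random.Random(seed)
    dims = len(rows[0])
    # Balanced labels 0..k-1, shuffled: random assignment with no empty clusters
    # (assumption made for this sketch).
    labels = [i % k for i in range(len(rows))]
    rng.shuffle(labels)
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for row, label in zip(rows, labels):
        counts[label] += 1
        for j in range(dims):
            sums[label][j] += row[j]
    return [[s / counts[c] for s in sums[c]] for c in range(k)]
```

Centroids produced this way tend to sit near the overall mean of the data, in contrast to Forgy, which spreads the initial centroids across actual observations.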

sampling

Use hierarchical clustering on a sample of the data to obtain K clusters, then pick a point from each cluster (e.g. the point closest to the centroid). Make sure that the sample fits in main memory. Alternatively, we could run any clustering algorithm available in R on the sample, including k-means itself.
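The sampling approach could look roughly like the Python sketch below. It assumes a simple agglomerative scheme with centroid linkage (the spec does not name a linkage), a brute-force pair search that is only reasonable for small samples, and a sample of at least K rows. The function `sample_hclust_init` and its helpers are hypothetical names.

```python
import random

def _centroid(points):
    dims = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dims)]

def _sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sample_hclust_init(rows, k, sample_size, seed=None):
    """Cluster a random sample hierarchically, then pick one point per cluster."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    clusters = [[p] for p in sample]  # start with one cluster per sampled point
    # Centroid linkage (assumption): merge the two clusters whose centroids
    # are closest, until only k clusters remain.
    while len(clusters) > k:
        cents = [_centroid(c) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = _sqdist(cents[i], cents[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    # From each cluster, take the sampled point closest to the cluster centroid.
    picks = []
    for c in clusters:
        cen = _centroid(c)
        picks.append(list(min(c, key=lambda p: _sqdist(p, cen))))
    return picks
```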

"dispersed" set of points

Pick the first point at random, then pick the next point to be the one whose minimum distance from the already selected points is as large as possible. Repeat until we have K points.
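This is a farthest-first traversal. A minimal in-memory Python sketch, assuming Euclidean distance (the spec does not fix a metric) and using the hypothetical name `dispersed_init`:

```python
import random

def _sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dispersed_init(rows, k, seed=None):
    """Pick k mutually distant rows: first at random, then farthest-first."""
    rng = random.Random(seed)
    first = rng.randrange(len(rows))
    chosen = [first]
    # min_d[i] holds the squared distance from row i to its nearest chosen point.
    min_d = [_sqdist(r, rows[first]) for r in rows]
    while len(chosen) < k:
        # Next point: the row whose nearest chosen point is as far away as possible.
        nxt = max(range(len(rows)), key=lambda i: min_d[i])
        chosen.append(nxt)
        for i, r in enumerate(rows):
            d = _sqdist(r, rows[nxt])
            if d < min_d[i]:
                min_d[i] = d
    return [list(rows[i]) for i in chosen]
```

Because the method always selects the most extreme remaining point, it is sensitive to outliers; running it on standardized data or on a sample may soften that.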

canopy

~~Use SQL/MR function canopy to obtain K centroids and use them for initial centroids in kmeans~~ The canopy function is driver-based, which means it requires a Java client and a special jar installed to run it. This is beyond toaster's reach at the moment (unfortunately).

3. Computing kmeans

4. Iterating over K

5. Visualizing results

This is wide open to ideas. Examples:

Using a facet grid (rows are resulting clusters, columns are predictors), show the distribution of each variable within each cluster in a K x N plot.
Using a facet grid (rows and columns are variables), show clusters in the 2-dimensional spaces of all pairwise combinations of variables.

References:

k-means algorithm: https://en.wikipedia.org/wiki/K-means_clustering
Preprocessing: http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
Quick-R Cluster Analysis: http://www.statmethods.net/advstats/cluster.html

nswitanek commented 8 years ago

Gregory, in feature idea issues like this, it'd be nice to include context as to what's been done already, if anything. And if something has been written, please give the reader (potential contributor) guidance on where to find the code to review and build on.

grigory93 commented 8 years ago

branch kmeans is on github now. See files computeKmeans.R, plottingKmeans.R

grigory93 commented 8 years ago

merged kmeans into develop

teradata-aster-field / toaster