grigory93 closed this issue 8 years ago
Gregory, in feature idea issues like this, it'd be nice to include context on what's been done already, if anything. And if something has been written, it'd help to point the reader (a potential contributor) to the code to review and build on.
branch kmeans is on github now. See files computeKmeans.R, plottingKmeans.R
merged kmeans into develop
Functional spec
computeKmeans uses the Aster kmeans SQL/MR function and adds comprehensive support for every phase of the k-means algorithm:
1. Data prep
Standardize each variable before running kmeans. This must be optional, but it is the default behavior.
Also, k-means is sensitive to the order of observations, so re-partitioning the data set could be an option.
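The standardization step above can be sketched as follows. This is a minimal in-memory illustration in Python, not the computeKmeans code: the function name `standardize`, the `enabled` flag, and the choice of population standard deviation are all assumptions made for the example, and a real implementation would do the scaling in the database.

```python
from statistics import fmean, pstdev

def standardize(rows, enabled=True):
    """Rescale every column to mean 0 and standard deviation 1.

    rows is a list of equal-length numeric lists. With enabled=False the
    data is returned unchanged (standardization is optional but on by default).
    The name and signature are hypothetical.
    """
    if not enabled:
        return [list(r) for r in rows]
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    # Population standard deviation (an assumption). A constant column gets
    # scale 1.0 so it maps to zeros instead of causing a division by zero.
    scales = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(r, means, scales)] for r in rows]

data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(data)
```

After scaling, both columns contribute equally to Euclidean distance even though the second one was originally a hundred times larger.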
2. Initialization of centroids
Consider the following algorithms:
manual
Just pass K points as function parameter.
Forgy method (random)
Randomly choose K observations from the data set and use them as the initial means (centroids).
random partition
First randomly assign a cluster to each observation (row), then compute the centroid of each random cluster. Use these as the initial centroids.
sampling
Cluster a sample of the data using hierarchical clustering to obtain K clusters. Then pick a point from each cluster (e.g. the point closest to the centroid). Make sure that the sample fits in main memory. Alternatively, we could run any clustering algorithm available in R on the sample, including k-means itself.
"dispersed" set of points
Pick the first point at random, then pick the next point to be the one whose minimum distance from the already selected points is as large as possible. Repeat until we have K points.
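To make three of the strategies above concrete (Forgy, random partition, and the "dispersed" set of points), here is a small in-memory sketch in Python. It is an illustration only, not the package's code: the function names are hypothetical, it assumes at least K observations, and `random_partition` uses a shuffled round-robin labelling so that no cluster ends up empty, which deviates slightly from fully independent random assignment.

```python
import random
from math import dist

def forgy(points, k, rng):
    """Forgy: use k distinct randomly chosen observations as centroids."""
    return [list(p) for p in rng.sample(points, k)]

def random_partition(points, k, rng):
    """Label each point with a random cluster, then average each cluster."""
    # Shuffled round-robin labels guarantee every cluster is non-empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dims):
            sums[c][d] += p[d]
    return [[s / counts[c] for s in sums[c]] for c in range(k)]

def dispersed(points, k, rng):
    """Farthest-first: each new centroid maximizes its minimum distance
    to the centroids chosen so far."""
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(dist(p, c) for c in chosen)))
    return [list(p) for p in chosen]

rng = random.Random(42)
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0],
          [10.0, 11.0], [20.0, 0.0], [21.0, 0.0]]
centroids = dispersed(points, 3, rng)
```

On this toy data the farthest-first rule picks one point from each of the three visible groups regardless of the random starting point, whereas Forgy can place two initial centroids in the same group.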
canopy
Use the SQL/MR function canopy to obtain K centroids and use them as the initial centroids in kmeans.
The canopy function is driver-based, which means it requires a Java client and a special jar to be installed in order to run it. This is beyond toaster's reach at the moment (unfortunately).
3. Computing kmeans
4. Iterating over K
5. Visualizing results
This is wide open to ideas. Examples:
References: