himank / K-Means

K-Means Clustering using MapReduce

K-Means Clustering

K-means clustering is a classical clustering algorithm that uses an expectation-maximization-like technique to partition a set of data points into k clusters. It is commonly used in a range of classification applications. Because k-means is often run on large data sets, and because each point is assigned to its nearest center independently of every other point, it is a good candidate for parallelization.

The goal of this project was to implement a framework in Java for performing k-means clustering using Hadoop MapReduce.

In this problem, the input is a set of n one-dimensional points and the desired number of clusters is k = 3. Once the k initial centers are chosen, the mapper calculates the Euclidean distance from every point in the set to each of the 3 centers and emits each point together with its nearest center. The reducer collects all of the points assigned to a particular centroid, calculates a new centroid from them, and emits it.
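The map and reduce steps described above can be sketched in plain Java. This is only an illustration, not the code from this repository: it uses no Hadoop classes, the names KMeansIterationSketch, nearestCentroid, assign and mean are hypothetical, and it assumes that ties go to the lower-indexed centroid and that the new centroid is the arithmetic mean of its points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of one k-means iteration; no Hadoop classes are used.
class KMeansIterationSketch {

    // "Map" step: return the index of the centroid closest to the point.
    // In one dimension the Euclidean distance is the absolute difference.
    // Assumption: ties go to the lower-indexed centroid.
    static int nearestCentroid(double point, double[] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (Math.abs(point - centroids[i]) < Math.abs(point - centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    // Stand-in for the shuffle phase: group the points by the centroid
    // index that the map step emitted for them.
    static Map<Integer, List<Double>> assign(double[] points, double[] centroids) {
        Map<Integer, List<Double>> groups = new TreeMap<>();
        for (double p : points) {
            int key = nearestCentroid(p, centroids);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(p);
        }
        return groups;
    }

    // "Reduce" step: the new centroid is the mean of the points in one group.
    // Assumption: the group is non-empty, which holds for groups built by assign.
    static double mean(List<Double> points) {
        double sum = 0.0;
        for (double p : points) {
            sum += p;
        }
        return sum / points.size();
    }
}
```

Calling assign with the current centroids and then mean on each group gives the centroids for the next iteration. In a real Hadoop job the grouping is done by the framework between the mapper and the reducer.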

Termination Condition

The job stops iterating when the difference between the old and new centroids is less than or equal to 0.1.
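A minimal sketch of such a check is shown below. It assumes that "difference" means the absolute difference between each old centroid and its replacement, and that every centroid must be within the threshold. The text above does not spell out either point, and the names ConvergenceCheckSketch and hasConverged are hypothetical.

```java
// Sketch of the termination test; not taken from this repository.
class ConvergenceCheckSketch {

    // Returns true when every centroid moved by at most `threshold`
    // between two consecutive iterations (assumed interpretation).
    static boolean hasConverged(double[] oldCentroids, double[] newCentroids, double threshold) {
        if (oldCentroids.length != newCentroids.length) {
            throw new IllegalArgumentException("centroid arrays must have the same length");
        }
        for (int i = 0; i < oldCentroids.length; i++) {
            if (Math.abs(newCentroids[i] - oldCentroids[i]) > threshold) {
                return false; // at least one centroid is still moving
            }
        }
        return true;
    }
}
```

A driver would call hasConverged(previous, current, 0.1) after each MapReduce pass and schedule another pass while it returns false.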

Algorithm

Samples

For the initial centroids, this should be fine: 20.0 30.0 40.0

For the data, something as simple as this should work: 20 23 19 29 33 29 43 35 18 25 27
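To see how these samples behave, the loop can be simulated in memory with a short plain-Java sketch. It is not the Hadoop job from this repository. The class SampleRunSketch, the method run, and the maxIterations cap are hypothetical, and the sketch assumes that ties go to the lower-indexed centroid, that the new centroid is the mean of its points, and that an empty cluster keeps its old centroid.

```java
// In-memory simulation of the iterative job on small one-dimensional inputs.
class SampleRunSketch {

    static double[] run(double[] points, double[] initial, double threshold, int maxIterations) {
        double[] centroids = initial.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];

            // Assignment: add each point to the running sum of its nearest centroid.
            for (double p : points) {
                int best = 0;
                for (int i = 1; i < centroids.length; i++) {
                    if (Math.abs(p - centroids[i]) < Math.abs(p - centroids[best])) {
                        best = i; // strict "<" keeps the lower index on ties
                    }
                }
                sums[best] += p;
                counts[best]++;
            }

            // Update: mean of each cluster; an empty cluster keeps its old centroid.
            double maxShift = 0.0;
            double[] next = new double[centroids.length];
            for (int i = 0; i < centroids.length; i++) {
                next[i] = counts[i] == 0 ? centroids[i] : sums[i] / counts[i];
                maxShift = Math.max(maxShift, Math.abs(next[i] - centroids[i]));
            }
            centroids = next;

            // Stop once no centroid moved by more than the threshold.
            if (maxShift <= threshold) {
                break;
            }
        }
        return centroids;
    }
}
```

Under these assumptions, calling run with the sample data, the sample centroids, a threshold of 0.1 and a cap of 100 iterations moves the centroids to roughly 21.0, 30.6 and 43.0 in the first pass. The second pass leaves them unchanged, so the loop stops. The points 25 and 35 are equidistant from two of the initial centroids, so an implementation that breaks ties differently may produce other values.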