trz-maier / hwu-dmml-group1

F21DL_2019-2020 Data Mining and Machine Learning at Heriot-Watt University

Cluster the training data sets using k-means #8

Open trz-maier opened 5 years ago

trz-maier commented 5 years ago

Cluster the data sets train_smpl, trainsmpl

arjunshenoymec commented 5 years ago

I have some experience with clustering algorithms, particularly k-means. I can take this up and will provide updates here.

arjunshenoymec commented 5 years ago

k-means is an unsupervised learning technique that groups data points into clusters.

In an unsupervised learning technique there is no 'desired output' and no 'labels' are supplied with the data. It is entirely up to the system to infer labels/patterns from the data. Clustering is one of the two main types of unsupervised learning technique.

Clustering models can be further classified based on how the data points are grouped.

k-means is an iterative, centroid-based method. An iterative method is one in which certain parts of the algorithm are repeated until convergence or until a specified maximum number of iterations has been reached. The steps in k-means are as follows:

  1. The number of clusters (k) is chosen (randomly or by manual approximation).
  2. k centroids are placed at random positions in the n-dimensional data space, or k randomly chosen data points are used as the initial centroids.
  3. Each data point is assigned to the cluster of the centroid closest to it (closeness is calculated as the Euclidean distance between the data point and each of the centroids).
  4. Each centroid is 'moved' to the centre of its cluster (its co-ordinates are updated to the mean of the co-ordinates of the data points currently assigned to that cluster).

Step 4 can change cluster membership: once a centroid has moved, some points may now be closer to a different centroid and have to be reassigned. Steps 3 and 4 are therefore repeated until convergence (the cluster assignments no longer change between iterations) or until the maximum number of iterations has been reached.
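To make the loop concrete, here is a minimal pure-Python sketch of steps 2 to 4. It is only an illustration, not the code we will necessarily use on the coursework data: the function name `kmeans`, the `max_iters` and `seed` parameters, and the toy `data` list are all assumptions made up for this example, and it picks random data points as the initial centroids (the second option in step 2).

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None

    for _ in range(max_iters):
        # Step 3: assign each point to the index of its nearest centroid
        # (Euclidean distance via math.dist, Python 3.8+).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Converged: no point changed cluster since the previous pass.
        if new_labels == labels:
            break
        labels = new_labels

        # Step 4: move each centroid to the coordinate-wise mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))

    return centroids, labels


# Hypothetical toy data: two well-separated groups in 2-D.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
```

On data this cleanly separated, the first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random initial centroids, so only the grouping is meaningful, not the label values.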

A good interactive simulation of the algorithm: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/