Open trz-maier opened 5 years ago
I have some experience with clustering algorithms, particularly k-means. I can take this up and will post updates here.
k-means is an unsupervised learning technique that groups data into clusters.
In unsupervised learning there is no 'desired output' and no 'labels' are supplied with the data. It is entirely up to the system to infer labels/patterns from the data. There are two main types of unsupervised learning techniques:
Clustering models can be further classified based on how the data points are grouped.
Centroid models: the group to which a given data point belongs is determined by its distance from the 'centroid' of the group (e.g. k-means).
Density models: data points in the 'vicinity' of each other are grouped together by certain rules (e.g. DBSCAN).
There is a third class known as Distribution models which uses probability measures to determine which group a particular data point belongs to (I haven't looked into them as of now).
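To make the difference between the first two classes concrete, here is a minimal sketch in plain Python. The function names `nearest_centroid` and `eps_neighbours`, the `eps` radius and the sample points are all made up for illustration. A real density method such as DBSCAN does much more than a single neighbourhood query.

```python
import math

def nearest_centroid(point, centroids):
    """Centroid model: index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def eps_neighbours(point, points, eps):
    """Density-model building block: all points within `eps` of `point`."""
    return [p for p in points if math.dist(point, p) <= eps]

# Hypothetical toy data, chosen only for illustration.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]

labels = [nearest_centroid(p, centroids) for p in points]
close = eps_neighbours((0.0, 0.0), points, eps=1.0)  # includes the query point itself
```

The centroid view asks "which centre am I closest to?", while the density view asks "who is near me?". The second question is what lets density methods find clusters that are not blob-shaped.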
k-means is an iterative centroid method. An iterative method is one in which certain parts of the algorithm are repeated until convergence or until the specified maximum number of iterations has been reached. The steps in k-means are as follows:
Step 4 changes the clusters: some points may have to be added to a cluster and others removed from it, because they are now closer to a different centroid. So steps 3 and 4 are repeated until convergence (the clusters stop changing) or the maximum number of iterations has been reached.
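For reference, here is a rough sketch of the loop in plain Python. I am assuming steps 3 and 4 are the usual "assign each point to its nearest centroid" and "recompute each centroid as the mean of its points" pair. The function name `kmeans`, the parameters `max_iter` and `seed`, the choice of random data points as starting centroids and the toy data are all my own choices for illustration, not anything fixed by the task.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise: pick k distinct data points as starting centroids (one common choice).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        # Convergence: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups along the x-axis.
data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
centroids, labels = kmeans(data, k=2)
```

Which group gets label 0 and which gets label 1 depends on the random initialisation, so only the grouping itself is meaningful. On less tidy data, different seeds can also converge to different clusterings, which is why k-means is usually run several times.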
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/ is a good simulation example.
Cluster the data sets train_smpl, trainsmpl