Cluster the training data sets using k-means

trz-maier commented 5 years ago

Cluster the data sets train_smpl, trainsmpl (apply required filters and/or attribute selections if needed), using the k-means algorithm:

first excluding the class attribute (use classes to clusters evaluation to achieve this). This will emulate the situation when the learning is performed in an unsupervised manner.
then including the class attribute. This will emulate the general data analysis scenario.
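
To make the first variant concrete, here is a minimal sketch of what a classes-to-clusters style evaluation computes: the clustering is built without the class attribute, and only afterwards is each cluster mapped to the class that occurs most often inside it. The function name, the cluster ids and the class labels "a" and "b" are made up for illustration, and majority voting is an assumption about how the mapping is done; a given tool's own evaluation may differ in detail.

```python
from collections import Counter


def classes_to_clusters_accuracy(cluster_ids, class_labels):
    """Map each cluster to its majority class; return (mapping, accuracy)."""
    # Count how often each class label occurs inside each cluster.
    counts_by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        counts_by_cluster.setdefault(cid, Counter())[label] += 1
    # Each cluster is labelled with its most frequent class.
    mapping = {
        cid: counts.most_common(1)[0][0]
        for cid, counts in counts_by_cluster.items()
    }
    # A point counts as correct if its cluster's class matches its own class.
    correct = sum(
        mapping[cid] == label for cid, label in zip(cluster_ids, class_labels)
    )
    return mapping, correct / len(class_labels)


# Hypothetical output of a clustering run that never saw the class attribute.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "b"]
mapping, accuracy = classes_to_clusters_accuracy(cluster_ids, class_labels)
```

In this toy input, cluster 0 maps to "a" and cluster 1 maps to "b", so five of the six points agree with their cluster's class.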

arjunshenoymec commented 5 years ago

I have some experience with clustering algorithms, particularly k-means. I can take this up and will provide updates here.

arjunshenoymec commented 5 years ago

k-means is an unsupervised learning technique used for clustering.

In an unsupervised learning technique there is no 'desired output' or 'label' assigned along with the data. It is entirely up to the system to infer labels/patterns from the data. There are two main types of unsupervised learning techniques:

Clustering: Discovering patterns from the values in the data. The individual data points are grouped together such that points in the same group are 'similar' to each other. E.g. grouping similar movies, grouping customers based on purchasing behaviour etc.
Association: Discovering rules that underlie the data. E.g. people who buy burgers also seem to buy chips.

Clustering models can be further classified based on how the data points are grouped.

Centroid models: The group to which a given data point belongs is determined by its distance from the 'centroid' of the group (e.g. k-means).
Density models: Data points in the 'vicinity' of each other are grouped together by certain rules (e.g. DBSCAN).
There is a third class known as Distribution models, which use probability measures to determine which group a particular data point belongs to (I haven't looked into them as of now).

k-means is an iterative centroid method. An iterative method is one in which certain parts of the algorithm are repeated until convergence or until the specified maximum number of iterations has been reached. The steps in k-means are as follows:

1. The number of clusters (k) is chosen (randomly or by manual approximation).
2. k centroids are placed in the n-dimensional data space, or k random data points are chosen as the initial centroids.
3. Each data point is assigned to the cluster of the centroid closest to it (closeness is calculated as the Euclidean distance between the data point and each of the centroids).
4. Each centroid is 'moved' to the centre of its cluster (its co-ordinates are updated to the mean of the data points currently assigned to that cluster).

Step 4 can change the clusters: once a centroid has moved, some points may be closer to a different centroid, so points may have to be added to or removed from a cluster. Steps 3 and 4 are therefore repeated until convergence (the cluster assignments no longer change between iterations) or until the maximum number of iterations has been reached.
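
The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation we will use on the data sets: the function names, the `seed` and `max_iter` parameters and the toy points are all made up, the initial centroids are picked as random data points, and an empty cluster simply keeps its old centroid.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_centroid(point, centroids):
    """Index of the centroid nearest to the given point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignments) for a list of point tuples."""
    rng = random.Random(seed)
    # Step 2: choose k random data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        new_assignments = [closest_centroid(p, centroids) for p in points]
        # Convergence: stop once no point has changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its cluster.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = mean_point(members)
    return centroids, assignments


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initial centroids.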

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/ is a good simulation example.
