This is a Python implementation of hierarchical clustering that uses a dynamic programming approach.
Hierarchical clustering is a clustering algorithm that groups datapoints into clusters. This implementation follows the agglomerative approach, i.e. it starts with each datapoint as its own cluster and repeatedly merges the most similar clusters.
Time complexity of the algorithm: O(n^3), where n is the number of datapoints.
Linkages are the measures of distance between clusters (the smaller the distance, the greater the similarity). As described above, the algorithm repeatedly merges the two nearest clusters, which raises the question of how to measure the distance between two clusters.
There are multiple ways to define the distance between two clusters:
Single linkage: the distance between two clusters is the minimum distance between a datapoint in one cluster and a datapoint in the other.
Complete linkage: the distance between two clusters is the maximum distance between a datapoint in one cluster and a datapoint in the other.
Average linkage: the distance between two clusters is the average of the distances between every datapoint in one cluster and every datapoint in the other.
Centroid linkage: the distance between two clusters is the distance between their centroids.
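The four definitions above can be sketched in a few lines of plain Python. This is only an illustration, not the code from this repository: the function names are hypothetical, clusters are assumed to be lists of coordinate tuples, and Euclidean distance is assumed as the underlying point-to-point metric.

```python
import math
from itertools import product


def single_linkage(a, b):
    # Minimum distance over all cross-cluster pairs of datapoints.
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    # Maximum distance over all cross-cluster pairs of datapoints.
    return max(math.dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    # Mean distance over all cross-cluster pairs of datapoints.
    total = sum(math.dist(p, q) for p, q in product(a, b))
    return total / (len(a) * len(b))


def centroid(cluster):
    # Coordinate-wise mean of the datapoints in one cluster.
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def centroid_linkage(a, b):
    # Distance between the centroids of the two clusters.
    return math.dist(centroid(a), centroid(b))
```

For example, with `a = [(0, 0), (0, 1)]` and `b = [(3, 0), (4, 1)]`, single linkage gives 3.0 (the pair `(0, 0)` and `(3, 0)`), while centroid linkage gives 3.5 (the centroids are `(0, 0.5)` and `(3.5, 0.5)`).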
We use the concept of dynamic programming (memoization) to achieve better time complexity by avoiding recomputation of distances. We maintain a distance matrix that stores the distances between all pairs of datapoints. Thus, if we have n datapoints, the distance matrix is of order n x n, and it is symmetric. Since we want to merge the two closest clusters, we find the minimum distance in the matrix, merge the corresponding two clusters, and update the merged cluster's distances to all other clusters. We repeat this step until all the datapoints are in one cluster.
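The loop described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the repository's actual code: the function name `agglomerative_single_linkage` is hypothetical, Euclidean distance is assumed, and single linkage is assumed for the update step (the merged cluster's distance to any other cluster is the smaller of the two stored distances). Other linkages would need a different update rule.

```python
import math


def agglomerative_single_linkage(points):
    """Merge clusters until one remains; return the list of merges."""
    n = len(points)
    # Active clusters, keyed by the matrix row/column that represents them.
    clusters = {i: [i] for i in range(n)}
    # Symmetric n x n distance matrix, computed once and then only updated.
    dist = [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    merges = []
    while len(clusters) > 1:
        active = sorted(clusters)
        # Find the closest pair of active clusters in the matrix.
        best = None
        for pos, i in enumerate(active):
            for j in active[pos + 1:]:
                if best is None or dist[i][j] < best[0]:
                    best = (dist[i][j], i, j)
        d, i, j = best
        # Reuse stored distances: under single linkage the merged cluster's
        # distance to cluster k is min(d(i, k), d(j, k)).
        for k in active:
            if k != i and k != j:
                dist[i][k] = dist[k][i] = min(dist[i][k], dist[j][k])
        # Record the merge (as datapoint indices), then fold j into i.
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters.pop(j)
    return merges
```

Each returned entry holds the datapoint indices of the two merged clusters and the distance at which they were merged. In this sketch the search for the minimum costs O(n^2) per merge and there are n - 1 merges, which gives O(n^3) overall.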
For a more detailed description of the algorithm and code, visit: http://www.hhundiwala.com/hie_clust