open-connectome-classes / StatConn-Spring-2015-Info

introductory material

clustering algorithms #71

Open ghost opened 9 years ago

ghost commented 9 years ago

I think there were a few other methods of clustering mentioned in class besides k-means. What are some of them, and how are they different from k-means? Why did we pick k-means to use in class?

gkiar commented 9 years ago

There are certainly many clustering algorithms other than k-means. If I'm not mistaken, we chose k-means in class because it's a very common algorithm and was assumed to be familiar to most people. That, and the fact that it was blurted out first.

A good place to start wrapping your head around the types of clustering algorithms is simply the Wikipedia page on clustering (http://en.wikipedia.org/wiki/Cluster_analysis). K-means is a centroid-based clustering algorithm: each point is assigned to the nearest of K cluster centers, and those centers are chosen to minimize the distance to the points they represent. Connectivity-based clustering algorithms, like hierarchical clustering, instead assume that nearby elements are more likely to belong together than those far away. There are many other types of clustering as well, and the trick is either a) picking an algorithm, or b) formatting your data, such that you end up with clusters the way you 'expect' or find 'useful.'
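To make the centroid-based idea concrete, here is a minimal k-means sketch in plain numpy. This is not the implementation used in class, just an illustration of the assign/update loop; the function name and defaults are mine:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm): alternate between assigning
    each point to its nearest centroid and moving each centroid to the
    mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids
```

Note the random initialization: k-means only finds a local optimum, so in practice it is usually restarted several times and the best run kept.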

dlee138 commented 9 years ago

Another interesting clustering algorithm that can be compared to k-means is the Expectation-Maximization algorithm (EM). Unlike k-means, EM uses distributions instead of centroids to cluster: the algorithm tries to find a mixture of Gaussian distributions that fits the data. While we know the functional form of the distribution, the parameters are unknown. To figure them out, we can start by randomly initializing the means of the Gaussians (analogous to how we randomly initialize centroids in k-means). The number of Gaussians can be predetermined by the experimenter, the same way we select k, the number of centroids, in k-means.
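The E-step/M-step loop described above can be sketched for a one-dimensional, two-component Gaussian mixture. This is a bare-bones illustration, not a robust implementation (no log-space arithmetic, no convergence check), and the function name is mine:

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iters=200, seed=0):
    """Minimal EM for a 1-D Gaussian mixture: the E-step computes soft
    responsibilities of each component for each point, the M-step
    re-estimates weights, means, and variances from them."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)  # random means, like k-means init
    var = np.full(k, x.var())                  # start with broad components
    pi = np.full(k, 1.0 / k)                   # equal mixing weights
    for _ in range(n_iters):
        # E-step: responsibility (posterior probability) of each component
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted re-estimates of the parameters
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, resp
```

Each row of `resp` sums to one: that row is the soft assignment of one point across the k Gaussians, in contrast to the single hard label k-means would give it.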

mblohr commented 9 years ago

Some additional things to note about K-means (centroid-based) versus EM (model-based) clustering:

- Finding the optimal K-means solution is NP-hard and can become computationally intractable, whereas each EM iteration is cheap.
- K-means assumes similarly sized/shaped clusters, while EM allows for variable-shaped clusters.
- K-means outputs a hard assignment of each point to a cluster, whereas EM outputs a soft assignment, i.e., a probability of each point belonging to each Gaussian (or whatever statistical model you are fitting).

Choosing the right cluster model depends a bit on the data (shape, number of points, distribution) you are clustering. As mentioned above, certain datasets might be best clustered by a hierarchical (connectivity-based) model and not a centroid-based or model-based approach. One example, shown in class, is a dataset comprised of concentric rings.
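The concentric-rings point is easy to demonstrate: both rings share the same centroid, so no centroid-based method can separate them, while a simple connectivity-based rule can. Below is a sketch (synthetic rings, not the class dataset) that clusters by connected components of an ε-neighborhood graph, which is the core idea behind single-linkage clustering:

```python
import numpy as np

def connectivity_clusters(X, eps):
    """Cluster points by connected components of the graph that links
    any two points closer than eps (single-linkage-style clustering)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    adj = dist < eps
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        # flood-fill one connected component
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            for nb in np.flatnonzero(adj[j]):
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return labels

# two concentric rings: identical centroid, so k-means cannot split them
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
inner = np.c_[np.cos(t), np.sin(t)]        # radius 1
outer = 3 * np.c_[np.cos(t), np.sin(t)]    # radius 3
X = np.vstack([inner, outer])
labels = connectivity_clusters(X, eps=0.5)  # gap between rings > eps
```

With `eps` larger than the spacing along a ring but smaller than the gap between rings, each ring comes out as one component.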

ajulian3 commented 9 years ago

The most effective algorithms in my experience are KNN and DBSCAN. Strictly speaking, neither is centroid-based: KNN is a supervised classifier rather than a clustering algorithm, and DBSCAN is density-based. KNN is helpful when training on labeled data for further analysis. Given neural data labeled for the presence of a disease, KNN can help classify disease presence in new patient data. This makes it especially useful when standardizing information across multiple patients.
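Since DBSCAN came up, here is a minimal numpy sketch of it. This follows the standard algorithm (grow clusters outward from "core" points that have at least `min_samples` neighbors within `eps`; everything unreachable stays noise), but the parameter values in the comments are illustrative, not recommendations:

```python
import numpy as np

def dbscan(X, eps=0.5, min_samples=5):
    """Minimal DBSCAN: expand clusters from core points (points with at
    least min_samples neighbors within eps); points reachable from no
    core point are labeled -1 (noise)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)            # -1 = noise until claimed
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue                   # not core; may still join a cluster later
        labels[i] = cluster
        seeds = list(neighbors[i])
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_samples:
                    seeds.extend(neighbors[j])   # j is core: keep expanding
            if labels[j] == -1:
                labels[j] = cluster    # border or core point joins the cluster
        cluster += 1
    return labels
```

Unlike k-means, you never specify the number of clusters; it falls out of the density structure, and isolated points are explicitly flagged as noise rather than forced into the nearest cluster.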

jovo commented 9 years ago

@greg - remind me to talk about different clustering methods in class, and EM for fitting GMM, and KNN, and DBSCAN. thanks, j
