open-connectome-classes / StatConn-Spring-2015-Info

introductory material

Standard Clustering Algorithm? #82

Open sgomezr opened 9 years ago

sgomezr commented 9 years ago

After reading other "issues", I've noticed that the selection of the clustering algorithm depends solely on the researcher and on the purpose of the research project.

Given that there are many types of clustering algorithms (centroid models, hierarchical models, connectivity models, etc.) and that they may all give different clustering arrangements, is there a standard depending on the type of project being carried out? I mean... let's say we want to cluster nodes in a graph and use two different algorithms: one clusters the data the way we predicted, but the other one fails. How do we know which one to trust? How does the research community decide when to trust the results? In case such a "standard" hasn't been established, do you think it is possible to create one in the future?
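
For concreteness, here is a minimal sketch of the scenario in the question, using networkx and scikit-learn (my own choice of tools and of a toy graph; nothing here is specified in the thread): the same graph is clustered two ways, and each result is scored against the grouping we were predicting using the adjusted Rand index.

```python
# Minimal sketch (assumed tools: networkx, scikit-learn) of the scenario in
# the question: two clustering algorithms, one graph, each compared against
# the grouping we expected.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# A two-block graph whose "true" communities we think we can predict.
sizes = [50, 50]
probs = [[0.30, 0.02],
         [0.02, 0.30]]
G = nx.stochastic_block_model(sizes, probs, seed=0)
predicted = [G.nodes[v]["block"] for v in G]
A = nx.to_numpy_array(G)

# Algorithm 1: k-means on a 2-d spectral embedding of the adjacency matrix.
_, vecs = np.linalg.eigh(A)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vecs[:, -2:])

# Algorithm 2: hierarchical (agglomerative) clustering on the adjacency rows.
hc = AgglomerativeClustering(n_clusters=2).fit_predict(A)

# Agreement with what we predicted (1.0 = identical, ~0.0 = chance level).
print("k-means vs. prediction:     ", adjusted_rand_score(predicted, km))
print("hierarchical vs. prediction:", adjusted_rand_score(predicted, hc))
```

Of course, on real data there is no oracle "predicted" labeling to score against, which is exactly what the question is getting at.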

SandyaS72 commented 9 years ago

I think the whole reason a standard hasn't been established is that each clustering algorithm takes into account different characteristics of the graph, which may or may not be what you want in a given case. It's more important to understand what each one does so you can pick the one that makes the most sense. You can also try several methods, like you suggested, but if one "makes sense" and the other doesn't, it may be worth figuring out what basis the other one seems to cluster on. Take the example we saw at the beginning of class on Tuesday, where k-means seemed to cluster based on everything connected to a particular vertex, which wasn't what was "intuitive". In that case, maybe that's not what we wanted, but there might be cases where that is how the graph is structured, and then it would make sense. Long story short, the lack of a "standard" might be intentional. It's probably worse to blindly apply a single algorithm to everything without thinking about whether it makes sense.
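
As a toy reconstruction of that behavior (my own example, not the actual graph from class), k-means run on raw adjacency rows will tend to group the leaves of a hub together even though they share no edges with one another, simply because their rows all have a 1 in the hub's column:

```python
# Toy reconstruction (not the class example) of k-means grouping nodes by
# shared attachment to one high-degree vertex.
import networkx as nx
from sklearn.cluster import KMeans

G = nx.star_graph(10)                             # node 0 is a hub, leaves 1..10
G = nx.compose(G, nx.cycle_graph(range(11, 21)))  # a separate 10-node cycle
G.add_edge(10, 11)                                # one bridge joining the pieces

A = nx.to_numpy_array(G)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(A)
for lab in sorted(set(labels)):
    print(lab, [v for v, l in zip(G.nodes, labels) if l == lab])
# The hub's leaves typically land in one cluster despite having no edges among
# themselves: their adjacency rows are nearly identical (a single 1 in the
# hub's column), and that is all k-means can see in this feature space.
```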

maxcollard commented 9 years ago

In a related interpretation to Sandya's: which kinds of clustering algorithms "work" for your data (read: which algorithms give somehow interpretable results) can itself tell you something about the data. For example, if your data are separable by LDA (i.e., there are straight "lines" that divide the different clusters), that tells you something about the underlying structure of your data. Similarly, if your data are not separable by LDA but are separable by k-means, that tells you something different about the "shape" of your data.
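
A rough sketch of that kind of diagnostic (my own illustration; the tools and toy datasets are assumptions, not something from the thread): for a grouping you believe in, check whether a linear rule (LDA) can reproduce it and whether k-means rediscovers it. Which checks pass hints at the "shape" of the clusters.

```python
# Sketch of a "shape" diagnostic (assumed setup, scikit-learn): does a linear
# boundary reproduce a grouping, and does k-means rediscover it?
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import adjusted_rand_score

def shape_report(X, y, name):
    lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(f"{name}: LDA accuracy = {lda_acc:.2f}, "
          f"k-means ARI = {adjusted_rand_score(y, km):.2f}")

rng = np.random.default_rng(0)
y = np.r_[np.zeros(200, int), np.ones(200, int)]

# Compact, well-separated blobs: a straight line and k-means both succeed.
blobs = np.vstack([rng.normal([0, 0], 0.5, (200, 2)),
                   rng.normal([4, 4], 0.5, (200, 2))])

# Long, thin, parallel clusters: a straight line still separates them (LDA),
# but k-means prefers to cut across the long axis and misses the grouping.
elongated = np.vstack([rng.normal(0, 1, (200, 2)) * [6.0, 0.3],
                       rng.normal(0, 1, (200, 2)) * [6.0, 0.3] + [0.0, 2.0]])

shape_report(blobs, y, "blobs")
shape_report(elongated, y, "elongated")
```

Which combinations succeed or fail is the kind of information about the data's structure this comment is pointing at.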

Practically speaking, there are a few common approaches:

1. Become wedded to a particular algorithm. This is common among methods-ey people; usually, they are the ones who developed the algorithm in question.
2. Select an algorithm that fits your assumptions about the "shape" of the data. This is good if you're right, not so much if you're wrong. It is also rather subjective (Is power really log-normal? Is that process really Poisson?).
3. Blast your data with every algorithm known to man and note what works and what doesn't (a rough sketch follows below). I think this is a really important strategy that often gets overlooked, because it is basically an admission that we don't know anything about the data (spoiler alert: we usually don't). This strategy will basically never be published (maybe in PLOS, since they seem to be acknowledging that publishing negative results is important), but it can be very insightful. Unfortunately, it also leads to a lot of false positives that have to be weeded out à la (2) before interpretations emerge.
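
A minimal sketch of strategy (3), under assumed tooling (scikit-learn) and a stand-in dataset: run a small battery of clusterers on the same feature matrix and record how many clusters each finds along with an internal quality score, just to see which notions of "cluster" find any structure at all.

```python
# Minimal "try everything" battery (assumed tools: scikit-learn; the dataset
# is a stand-in, not real connectome data).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering, DBSCAN
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)

candidates = {
    "k-means":      KMeans(n_clusters=2, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=2),
    "spectral":     SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                                       random_state=0),
    "DBSCAN":       DBSCAN(eps=0.2),
}

for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)  # ignore DBSCAN noise
    score = silhouette_score(X, labels) if n_found > 1 else float("nan")
    print(f"{name:12s} clusters found: {n_found}   silhouette: {score:.2f}")
```

One caveat with this scoreboard: internal indices like the silhouette favor compact, round clusters, so "what works" still has to be interpreted against assumptions about shape, which is exactly the weeding-out step à la (2).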

How does the "scientific community" judge? I think it really depends on which "community" you're speaking to. IEEE will basically publish anything with p (or q) < 0.05 by some reasonable (or sometimes unreasonable) statistic, regardless of how esoteric the method is, because the emphasis is more on signal processing ("We found signal in the noise!"). On the flip side, many high-level neuro journals seem to eschew bulky, math-heavy methods as uninterpretable, and hence de facto agree on standard methods for certain tasks ("If LDA can't find it, it probably isn't real.").

TL;DR: No, there is not one algorithm to rule them all.