open-connectome-classes / StatConn-Spring-2015-Info

introductory material
18 stars 4 forks source link

Clustering data vs. graphs with no labels #87

Open dlee138 opened 9 years ago

dlee138 commented 9 years ago

When using a clustering algorithm, what differences occur between clustering a set of data (no edges, just sets of points in euclidean coordinate space) versus clustering a graph consisting of vertices and edges but no labels (vertices aren't assigned specific coordinate points and are only defined by an adjacency matrix).

jtmatterer commented 9 years ago

From the point of view of the clustering algorithm, it doesn't matter where your data came from (graph, etc), it takes a set of points in Euclidean space and finds good clusters for them. So, one answer to your question is none; there are no differences between clustering the vertices of a graph and Euclidean data.

However, I suspect this is not the answer you were looking for. Rather, you wanted to know why clustering the rows/columns of the adjacency matrix performed so poorly? This is an important distinction, because for the SBM, some embedding of vertices into R^d allow for very good theoretical guarantees of block recovery using clustering. So it's not that clustering graphs is necessarily bad, but the embedding into R^d provided by the adjacency matrix performs poorly for the clustering task at hand. (Furthermore, I contend that clustering the rows/columns of the adjacency matrix is a viable answer, but to a different question than the one we asked in class).

ajulian3 commented 9 years ago

Are there statistical tests to run to determine what clustering algorithm is the most optimal for utilizing adjacency matrices? Specifically, is it possible to develop an algorithm that determines which clustering method is the most informative for a given set of data?

edunnwe1 commented 9 years ago

Wouldn't the statistical decision theoretic be an algorithmic test? You specify that clustering occurs on the adjacency matrix and compare the loss and risk functionals for various different clustering algorithms?