magikker / TreeHouse-Private

TreeHouse development.
GNU General Public License v3.0

Clustering method #9

Open magikker opened 11 years ago

magikker commented 11 years ago

What: This would be a new function or set of functions which would take a set of trees (all trees?) and divide them into groups based on some measure.

Title: cluster (could use a better name?)

Input/Output: It would take a set of trees and return clusters of trees. Some clustering methods would require other inputs, such as the number of clusters or a similarity measure.

Implementation:

There are tons of ways to approach clustering. Some require you to know how many clusters you want, others "discover" clusters as they go. I might start exploring ideas in the "Hierarchical clustering" class of algorithms as it doesn't require you to know how many clusters you're looking for.

You'll need similarity/distance measures between trees, but also ones for groups of trees. You'll also need to implement measures to evaluate cluster quality.
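The group-level measures mentioned here are usually called "linkage" criteria. A minimal sketch of three common choices, assuming a precomputed pairwise tree-distance matrix; the function names are illustrative, not existing TreeHouse code:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Three common ways to lift a pairwise distance matrix D to a
// distance between two clusters (given as index sets a and b).
double single_linkage(const std::vector<std::vector<double>>& D,
                      const std::vector<int>& a, const std::vector<int>& b) {
    double best = std::numeric_limits<double>::max();
    for (int i : a) for (int j : b) best = std::min(best, D[i][j]);
    return best;  // closest pair across the two clusters
}

double complete_linkage(const std::vector<std::vector<double>>& D,
                        const std::vector<int>& a, const std::vector<int>& b) {
    double worst = 0.0;
    for (int i : a) for (int j : b) worst = std::max(worst, D[i][j]);
    return worst;  // farthest pair across the two clusters
}

double average_linkage(const std::vector<std::vector<double>>& D,
                       const std::vector<int>& a, const std::vector<int>& b) {
    double sum = 0.0;
    for (int i : a) for (int j : b) sum += D[i][j];
    return sum / (a.size() * b.size());  // mean over all cross-cluster pairs
}
```

Single linkage tends to chain clusters together; complete and average linkage favor compact groups, which matters for how the results below come out.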

Possible Variations: Endless.

Agglomerative hierarchical clustering, divisive hierarchical clustering, k-means, distribution-based clustering, ...

jaHoltz commented 11 years ago

Up and pushed is an implementation of agglomerative hierarchical clustering (in AnalysisFunctions.cc) named agglo_clust. My implementation takes a set of trees and returns a vector of treesets.

The process is fairly simple: it stores all of the trees in a map as individual clusters, then copies all of the indexes to a set called remaining. It then computes the distances using the distance function I wrote for silhouette (it may eventually take in a distance method or a distance matrix if that's preferable).

Then, using a stack of pairs representing cluster keys from the map and the distance of that cluster to its parent in the stack, it computes clusters to be merged. The stack is initialized to some element from remaining, and then a function is run to find the closest neighbor. The closest neighbor is a pair whose second value is its distance from the top item in the stack. If that distance is less than the distance between the top cluster and its parent, we simply add the neighbor we found to the stack (removing it from remaining) and repeat. If the neighbor is less similar to the top cluster than the parent is, we put the neighbor back in remaining and merge the top cluster and its parent (they are as similar as it is possible to be).

After merging the two clusters in the cluster map the distance matrix is then recomputed with the new cluster in mind, and the process is repeated. This ends when the number of clusters in the map is equal to the number of desired clusters (currently set to 4). The map is then moved to the return vector, because the cluster ids are kind of arbitrary for later use.
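The stack-based merge loop described above resembles the classic nearest-neighbor-chain formulation of agglomerative clustering. Here is a self-contained sketch of that idea; it is not the actual agglo_clust, and the names and the average-linkage choice are assumptions:

```cpp
#include <algorithm>
#include <limits>
#include <map>
#include <vector>

using Cluster = std::vector<int>;

// Average-linkage distance between two clusters over a precomputed
// symmetric distance matrix D between individual trees.
double cluster_dist(const std::vector<std::vector<double>>& D,
                    const Cluster& a, const Cluster& b) {
    double sum = 0.0;
    for (int i : a) for (int j : b) sum += D[i][j];
    return sum / (a.size() * b.size());
}

std::vector<Cluster> agglo_clust_sketch(
        const std::vector<std::vector<double>>& D, std::size_t k) {
    std::map<int, Cluster> clusters;  // live clusters, arbitrary keys
    for (std::size_t i = 0; i < D.size(); ++i)
        clusters[static_cast<int>(i)] = {static_cast<int>(i)};

    while (clusters.size() > k) {
        std::vector<int> chain = {clusters.begin()->first};  // the "stack"
        while (true) {
            int top = chain.back();
            const Cluster top_cluster = clusters[top];
            // Find the nearest neighbor of the cluster on top of the stack.
            int nn = -1;
            double best = std::numeric_limits<double>::max();
            for (const auto& [key, c] : clusters) {
                if (key == top) continue;
                double d = cluster_dist(D, top_cluster, c);
                if (d < best) { best = d; nn = key; }
            }
            // If the nearest neighbor is the cluster just below the top
            // (its "parent"), the two are mutually closest: merge them.
            if (chain.size() >= 2 && nn == chain[chain.size() - 2]) {
                Cluster merged = clusters[top];
                merged.insert(merged.end(),
                              clusters[nn].begin(), clusters[nn].end());
                clusters.erase(top);
                clusters.erase(nn);
                clusters[top] = merged;  // reuse one key for the merge
                break;                   // restart the chain
            }
            chain.push_back(nn);  // otherwise keep walking toward a merge
        }
    }
    std::vector<Cluster> out;
    for (const auto& [key, c] : clusters) out.push_back(c);
    return out;
}
```

This sketch restarts the chain after every merge for simplicity; the stack bookkeeping and recomputed distance matrix in the real implementation serve the same purpose.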

I've tested this with a small number of cases to see how it works, and it appears to function. It forms clusters, although whether it does this as fast/efficiently as possible I'm not sure.

magikker commented 11 years ago

Tested on the new NSArobs dataset.

a = agglo_clust({0..100})

Cluster : 0 Trees : 0
Cluster : 1 Trees : 1 2
Cluster : 3 Trees : 3 4 5 6 7 8
Cluster : 9 Trees : 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

a = [{0}, {}, {}, {9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73}]

It seems like the full clustering isn't returned in a. Clusters 1 and 2 are missing, and the last cluster is missing values.

Regarding "Cluster : 9": what are you printing here? The cluster number or the number of trees? I think you've ended up printing the first tree.

Aside from that it looks like it's clustering. Awesome!

jaHoltz commented 11 years ago

That number is just an arbitrary cluster value. It isn't really important to keep in the future, it's the first tree because that's the original key of the cluster in the map. I'm not even really sure why I kept or printed it out. It doesn't actually affect anything.

As for the values in the return, I have no idea why they aren't all being kept in the return set; I'll look into that. Obviously all of the values exist, because I'm able to print them out. Also, the clustering seems very odd. I wish I knew more about how sensible it is to cluster so many things into the last cluster. I'm going to try to test that myself and silhouette the output just to see what happens.

I wonder if the weird return values for vectors of treesets have anything to do with the odd return value on the clustering algorithm?

magikker commented 11 years ago

I actually think I know what's going on there. ... it's something with duplicate trees or with vectors of sets... it's not your clustering stuff. I'm looking into it.

jaHoltz commented 11 years ago

I'd just decided that the two problems were definitely related based on the odd things it does when you store vectors of treesets in interactive mode. It seems to keep arbitrary elements in some situations and leave empty sets in others. Does it have something to do with pqlsymbol.h and value_to_string() maybe?

magikker commented 11 years ago

yeah, something's up with how they're being stored and/or recalled. I'm not sure what yet, but I need to do some digging in PQLsymbol.

magikker commented 11 years ago

Got it fixed on my end so it would seem like clustering is working now.

jaHoltz commented 11 years ago

So agglo_clust appears to favor generating one or two clusters that are very large and leaving all of the other sets very small. Somehow I don't think that's useful behavior unless it's just caused by the datasets. Also, while it doesn't seem to do this on the demo, for both the big demo and the medium dataset it essentially clusters them in order (anything that gets clustered and isn't just left by itself ends up with consecutively indexed trees). I'm looking to see if that's an error or something I've overlooked, but so far I haven't found anything.

magikker commented 11 years ago

Oh, wow... OK, something is checked in and overwriting your stuff, and I've got no idea what it is. But yeah: make clean, ./configure, and make.

jaHoltz commented 11 years ago

It seems like the trees in the medium dataset are fairly similar in general. In fact, some of them are identical as far as I can tell. I suspect that's the reason for the ordering, and possibly for the large clusters being created. From that and looking at the code, I can't see anything wrong with agglo_clust, so it appears to work as intended.

magikker commented 11 years ago

Right, so with real data sets we expect to see a certain degree of similarity. As the tree inference algorithm goes through treespace it's jumping from one tree to the next, and we kind of expect those trees to have some similarity based on ordering.

We can either find or craft a data set to better test the clustering power of agglo_clust. I've got a set of 20K trees that came from two 10K runs of MrBayes. The trees in each run are fairly self-similar, but there are differences between runs. We could try it on them if you're interested.

jaHoltz commented 11 years ago

I'd be willing to try it on that set to see how it works out. If it works well, all the better; if it doesn't, there isn't any reason we shouldn't want to find out.

magikker commented 11 years ago

New set added. It's in the data folder under "no branch"; things are way smaller with no branch lengths.

Grant

jaHoltz commented 11 years ago

I've been running random sets through the agglo_clust function, and I think something needs to be decided about the number of clusters it should return. I'm thinking it should probably just take the number of desired clusters as an input, but if anyone has a better idea, I could go with something else as well. As for the results of running a lot of different sets in the newest trz file: it still seems to like to put them together by ordering, but it isn't as extreme if you use data from both halves of the data and not just one half. It also responds strangely to being given multiple sequences of trees, like {1..50, 75..100}; I'm not actually sure TreeHouse interprets that the way I think it does.

magikker commented 11 years ago

[image: mds-taxa-150-trees-20000-k-2-clusters-0]

magikker commented 11 years ago

[image: normalizedrf]
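The "normalizedrf" plot presumably refers to the normalized Robinson-Foulds distance. A minimal sketch, assuming each tree is summarized by its set of non-trivial bipartitions (the string encoding is an illustration, not the TreeHouse representation):

```cpp
#include <set>
#include <string>

// Normalized Robinson-Foulds distance between two unrooted binary
// trees on the same n taxa, given each tree's set of non-trivial
// bipartitions. Each tree has n - 3 internal edges, so dividing the
// symmetric difference by 2 * (n - 3) puts the result in [0, 1].
double normalized_rf(const std::set<std::string>& a,
                     const std::set<std::string>& b, int n_taxa) {
    int sym_diff = 0;
    for (const auto& bp : a) if (!b.count(bp)) ++sym_diff;  // in a only
    for (const auto& bp : b) if (!a.count(bp)) ++sym_diff;  // in b only
    return static_cast<double>(sym_diff) / (2.0 * (n_taxa - 3));
}
```

A distance of 0 means the trees share every bipartition; 1 means they share none.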

jaHoltz commented 11 years ago

So I was going to run phlash from TreeHouse to check my distance matrix against the phlash-generated one, but phlash isn't in the location where it's expected to be, or anywhere else as far as I can tell. Where can I get phlash, or is it just not where I expect it to be?

magikker commented 11 years ago

I'm trying to make sure I didn't break too many things with my global-removal stuff, but I'll get phlash uploaded as part of the project soon.

jaHoltz commented 11 years ago

The graph generated by the code I've been using and gnuplot: [image: test]

The graph generated by R multidimensional scaling of the same dataset (and also graphed with R): [image: graph]

In general, neither of them is especially similar to the graph you have, so it would be useful to know what portion of the NSArob data you computed your graph from. I've been using the first thousand from each of the runs (so 2000 points).

magikker commented 11 years ago

Is that NSArobs? The data that parses so neatly is the one in the "no branch" folder. NSArobs is kind of expected to be a mess.

jaHoltz commented 11 years ago

Hmm... It sounds like I had the two sets confused; that would explain why the data doesn't look the same. I'll go back and work on it with the other dataset.

jaHoltz commented 11 years ago

So after switching to the set I was actually supposed to be using, the data looks like it was always intended to look. I'm not going to post a picture, but it looks just like you expected. I'm sorry about the confusion caused by mixing up the two sets we had to work with here.

marclsmith commented 11 years ago

Great news, and well done, Jarrett! :-) Thanks, Marc

magikker commented 11 years ago

Great news. So I'd guess you've probably got a working clustering function and distance functions. Very cool.