magikker / TreeHouse-Private

TreeHouse development.
GNU General Public License v3.0
0 stars 0 forks source link

Silhouette (cluster quality metric) #19

Closed magikker closed 11 years ago

magikker commented 11 years ago

What: This would be a function that would print sillhouette widths for a input groups of trees. This is one of the most basic methods for evaluating cluster quality. It is defined here. http://en.wikipedia.org/wiki/Silhouette_%28clustering%29

Title: Silhouette

Input/Output: It would take mulitple groups of trees (so a vector of sets of trees?) and return the sillhouette widths. .

Implementation ideas: This can be implemented before we have a clustering method as we can treat groups of trees assigned by the user or by a search to be our clusters.

Implementation will depend on a similarity measure of some sort. Number of bipartitions in common is probably the easiest as we can just ping the bipartition table for that data.

Possible Variations: Other measure of similarity ( quartets? common-taxa?)

magikker commented 11 years ago

From Jarrett Holtz:

So I'm looking at working on the silhouette issue in the tracker, and I'm just looking for some input on if there is a better way to do this in TreeHouse.

My current idea which seem like it would work, is to get the bipartitions for every tree, compare those bipartitions and store the difference in bipartitions between every tree in the set. I would then need to iterate over these distances once for every tree to determine the average distance to all other trees in the cluster it is contained in, and once to compute the smallest average distance to another cluster which would be it's "neighboring cluster". Then I would have to use those values for each tree to compute the actual silhouette width for each tree.

This all seems like a lot of computation. Is there some easily discovered superior method for this in TreeHouse (or in general), that I haven't thought of, or is that something that I should attempt to solve?

magikker commented 11 years ago

Ok, so all the clustering stuff whether just plain clustering on the clustering quality metrics (like sillhouette) require some measure of similarity between the things being clustered.

So, maybe the first step is a distance function. Now Habitat, which TreeHouse is part of, has HashRF and Phlash which are fast methods for producing tree distance. One option is to call functions from one of those programs to get your distances, another option is to use them to compute a whole distance matrix for the trees via the commandline interface with one of the programs, then read the matrix in. I can upload those programs to the current repo if you'd like to take a look.

The another option is to write something compute distance. Odds are that it wouldn't be as fast as phlash but it'd probably work... If you want to pursue this option I'd be tempted to do things a bit different than what's in HashRF and Phlash as to not repeat work. One thing that I've considered is creating bitstrings for each tree that would have 1 for the presence of each bipartition and 0 for the absences. So it'd be a matrix trees long by unique_bpiartitons wide. You could treat the bitstrings like vectors in unique_bpiartitons-deminsional-space.... and the distance between two vectors can be the cosine between them. That'd be fairly fast, and not what'd being done in Phlash or HashRF (at least I'm fairly certain.)

TreeHouse has an inverted_index data structure that came over from TreeZip (I believe). It's the number of trees long and for each tree there's a vector of which bipartitions that tree has. This is structure is going to be a good starting point for creating a distance measure.

I'm not sure if there's anything in TreeHouse already that'll help with the amount of computation that needs to be done for a function like this.... but what I can suggest is to write things modularly. That way you can start by brute forcing a solution (if you so chose) but then plug in more elegant functions later... plus you could imagine doing silhouette based on multiple distance measures. Ideally it'd be really easy to plug a different one in.

Finally, if you need a new data structure, or think we should modify one we have in order to accomplish something we can do that. Everything is easier when it's laid out in the right data structure for the job.

I'm going to append this and your email to the issue, encase someone else in the future wants to use it for reference, Grant

macember commented 11 years ago

In response to the inverted tree table, I thought that didn't exist?

In Mark Adamo's thesis, he talks about how the "TreeTable" does not benefit from redundancy because trees can share many of the same bipartitions, and for that reason he created get_tree_bipartitions() to return the bipartitions that a given tree has

So, do we still have an inverted tree table or was it removed by Mark?

On Tue, Jun 11, 2013 at 2:11 PM, magikker notifications@github.com wrote:

Ok, so all the clustering stuff whether just plain clustering on the clustering quality metrics (like sillhouette) require some measure of similarity between the things being clustered.

So, maybe the first step is a distance function. Now Habitat, which TreeHouse is part of, has HashRF and Phlash which are fast methods for producing tree distance. One option is to call functions from one of those programs to get your distances, another option is to use them to compute a whole distance matrix for the trees via the commandline interface with one of the programs, then read the matrix in. I can upload those programs to the current repo if you'd like to take a look.

The another option is to write something compute distance. Odds are that it wouldn't be as fast as phlash but it'd probably work... If you want to pursue this option I'd be tempted to do things a bit different than what's in HashRF and Phlash as to not repeat work. One thing that I've considered is creating bitstrings for each tree that would have 1 for the presence of each bipartition and 0 for the absences. So it'd be a matrix trees long by unique_bpiartitons wide. You could treat the bitstrings like vectors in unique_bpiartitons-deminsional-space.... and the distance between two vectors can be the cosine between them. That'd be fairly fast, and not what'd being done in Phlash or HashRF (at least I'm fairly certain.)

TreeHouse has an inverted_index data structure that came over from TreeZip (I believe). It's the number of trees long and for each tree there's a vector of which bipartitions that tree has. This is structure is going to be a good starting point for creating a distance measure.

I'm not sure if there's anything in TreeHouse already that'll help with the amount of computation that needs to be done for a function like this.... but what I can suggest is to write things modularly. That way you can start by brute forcing a solution (if you so chose) but then plug in more elegant functions later... plus you could imagine doing silhouette based on multiple distance measures. Ideally it'd be really easy to plug a different one in.

Finally, if you need a new data structure, or think we should modify one we have in order to accomplish something we can do that. Everything is easier when it's laid out in the right data structure for the job.

I'm going to append this and your email to the issue, encase someone else in the future wants to use it for reference, Grant

— Reply to this email directly or view it on GitHubhttps://github.com/magikker/TreeHouse-Private/issues/19#issuecomment-19281505 .

magikker commented 11 years ago

So we've still got an inverted index, which is is a <vector vector>... the outer vector is the number of trees long. Each inner vector has the bipartition indexes associated with that tree.

I'd have to dig through his commit to see if the inverted index should have been removed and wasn't... or if he removed a different redundant data structure. But what I do know is that the inverted index is still there.

jaHoltz commented 11 years ago

So I pushed a solution to this issue, here's what I did.

Starting with the input, currently it only accepts two clusters as two separate vectors of ints. I would like to change this to a vector of sets of trees, but I still don't really have much of an idea on how to to set that up in the UserFunctions file so that it won't just return errors.

Then the distance measure, I wrote my own essentially. I collected the bipartitions for each tree and stored them in a vector for each tree. Then I took the intersection of these vectors for the trees I was comparing, and subtracted the size of the resulting intersection vector from the sizes of the original bipartition vectors for each tree. I added these together and divided by two to get the RF distance between the two trees. My helper function does this for every pair of trees and stores them in a 2d vector (some of which is unnecessary for homogeneous trees but maybe not for heterogeneous trees in the future). Since this all happens in an outside function it shouldn't be too hard to use different distance measure as we acquire them.

Using the 2d vector of distances of then compute the average distance to the cluster containing each tree, and the average distance to the closest cluster (neighboring cluster) using two separate outside functions that return these as a vector of floats length = # of trees.

Those values being the necessary values to compute the silhouette I then use those two vectors to compute the silhouette and return a vector containing the silhouette for each tree.

On Thu, Jun 13, 2013 at 11:19 AM, magikker notifications@github.com wrote:

So we've still got an inverted index, which is is a >... the outer vector is the number of trees long. Each inner vector has the bipartition indexes associated with that tree.

I'd have to dig through his commit to see if the inverted index should have been removed and wasn't... or if he removed a different redundant data structure. But what I do know is that the inverted index is still there.

— Reply to this email directly or view it on GitHubhttps://github.com/magikker/TreeHouse-Private/issues/19#issuecomment-19398720 .

magikker commented 11 years ago

Awesome great work.

I was able to test this today. It seemed to work, though we probably need a known test case to be sure. I think we can comment some of the print statements and just print the widths at the cluster level (I tried it on the big data set and tree by tree is too much information) and return the width of each input cluster as a float to the commandline, that way if someone want's to do something with those numbers they can.

jaHoltz commented 11 years ago

I agree, a known test case would definitely be the best way to make sure that silhouette does what is intended. I'll try and work on that at some point. I commented out the print statement today and replaced it with a cluster level width statement, as well as changing the return from outputting tree level silhouette width to cluster level silhouette width.

marclsmith commented 11 years ago

We should make creating a test case our number one priority. If we can't construct a test case, how will we know when we've implemented silhouettes correctly?

On Mon, Jun 17, 2013 at 2:30 PM, jaHoltz notifications@github.com wrote:

I agree, a known test case would definitely be the best way to make sure that silhouette does what is intended. I'll try and work on that at some point. I commented out the print statement today and replaced it with a cluster level width statement, as well as changing the return from outputting tree level silhouette width to cluster level silhouette width.

— Reply to this email directly or view it on GitHubhttps://github.com/magikker/TreeHouse-Private/issues/19#issuecomment-19564619 .

jaHoltz commented 11 years ago

Alright, so the base case is that the silhouette of any 1 tree in a cluster by itself should always be 1. That's true in the current silhouette function, and the easiest way for me to test anything more specific is to see the trees I have to work with easily and determine the silhouettes between subsets of them by hand. Currently user functions is broken though, so it's hard to look at the trees in the demo set, which seems like the best way to write tests of the function. So I will write test cases for silhouette and ensure that it functions as intended as soon as I can do that.

On Mon, Jun 17, 2013 at 2:52 PM, Marc Smith notifications@github.comwrote:

We should make creating a test case our number one priority. If we can't construct a test case, how will we know when we've implemented silhouettes correctly?

On Mon, Jun 17, 2013 at 2:30 PM, jaHoltz notifications@github.com wrote:

I agree, a known test case would definitely be the best way to make sure that silhouette does what is intended. I'll try and work on that at some point. I commented out the print statement today and replaced it with a cluster level width statement, as well as changing the return from outputting tree level silhouette width to cluster level silhouette width.

— Reply to this email directly or view it on GitHub< https://github.com/magikker/TreeHouse-Private/issues/19#issuecomment-19564619>

.

— Reply to this email directly or view it on GitHubhttps://github.com/magikker/TreeHouse-Private/issues/19#issuecomment-19566145 .

jaHoltz commented 11 years ago

Created a known test case for silhouette uncommenting testclust() from Treehouse.cc will run the tests on agglo_clust and silhouette, comments in the test function in AnalysisFunctions include the values I calculated as the silhouette of the function in question. There's probably a cleaner way to set up those tests, but it seems like silhouette computes proper values so far.