`guppy trim` cuts out the fat

matsen commented 12 years ago

Connor and I have been having trouble doing PCA on big trees, because it requires solving eigenproblems on square matrices that are 10,000+ in each dimension.

I realized this morning (head slap) that we just need to trim the reference tree down to what is represented in the collection of placeruns. This will make everything faster, and will make things like PCA trees easier to visualize.

We will select a subset of the leaves (see below) and then take the subtree induced by those leaves.

That is, if we were to take the induced subtree of

   ^
  / \
 /\  \
 ab  /\
    c d

with the leaf set (a,d), we would get the tree

   ^
  / \
 /   \
 a    \
      d

with branch lengths obtained by summing the branch lengths along each chain of edges whose internal nodes no longer bifurcate. This is just like for the prepsim, but it's crucial that we "heal" the cuts along the edges, so that we don't have things like (((a)),b).
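A minimal Python sketch of the induced-subtree step with healing, for illustration only. The `Node` class, `induced_subtree`, and `to_newick` are hypothetical names, not pplacer code, and the branch lengths in the example are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical tree node; `length` is the length of the edge above it."""
    name: Optional[str] = None  # leaf label; None for internal nodes
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def induced_subtree(node: Node, keep: Set[str]) -> Optional[Node]:
    """Return the subtree induced by the leaf names in `keep`, or None if empty."""
    if not node.children:
        return Node(node.name, node.length) if node.name in keep else None
    kids = [k for k in (induced_subtree(c, keep) for c in node.children)
            if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        # Heal the cut: a node with one surviving child is removed and the
        # two edge lengths are summed, so we never emit things like ((a)).
        only = kids[0]
        return Node(only.name, only.length + node.length, only.children)
    return Node(node.name, node.length, kids)


def to_newick(node: Node) -> str:
    """Serialize to a Newick fragment (no trailing semicolon)."""
    label = node.name or ""
    if node.children:
        label = "(" + ",".join(to_newick(c) for c in node.children) + ")" + label
    return f"{label}:{node.length:g}"


# The ((a,b),(c,d)) tree from above, with made-up branch lengths.
example = Node(children=[
    Node(length=2.0, children=[Node("a", 1.0), Node("b", 1.0)]),
    Node(length=3.0, children=[Node("c", 1.0), Node("d", 1.0)]),
])
print(to_newick(induced_subtree(example, {"a", "d"})))  # (a:3,d:4):0
```

Leaf a ends up on an edge of length 1 + 2 and leaf d on an edge of length 1 + 3, with no unary nodes left behind.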

Now, how do we select the leaf set given a mass distribution? As a first go, I propose we travel from the root to the leaves, totaling up mass as we go. If the total mass once we arrive at the leaves is less than a threshold, then the leaf gets thrown out.
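The root-to-leaf selection rule could be sketched as below. This is an assumption-laden illustration: the `children` and `edge_mass` dictionaries and the `select_leaves` name are hypothetical encodings chosen for brevity, not pplacer data structures.

```python
from typing import Dict, Hashable, List, Set


def select_leaves(children: Dict[Hashable, List[Hashable]],
                  edge_mass: Dict[Hashable, float],
                  root: Hashable,
                  min_path_mass: float = 0.001) -> Set[Hashable]:
    """Keep the leaves whose root-to-leaf path carries enough mass.

    children[n] lists the child ids of n (leaves have no entry);
    edge_mass[n] is the mass on the edge above n (missing means 0).
    """
    keep = set()
    stack = [(root, edge_mass.get(root, 0.0))]
    while stack:
        node, total = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # A leaf is thrown out only if its path total is below the cutoff.
            if total >= min_path_mass:
                keep.add(node)
            continue
        for kid in kids:
            stack.append((kid, total + edge_mass.get(kid, 0.0)))
    return keep


children = {"root": ["x", "y"], "x": ["a", "b"], "y": ["c", "d"]}
edge_mass = {"x": 0.5, "a": 0.3, "d": 0.2}
print(sorted(select_leaves(children, edge_mass, "root")))  # ['a', 'b', 'd']
```

Note that leaf b survives even though its own edge carries no mass, because the internal edge above it does. That follows directly from totaling mass along the whole path.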

I hope this can unfold in several steps.

First, implement the above as guppy trim, which takes a placerun list (placefile list or split placefile) and a --min-path-mass flag for the above-described cutoff. The mass will be calculated as follows: unitize the mass for each placerun individually, then take the overall average of those collections of masses. It seems like some code from guppy_mft could be factored out for this. The --min-path-mass flag will take a float argument whose default is 0.001 (displayed in the command line help). A note: all of this averaging may be expensive, but we don't really care about the exact attachment locations, so if there is some way of just summarizing on an edge-by-edge basis, that's fine.
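The edge-by-edge summary mentioned in the note might look like the following sketch. It assumes each placerun has already been reduced to a dictionary from edge id to total mass; `unitize` and `average_mass` are hypothetical helper names, not functions from guppy_mft.

```python
from collections import defaultdict
from typing import Dict, List


def unitize(edge_mass: Dict[int, float]) -> Dict[int, float]:
    """Scale one placerun's per-edge masses so they sum to one."""
    total = sum(edge_mass.values())
    if total <= 0:
        raise ValueError("placerun carries no mass")
    return {edge: mass / total for edge, mass in edge_mass.items()}


def average_mass(placeruns: List[Dict[int, float]]) -> Dict[int, float]:
    """Average the unitized per-edge masses over all placeruns."""
    if not placeruns:
        raise ValueError("need at least one placerun")
    avg = defaultdict(float)
    for run in placeruns:
        for edge, mass in unitize(run).items():
            avg[edge] += mass / len(placeruns)
    return dict(avg)


runs = [{1: 2.0, 2: 2.0}, {2: 1.0, 3: 3.0}]
print(average_mass(runs))  # {1: 0.25, 2: 0.375, 3: 0.375}
```

Because each placerun is unitized first, a sample with many reads does not dominate the average, and the result itself sums to one.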

Please push that so that @cmccoy and @metasoarous can try it out as a preprocessing step to edge PCA etc. We will be curious to see how much a guppy trim step changes the shape of the PCA plots, and if it improves things on the superedge side.

Second, figure out how we can incorporate this with reference packages. The trick will be that we will then have jplace files whose tree is not the same as the tree in the reference package. My proposal is to relax this assumption, and instead allow trees that are induced subtrees of the tree in the reference package. This will require a bit of work to write an appropriate validator. It seems to me that we can use the taxonomy just fine: we will just use a subset of the species names.

matsen commented 12 years ago

What we would really like is to have the original placements filtered, not their versions after turning them into mass.

It seems to me that this can be obtained in several stages.

1. Convert all of the placeruns into mass and average them to get one big placerun.
2. Using this, decide which leaves are going to get cut.
3. Traverse the tree to figure out the numbering changes that will result, along with the edges that will get thrown out.
4. Go back through the list of placements, throwing out mass and placements that appear on edges that disappear.
5. Renormalize the placement likelihood weight ratios and posterior probabilities.
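The filtering and renormalization stages could be sketched as follows. Everything here is a simplifying assumption: a read is a `(name, placements)` pair, a placement is an `(edge, lwr)` pair, and `edge_map` is a hypothetical old-to-new edge numbering in which edges that disappear are simply absent. Attachment positions along edges are ignored, and posterior probabilities would be handled the same way as the weight ratios.

```python
from typing import Dict, List, Tuple

Placement = Tuple[int, float]          # (edge id, like weight ratio)
Pquery = Tuple[str, List[Placement]]   # (read name, placements)


def filter_pqueries(pqueries: List[Pquery], edge_map: Dict[int, int]):
    """Drop placements on removed edges, renumber, and renormalize.

    Returns (kept, discarded): kept maps each surviving read to a dict of
    new edge id -> renormalized weight; discarded lists reads left with
    no placements at all.
    """
    kept, discarded = {}, []
    for name, placements in pqueries:
        merged: Dict[int, float] = {}
        for edge, lwr in placements:
            if edge in edge_map:
                new_edge = edge_map[edge]
                # Healed edges can receive weight from several old edges.
                merged[new_edge] = merged.get(new_edge, 0.0) + lwr
        total = sum(merged.values())
        if total <= 0:
            discarded.append(name)
            continue
        # Renormalize so the surviving weights sum to one.
        kept[name] = {edge: w / total for edge, w in merged.items()}
    return kept, discarded


pqueries = [("r1", [(0, 0.6), (5, 0.2), (7, 0.2)]), ("r2", [(7, 1.0)])]
kept, discarded = filter_pqueries(pqueries, {0: 0, 5: 1})
print(kept, discarded)
```

In this example read r1 loses its placement on edge 7 and has the remaining weights rescaled, while r2 has nothing left and lands in the discarded list.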

This last step should make the lwr and pp sum to one. I would also like to know which other subcommands produce non-normalized values of these.

I would also like to spit the list of discarded read names to STDOUT (with a "Discarded reads:" header if this is nonempty), unless a --discarded flag is supplied. If supplied, the flag names a file into which the list of discarded reads should be written instead.
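One possible reading of that reporting behavior, as a small sketch. The `report_discarded` function and its `path` and `out` parameters are hypothetical stand-ins for the proposed --discarded flag and STDOUT, not an existing guppy interface.

```python
import sys
from typing import List, Optional, TextIO


def report_discarded(names: List[str],
                     path: Optional[str] = None,
                     out: TextIO = sys.stdout) -> None:
    """Report discarded read names, one per line.

    With `path` (standing in for --discarded), write the names to that file.
    Otherwise print them to `out` under a header, but only if any exist.
    """
    if path is not None:
        with open(path, "w") as handle:
            handle.writelines(name + "\n" for name in names)
    elif names:
        print("Discarded reads:", file=out)
        for name in names:
            print(name, file=out)


report_discarded(["r2", "r9"])
```

Whether an empty file should still be created when the flag is given but nothing was discarded is left open by the proposal; this sketch creates one.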

koadman commented 12 years ago

I have just come across this same issue, and the related issue of visualizing the edge PCA on a giant tree. In my datasets the samples will have few to no reads on many of the branches, and it would be nice to have the option of not including those taxa in the resulting visualization. One way I thought I could do this was the rather circuitous approach of placing all the samples, using rppr voronoi to cut down the ref package, re-placing the reads, then doing pca. But that is not ideal for various reasons.

matsen commented 12 years ago

We're working on this one and it should be complete in the next couple of days.

On Mon, Feb 6, 2012 at 1:26 PM, Aaron Darling reply@reply.github.com wrote:

> I have just come across this same issue, and the related issue of visualizing the edge PCA on a giant tree. In my datasets the samples will have few to no reads on many of the branches, and it would be nice to have the option of not including those taxa in the resulting visualization. One way I thought I could do this was the rather circuitous approach of placing all the samples, using rppr voronoi to cut down the ref package, re-placing the reads, then doing pca. But that is not ideal for various reasons.


Frederick "Erick" Matsen, Assistant Member Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org/

matsen / pplacer

`guppy trim` cuts out the fat #216