matsen / pplacer

Phylogenetic placement and downstream analysis
http://matsen.fredhutch.org/pplacer/
GNU General Public License v3.0
75 stars 18 forks source link

Have mft be able to invert the effect of copy number #181

Closed matsen closed 12 years ago

matsen commented 13 years ago

This issue is a very good idea by @skembel, and has been discussed with Connor, who knows what's going on too. Don't hesitate to contact them with questions.

Motivation

Sometimes the gene that we are using for our phylogenetic analysis comes in multiple copies. That leads to artifactually high apparent numbers of reads for some taxa, and artifactually low for others. We would like for guppy mft to be able to rescale so that this effect is canceled out to the best of our ability.

Implementation

This issue will make a --copy-number flag for mft, that will work as follows. For each placement x for a given pquery we will infer the copy number for that placement, say c_x. Assume that the confidence of placement x (either ML weight ratio or PP, depending on the --pp flag) is p_x. Then the expected copy number for the pquery is the sum of p_x times c_x across x. Call this number C.

We then scale the mass for that pquery by dividing by C. That is, if that pquery had multiplicity 2, and C was 5, then the resulting mass in the namlom will be a Named_float with value 2/5.

Algorithm

We infer copy number by assuming that the change along edges of a phylogenetic tree arises by Brownian motion. This leads to the following algorithm for estimating the copy number at the root of a phylogenetic tree. The algorithm proceeds recursively up the tree, estimating copy number at every internal node. Now, say that a given internal node n has k children, and that the copy number c_i has been inferred at each one of them (1 <= i <= k). Assume that the branches leading to the k children have length q_i (1 <= i <= k).

We use this information to infer the copy number at "our" internal node n. The formula is c_n = (sum_i (c_i/q_i)) / (sum_i (1/q_i)). We proceed up the tree in this fashion, eventually inferring the value at the root node.

Now, in general we don't to infer the value at the root node, we want to infer the value for a given placement. It turns out that the inferred value for a given placement is equivalent to rerooting the tree at the attachment location of the placement and then doing the inference of the root state for that rerooted tree.

We could do inference via rerooting like this. However, we would have to reroot the tree for every single placement of every pquery, which would be slow and not too elegant. I suggest the following, which is analogous to what happens in the Like_stree module. For every internal node, infer the distal copy number and the proximal copy number as follows. The distal copy number for a given internal node is the copy number inferred from the leaves below that node, doing the averaging procedure as described above. The proximal copy number for a given internal node n is the number that would be inferred for the tree of all of the leaves above n which has been rooted at the node just above n. The proximal copy number for the root node is undefined.

These proximal copy numbers can be calculated on a second pass through the tree. Say we have a situation as follows:

   |
   |
  /|\
 / | \
a  b  c

Call the internal node n (i.e. the MRCA of a, b, and c). As an example, we will compute the proximal copy number for the node b. Say the distal copy numbers for nodes a and c are c_a and c_c, and the branch lengths are q_a and q_c, respectively. Assume the proximal copy number for node n is p_n and the branch length above n is q_n. The proximal copy number for node b, then, is the weighted average of that for the other nodes, namely (c_a/q_a + c_c/q_c + p_n/q_n)/(1/q_a + 1/q_c + 1/q_n). Note that in the case where the MRCA of a, b, and c is the actual root of the tree, simply ignore that term. This can be achieved by setting p_n to zero and q_n to infinity (?).

Now, for a placement on a given edge e, find the distal and proximal copy numbers (c_d and c_p, respectively) for the internal node just below (distal to) the edge e. If the placement has distal_length q, and e is of length l, then the inferred copy number will be (c_d/q + c_p/(l-q))/(1/q + 1/(l-q)) just as if we had rerooted at the attachment location of the placement and did root node ancestral copy inference.

@skembel has R code to do this inference via rerooting; he will be sending this on. Please use his code to check things, first making sure that we are getting the right values for the distal inference, which is easy.

matsen commented 13 years ago

Whoops, typo in number for commit message. Duhhh...

cmccoy commented 13 years ago

For this to follow Felsenstein's independent contrasts, branch lengths to internal nodes should be lengthened by:

(l_1 * l_2) / (l_1 + l_2)

Where l_1 and l_2 are the branch lengths to the children. See page 10 of http://anolis.oeb.harvard.edu/~liam/programs/idc/doc/Felsenstein_1985.pdf