matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0
16 stars 2 forks source link

local branching metric harmonization #99

Open wsdewitt opened 2 years ago

wsdewitt commented 2 years ago

Background

There have been persistent discrepancies between the local branching metrics provided by gctree.CollapsedTree.local_branching() and those provided by partis, including on the unit test in gctree. Here I'll walk through analytical calculations of the integrals for some very simple trees, as a basis for discussion with @psathyrella about what the source of the discrepancy is.

Message passing scheme

For reference, the following excerpt from Neher et al. (2014) describes the LBI message passing scheme for computing the exponentially discounted tree length focal to each node:

image

Additional implementation details in gctree/partis

In gctree/partis there are a few additional steps/notes:

  1. In trees that don't have time-calibrated branch lengths, we use the number of mutations on a branch as an estimate of its branch length. If a branch has no mutations, it is assigned a default/prior branch length $\tau_0$ for the purpose of the message integration limits, where $\tau_0\le\tau$ is an additional parameter. This allows clonal offspring count toward LBI fitness, whereas using a branch length 0 would result in vanishing messages from clonal offspring.
  2. For the purpose of message computations, the root node is given a branch length of infinity (which integrates to $\tau$), as a heuristic to stabilize/smooth LB metrics near the root.
  3. In addition to the LBI of Eqn 19, we compute a local branching ratio (LBR), defined as $$LBRi = \frac{\sum{j}m_{\uparrow ij}}{m{\downarrow i}}.$$
  4. Although rule 3 results in nonzero root node LBR values (given the infinite root branch noted above, which integrates to $\tau$), the root node LBR is re-assigned to zero or NaN.

Single-clone tree

As a first exercise, consider a tree that is a clonal polytomy of 8 identical sampled cells (with no mutations):

image

In the notation of genotype-collapsed trees, this would be represented as a single node (let's denote it index $0$) with abundance 8:

image

Down message

According to rule 2 above, we add an infinite branch above the root of the tree, so $$m_{\downarrow 0} = \int_0^\infty e^{-x/\tau}dx = \tau.$$

Up messages

According to rule 1 above, each of the 8 clonal children $j$ contributes upward message $$m_{\uparrow j} = \int_0^{\tau_0} e^{-x/\tau}dx = \tau(1-e^{-\tau_0/\tau}).$$ So $$\sumj m{\uparrow j} = \sum_j \tau(1-e^{-\tau_0/\tau}) = 8\tau(1-e^{-\tau_0/\tau}).$$

LBI

$$\lambda_i = \sumj m{\uparrow ij} + m{\downarrow i} = \tau(1 + 8(1-e^{-\tau_0/\tau})).$$

If we take $\tau=\tau_0=1$, we have $$\lambda_i = 1 + 8(1-e^{-1}) \simeq 6.0569644706.$$

LBI from gctree

Using gctree we can validate this analytical result for $\tau=\tau_0=1$. First, create the tree object:

import ete3
import gctree

ete_tree = ete3.TreeNode(name=0)
ete_tree.abundance = 8
ete_tree.isotype = {}

tree = gctree.CollapsedTree()
tree.tree = ete_tree
tree.render("%%inline")

image Now, evaluate LBI for $\tau=\tau_0=1$:

tree.local_branching(tau=1, tau0=1)
print(tree.tree.LBI)
6.056964470628461