tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

"mutation" mode for statistics #2982

Open petrelharp opened 2 months ago

petrelharp commented 2 months ago

In #2948 @tforest is working on adding time windows to statistics. How this should work for mode="branch" is clear; however, it is not clear for mode="site". (It doesn't make sense to have time windows for mode="node", btw.) The reason is that site mode sums over all alleles (thus mimicing what happens with real data); and if there's been multiple mutations, there could be more than one mutation at different times that led to the same allele.

So, what we'd really like to do, and what people probably mostly imagine is happening with mode="site", is to sum over all mutations, rather than alleles. This would be essentially equivalent to mode="branch", but measuring branch area by counting how many mutations are on it instead of doing span times length. So - we're proposing adding a new mode, called "mutation" to statistics, that does this.

In a bit more detail. Recall that to compute a statistic we have some weights and a summary function, $f()$. For a node $n$ let $w_T(n)$ be the total weight of all samples below the node in tree $T$. Then the branch-mode statistic is $$\sum_T |T| \sum_n \ell_T(n) f(w_T(n)) ,$$ where $|T|$ is the span of the tree and $\ell_T(n)$ is the length of the branch above $n$ in tree $T$. And, if we define $w_s(a)$ to be the total weight of all samples carrying allele $a$ at site $s$, then the site-mode stat is $$\sum_s \sum_a f(w_s(a)) .$$ We can rewrite this as a sum over trees, pedantically, as $$\sumT \sum{s \in T} \sum{a \in s} f(\sum{d_s(m) = a} w_T(nm)),$$ where the sums are over, respectively, trees; sites in that tree; and alleles at that site; while "$\sum{d_s(m) = a}$" means "the sum over all mutations at site $s$ whose derived allele is $a$"; and $n_m$ is the node for mutation $m$. This looks complicated but it's just what you'd imagine. So, we're proposing that mutation-mode just doesn't group together the distinct mutations producing the same allele: $$\sumT \sum{s \in T} \sum{a \in s} \sum{d_s(m) = a} f(w_T(n_m)) .$$ If polarised=False, then the sum would also include "the root", i.e., include a term for $f(\bar w - \sum_m w_T(n_m))$, as if there were a mutation to the ancestral state at the root as well. (And, maybe the stat should be polarised by default, unlike other stats?)

There have been some other requests for (essentially) this, mostly around divergence. By trying to compute exactly-what-you-get-from-sequences (in mode="site") we have in a sense removed the advantage of the tree sequence that lets us distinguish one from multiple mutations (in principle).

Finally: the branch stat is the expected value of the site stat, given the trees, under infinite sites neutral mutations. This mode would remove that last caveat: the branch stat is the expected value of the mutation stat, as long as mutations are neutral (and Poisson).

hyanwong commented 2 months ago

This sounds like a very good idea to me. I have several times wondered if such a thing should exist.