Closed nspope closed 2 months ago
Ah, I see this quote in the documentation:
"This means that the sum of an unpolarised AFS will be equal to the total number of alleles that are inherited by any of the samples in the tree sequence, divided by two."
So I think the branch-mode folded AFS satisfies this: the sum of its entries is 1/2 of segregating sites. But the site-mode folded AFS does not: the sum of its entries equals segregating sites. Is this expected?
Well gee this sure seems like a bug, but if so I'm surprised since I thought we'd thought hard about all these permutations (but there were a lot of them...). Let's see: the one half happens here, mirroring a similar accounting here. It's possible the (correct) one-half from the site code got moved over into the branch code without thinking hard enough about it. The situation is different because in the site code there's the ancestral state that contributes to the calculation, while in the branch code accounting for the ancestral state would look like this (here):
    x = [tree.num_tracked_samples(node) for tree in trees]
    not_x = [len(s) - tree.num_tracked_samples(node) for s, tree in zip(sample_sets, trees)]
    # Note x must be a tuple for indexing to work
    if polarised:
        S[tuple(x)] += t.branch_length(node) * tr_len
    else:
        x = fold(x, out_dim)
        S[tuple(x)] += 0.5 * t.branch_length(node) * tr_len
        S[tuple(not_x)] += 0.5 * t.branch_length(node) * tr_len
However, that would be redundant, since fold(x) == fold(not_x).
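To make the redundancy concrete, here is a minimal sketch for a single sample set. It assumes a simple one-dimensional fold that maps a count k out of n samples to min(k, n - k); the names fold_count and accumulate are hypothetical and are not the helpers used in the tskit code, which handle several sample sets at once. The branch list describes a three-sample tree whose branches subtend 2, 1, 1 and 1 samples.

```python
def fold_count(k, n):
    # Hypothetical 1-D fold: a count k out of n samples maps to the minor count.
    return min(k, n - k)


def accumulate(branches, n, split):
    # branches: list of (number of samples below the branch, branch length).
    # split=True mimics adding half the weight to fold(x) and half to fold(not_x);
    # split=False adds the whole weight to fold(x).
    S = [0.0] * (n + 1)
    for k, length in branches:
        if split:
            S[fold_count(k, n)] += 0.5 * length
            S[fold_count(n - k, n)] += 0.5 * length
        else:
            S[fold_count(k, n)] += length
    return S


n = 3
# fold(x) == fold(not_x) for every possible count
assert all(fold_count(k, n) == fold_count(n - k, n) for k in range(n + 1))

branches = [(2, 1.0), (1, 2.0), (1, 1.0), (1, 1.0)]
split_result = accumulate(branches, n, split=True)
whole_result = accumulate(branches, n, split=False)
print(split_result == whole_result)  # True
```

Under that assumption, both halves of the split land in the same bin, so splitting the weight changes nothing and no factor of one half survives in the total.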
Oh, and here's a MWE (doesn't show anything different to the above, just makes the situation totally clear):
import tskit
t = tskit.TableCollection(sequence_length=4.0)
a = t.nodes.add_row(time=2, flags=0)
b = t.nodes.add_row(time=1, flags=0)
t.edges.add_row(left=0, right=1, parent=a, child=b)
s = t.sites.add_row(position=b-1, ancestral_state='A')
t.mutations.add_row(site=s, derived_state='C', node=b)
n = t.nodes.add_row(time=0, flags=1)
t.edges.add_row(left=0, right=1, parent=a, child=n)
s = t.sites.add_row(position=n-1, ancestral_state='A')
t.mutations.add_row(site=s, derived_state='C', node=n)
n = t.nodes.add_row(time=0, flags=1)
t.edges.add_row(left=0, right=1, parent=b, child=n)
s = t.sites.add_row(position=n-1, ancestral_state='A')
t.mutations.add_row(site=s, derived_state='C', node=n)
n = t.nodes.add_row(time=0, flags=1)
t.edges.add_row(left=0, right=1, parent=b, child=n)
s = t.sites.add_row(position=n-1, ancestral_state='A')
t.mutations.add_row(site=s, derived_state='C', node=n)
t.sort()
ts = t.tree_sequence()
for p in (True, False):
    for m in ('site', 'branch'):
        print(['unpolarised', 'polarised'][p], m,
              ts.allele_frequency_spectrum(polarised=p, mode=m, span_normalise=False))
which produces
polarised site [0. 3. 1. 0.]
polarised branch [0. 4. 1. 0.]
unpolarised site [0. 4. 0. 0.]
unpolarised branch [0. 2.5 0. 0. ]
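One way to read these numbers is to fold the two polarised spectra by hand and compare them with the unpolarised ones. The sketch below assumes a simple one-dimensional fold (weight for count k is added to min(k, n - k)); fold_afs is a hypothetical helper written for this comparison, not a tskit function, and the arrays are copied from the output above.

```python
def fold_afs(afs):
    # Hypothetical 1-D fold: move the weight at count k to min(k, n - k).
    n = len(afs) - 1
    folded = [0.0] * len(afs)
    for k, weight in enumerate(afs):
        folded[min(k, n - k)] += weight
    return folded


# Values copied from the printed output above.
polarised_site = [0.0, 3.0, 1.0, 0.0]
polarised_branch = [0.0, 4.0, 1.0, 0.0]
unpolarised_site = [0.0, 4.0, 0.0, 0.0]
unpolarised_branch = [0.0, 2.5, 0.0, 0.0]

folded_site = fold_afs(polarised_site)
folded_branch = fold_afs(polarised_branch)

print(folded_site == unpolarised_site)      # True
print(folded_branch == unpolarised_branch)  # False
print([0.5 * w for w in folded_branch] == unpolarised_branch)  # True
```

So the site-mode unpolarised AFS equals the folded polarised one, while the branch-mode unpolarised AFS is exactly half of it.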
Currently thinking it's a bug; will see if I still think that on Monday.
Just talked this through with @nate. The conclusion is:
[...] site version, so
[...] branch version by just first computing the polarized version and then folding it, instead of having separate algorithms.
I probably owe @jeromekelleher an apology for not catching that the first time around!
No apologies required @petrelharp - this stuff is super confusing and I could also have spotted it!
I think we can just do a straight fix here and document the change as a Bug Fix, right?
We could
Note that the test in (3) should also hold for site stats if the mutations are done with infinite sites.
Option 1 + 3 seems good?
The following is a little unintuitive to me:
the entries of the branch-mode folded AFS are a factor of 2 from the site-mode folded AFS ... Shouldn't mode='branch' give the expectation exactly here, like it does for segregating sites and the unfolded AFS?