tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
170 stars 84 forks source link

AFS folding (polarised=False) #2269

Open sunyatin opened 3 months ago

sunyatin commented 3 months ago

Dear all,

I was checking variable implementations of the (joint) allele frequency spectrum folding and went through the polarisation subsection in msprime. I am not sure I clearly understood the procedure implemented in ts.allele_frequency_spectrum(..., polarised=False), so all my apologies if I misinterpreted something...

Consider for instance a simple two-population AFS, $\xi$ (with haploid sample sizes $n_0 = n_1$), I thought I understood that any site with derived allele counts $i$ and $j$ where $i+j=(n_0+n_1)/2$ would add a half-count in both $\xi(i,j)$ and $\xi(n_0-i,n_1-j)$. However, looking into the spectrum, it seems that all cells checking the condition of ambiguous allele minority do not equal their reverse (which I expect would be the case if indeed the said sites were following the implementation described previously).

Am I misunderstanding the msprime implementation of folding?

Many thanks for your lights!

All the best,

Rémi

petrelharp commented 3 months ago

Hm, well, maybe this bit answers your question? (It's talking about sites with more than two alleles)

Now, if we are computing the unpolarised AFS, we add one half to each entry of the folded output corresponding 
to each allele count including the ancestral allele. What does this mean? Well, polarised=False means that we 
cannot distinguish between an allele count of 6 and an allele count of 4. So, folding means that we would add our 
allele that is seen in 6 samples to AFS[4] instead of AFS[6]. So, in total, we will add 0.5 to each of AFS[4], AFS[3], 
and AFS[1]. This means that the sum of an unpolarised AFS will be equal to the total number of alleles that are 
inherited by any of the samples in the tree sequence, divided by two. Why one-half? Well, notice that if in fact 
the mutation that produced the b allele had instead produced an a allele, so that the site had only two alleles, 
with frequencies 7 and 3. Then, we would have added 0.5 to AFS[3] twice.

... if not, can you provide a concrete example (and the code to generate it)?

sunyatin commented 3 months ago

Thank you Peter! Well, this is the passage which I have a bit of a hard time extrapolating to a joint 2D-AFS, for instance. But here is an example.

I consider two populations of size 10000 diploids each, which diverged 6000 generations ago. I sample 2 diploids (4 haploids) in each population.

Herebelow I build a polarised and unpolarised (folded) joint 2D-AFS using only one site, to showcase the problem I am considering. Specifically, I am looking at a site where the MAF is 50% (the "ambiguous" minority I was referring to in my previous message). Here, the genotypes are [1,1,1,1] in Pop A and [0,0,0,0] in Pop B.

Everything is fine for the polarised 2D-AFS.

However, for the folded version, I would expect that both entries $\xi(4,0)$ and $\xi(0,4)$ have each the value of 0.5. This is not the case: the returned folded 2D-AFS has an entry of 1 in $\xi(0,4)$... Yet, there is no reason to assume that the site is "fully" in this configuration, given that it has equal chance to be in the reverse one, given that we don't know which allele is really the major one? I think I am still misunderstanding the folding procedure done in msprime...

import msprime as msp
import numpy as np

def get_demo(Pops):
    No = 10000.
    Ts = 6000.
    #
    Demo = msp.Demography()
    for pop in Pops:
        Demo.add_population(name = pop,
                    default_sampling_time = 0.,
                    growth_rate = 0.,
                    initially_active = True,
                    initial_size = No)
    Demo.add_population_split(time = Ts, derived = ["M"], ancestral = "A")
    return Demo

Pops = ["A", "M"]
n_diploid_samples_per_pop = 2

Samples = [msp.SampleSet(n_diploid_samples_per_pop, ploidy=2, population=pop, time=0.) for pop in Pops]
Demo = get_demo(Pops)

# Simulate
ts = msp.sim_ancestry(Samples, demography=Demo, sequence_length=1e6, random_seed=12)
tsm = msp.sim_mutations(ts, rate=1e-8, random_seed=12, model="binary")

# Extract only one focal site with ambiguous MAC
ambiguous_minor_allele_count = 0.5*np.sum(np.array([x.num_samples*x.ploidy for x in Samples]))
sites_id_to_remove = []
first = True
for v in tsm.variants():
    g = v.genotypes
    if (first is True) and (np.sum(g) == ambiguous_minor_allele_count):
        print(v)
        first = False
    else:
        sites_id_to_remove += [v.site.id]
tsm_sub = tsm.delete_sites(sites_id_to_remove)

# Resulting genotype matrix
print(tsm_sub.genotype_matrix())

# AFS
Afs_Pops = [tsm_sub.samples(0), tsm_sub.samples(1)]
print(tsm_sub.allele_frequency_spectrum(Afs_Pops, polarised = True, mode="site", span_normalise=False))
print(tsm_sub.allele_frequency_spectrum(Afs_Pops, polarised = False, mode="site", span_normalise=False))

So, this is the variant at the single retained site:

Variant(site=Site(id=1, position=1061.0, ancestral_state='0', mutations=[Mutation(id=1, site=1, node=12, derived_state='1', parent=-1, metadata=b'', time=26708.361655509398)], metadata=b''), alleles=('0', '1'), genotypes=array([1, 1, 1, 1, 0, 0, 0, 0], dtype=int8))

We can see there is a single mutation.

The corresponding genotypes: [[1 1 1 1 0 0 0 0]]

The polarised 2D-AFS:

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]]

The unpolarised (folded) 2D-AFS:

[[0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

But this is the folded 2D-AFS I would expect:

[[0. 0. 0. 0. 0.5]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0.5 0. 0. 0. 0.]]

Many thanks again for your lights!

sunyatin commented 3 months ago

Oh... this is interesting. If I now select a site with genotypes [[0 0 0 0 1 1 1 1]] (i.e. corresponding to the flipped version compared to my previous example) (here again, single mutation happening on the site), here are the 2D-AFS:

Polarised:

[[0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

Folded:

[[0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

So now, the folded is identical to the polarised...

NB. In all examples, I consider only biallelic sites.

Edit: Part of the code changed:

sites_id_to_remove = []
first = True
for v in tsm.variants():
    g = v.genotypes
    if (first is True) and (all((g-np.array([0,0,0,0, 1,1,1,1])) == 0)):
        print(v)
        first = False
    else:
        sites_id_to_remove += [v.site.id]
tsm_sub = tsm.delete_sites(sites_id_to_remove)
petrelharp commented 3 months ago

Ah, okay - thanks for the examples, this is much clearer (not that your initial message wasn't clear, but it's clearer to me now I've seen the example). I think that bit I quoted is irrelevant, since that's just about sites with more than two alleles. So, it sounds like you're expecting a matrix with half the counts in each of the two places they could be assigned to (i.e. , ξ[i,j] == ξ[n-i,m-j])? Instead, we're returning an array in which some of the elements have been zeroed out. For instance:

>>> tsm.allele_frequency_spectrum([tsm.samples(0)], span_normalise=False, polarised=True)
array([1117.,   85.,    2.,   49.,  591.])
>>> tsm.allele_frequency_spectrum([tsm.samples(0)], span_normalise=False, polarised=False)
array([1708.,  134.,    2.,    0.,    0.])

Looking at the joint AFS I'm surprised by what I see there:

>>> tsm.allele_frequency_spectrum([tsm.samples(i) for i in [0,1]], span_normalise=False, polarised=True)
array([[  0.,  91., 652.,   0., 374.],
       [ 85.,   0.,   0.,   0.,   0.],
       [  2.,   0.,   0.,   0.,   0.],
       [ 49.,   0.,   0.,   0.,   0.],
       [591.,   0.,   0.,   0.,   0.]])
>>> tsm.allele_frequency_spectrum([tsm.samples(i) for i in [0,1]], span_normalise=False, polarised=False)
array([[  0.,  91., 652.,   0., 965.],
       [ 85.,   0.,   0.,   0.,   0.],
       [  2.,   0.,   0.,   0.,   0.],
       [ 49.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.]])

Notice that the only slots that got added together were (4,0) and (0,4). However, this is all as expected - adjusting the parameters a bit so I get mutations in most of the slots:

Polarised:
[[ 0. 68.  7.  2.  0.]
 [64. 21. 12.  7.  1.]
 [15. 13.  5.  9.  3.]
 [ 5.  5.  9. 14.  6.]
 [ 1.  6.  8. 14.  0.]]
------------
Unpolarised:
[[ 0. 82. 15.  8.  1.]
 [70. 35. 21. 12.  0.]
 [18. 22.  5.  0.  0.]
 [ 6.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]

The rule for which half of the entries are kept in the unpolarised version is something like "entries with i + j <= n /2". (It should be the case that out of all entries that aren't distinguishable with unpolarised data, only one of them is nonzero.)

Is this answering the question?

sunyatin commented 3 months ago

Thank you so much Peter! I now get it, and it is now quite clear with the rule what msprime is doing for the "non-distinguishable counts". So this is a bit different compared to how folded AFSs are calculated, for instance in fastsimcoal or Arlequin. If this is not too cumbersome, could be worth explaining the folding more explicitly in the documentation for the multi-jAFS case, as some people for instance (like me ^^) might want to use msprime for simulating theoretical AFSs and afterwards compare them to empirical AFSs, which might therefore be produced with a slightly different folding procedure using softwares specialized in handling observed data (e.g. two-entry half-value for non-distinguishable counts).

Thanks a lot for your help :)

petrelharp commented 3 months ago

I agree, better docs would be nice. Hm, it currently says:

 If there is more than one sample set, the returned array is “lower triangular” in a similar way.

I think we left it at this because the precise specification of which cells are empty is not at all easy to explain (because of the edge cases, basically). Do you have any suggestions? (And, how do those other programs handle it?)