tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

Clarify which version of Fst we compute #858

Open jeromekelleher opened 4 years ago

jeromekelleher commented 4 years ago

In our docs for Fst we provide the formula for how it's computed, but don't explain which version of Fst this corresponds to. Is this one of the estimators (Hudson/Weir & Cockerham), or a "definitional" version?

ping @petrelharp

petrelharp commented 4 years ago

Looking at Bhatia et al, I think that what we've got is Nei's (1986) estimator. (I'd want to add a unit test comparing to the formula in Bhatia to verify that.) But, the branch statistic is defined in Slatkin 1991, and what we've got corresponds to the formula for Fst in Slatkin, who just cites Wright 1951 for it.

castedo commented 4 years ago

Rather than create a new issue I'll just hop on this one. Forgive me if I'm being annoying here. The Fst that is currently getting computed I doubt is the desired one. I can speculate that this is due to the confusing terminology of terms using the words "diversity" and "heterozygosity". I started a discussion about this in #961 and to me is seems a real problem with how the field of genetics uses these words.

I think any desirable interpretation of Fst should not take on negative values and should land between 0 and 1. I wouldn't be surprised that somebody has made some estimator that can go negative or go above 1. But I doubt you want to use one of those here. At the moment Fst is negative for a simple case of panmixia when it should be zero:

tables = tskit.TableCollection(sequence_length=1.0)
tables.nodes.set_columns(flags=[1,1,1,1,0,0], time=[0,0,0,0,1,2])
tables.edges.set_columns(left=[0,0,0,0,0], right=[1,1,1,1,1], parent=[4,4,5,5,5], child=[0,1,2,3,4])
tables.sites.add_row(position=0.5, ancestral_state='A')
tables.mutations.add_row(site=0, node=4, derived_state='a')
ts = tables.tree_sequence()
ts.Fst([[0,2],[1,3]])

results in array(-0.33333333)

If "diversity" in the documentation and code is changed from 'nucleotide diversity' to 'gene diversity' (see #961) then Fst will become zero in the above case of panmixia.

nspope commented 4 years ago

@castedo There's two issues here that I know of-- first, there's a lot of ambiguity regarding how Fst is defined in the population sense (e.g. when population allele frequencies are known). A recent-ish paper addressing this topic is Bhatia et al 2013. As you say, a sensible definition should give Fst in [0,1].

Second, for a given definition of Fst, there's the issue of estimating it from a finite sample, as we never know the population allele frequencies. This is often done via moments-based estimators, with a correction for finite-sample bias. That's the case for the estimator they propose in the Bhatia paper, and also the commonly used one from Reynolds, Weir, Cockerham 1983. The bias correction can make the estimator fall outside of [0,1].

I guess one way to circumvent the second issue would be to take the joint site frequency spectrum over a given tree sequence, and use hypergeometric projection to get unbiased estimators of the raw moments needed for a given definition of Fst. Similar to what is done in Ragsdale & Gravel 2019 to calculate various frequency-based statistics from haplotype frequency spectra. Then you should end up with an estimator that is both unbiased and respectful of bounds. I don't think I've seen this (projection-based approach) in the context of Fst, however.

Estimation of Fst has always been a can of worms. For most practical applications, negative Fst isn't really an issue, although it's conceptually distasteful.

castedo commented 4 years ago

With Fst changed to use specifically gene diversity and not nucleotide diversity, I think it can simply be clarified that it is the Wright 1949 F_ST model parameter (population sense) as long as it is clarified what is the probability model.

The current Fst code with the mentioned fix implies a probability model of choosing random sample nodes with equal probability that any one sample set gets chosen, regardless of how many nodes are in each sample set. So this means with samples set of different size, not all sample nodes are equally likely to be chosen. Nodes in larger sample sets are less likely to be chosen than nodes in smaller sample sets. This is in contrast to all samples being equally likely to be chosen no mater the size of each sample set.

use hypergeometric projection to get unbiased estimators of the raw moments

@nspope I wish I understood that, sounds awesome ... someday :)

negative Fst isn't really an issue, although it's conceptually distasteful.

There is also a desire to add F_IS and F_IT calculations and thus I suspect highly desirable to make sure

(1-F_IT) = (1-F_IS) * (1-F_ST)

holds.

ambiguity regarding how Fst is defined in the population sense

I've only started studying Fst this year but after reading all these papers, all the way back to Wright, it seems to me the main ambiguity is what underlying probability model has the F_ST model parameter or is being estimated. I've got a couple of old papers I've banged my head against where I think the "old" definition of F_ST is pretty clear. It's more people keep applying it in different ways and not being really clear what the underlying probability model is.

nspope commented 4 years ago

@castedo By hypergeometric projection I mean using hypergeometric sampling probabilities to map a set of dimension [N1,N2] (number of sample nodes in deme one and two) to a set of smaller dimension [n1,n2]. For example, this gives a linear map that "projects" a site frequency spectrum to a smaller dimension (when I say 'site frequency spectrum' I'm talking about sums of branch lengths, for branches that subtend a given number of sample nodes). In the case of Fst -- I'm thinking specifically of the definition in the Bhatia paper -- you can get the moments used in the numerator and denominator from SFS of dimension [2,2] and [1,1] respectively. So projection bypasses issues with unequal sample sizes across demes.

I think I'm bloating up this issue (and not taking the time to explain very clearly), but if this at all useful, I can give you a more concrete example elsewhere.

petrelharp commented 4 years ago

As you're pointing out... F_st is several cans of worms, because (a) there's more than one different thing it gets used to describe/estimate, (b) whether you think of it as a statistic or a population parameter we want an estimator of, and (c) if it's an estimator, how you do the bias correction.

The interpretation I like most is that relating mean TMRCAs within and between populations, as in Slatkin. That's easy, and basically identical to other methods when you're computing it over megabases with whole-genome sequence. As implemented we don't have an unbiased estimator of that because we're estimating a ratio by the ratio of two estimators. But, any differences will be neglegible for reasonably-sized windows. The fact that it does weird things on single sites doesn't bother me too much, because in any situation where it acts strangely you should really be looking at diversity and divergence separately.

I'm all for implementing some more Fst methods, by adding a method= argument, though - people do like their different measures of Fst!

I think we can define F_IS in the way that you say here, btw, since if T_I = mean TMRCA between chromosomes of an individual, T_S = mean TMRCA between chromosomes of a population, and T_T = mean TMRCA between chromosomes from the whole population, then

  T_I / T_T = (T_I / T_S) * (T_S / T_T)
jeromekelleher commented 4 years ago

I'm all for implementing some more Fst methods, by adding a method= argument, though - people do like their different measures of Fst!

Agreed!

castedo commented 4 years ago

Thank you @petrelharp and @nspope for these ideas and paper references. I'm adding them to my study reading/topic list.

I wish I has a better sense of what typical tskit users want to see in documentation and calculations. I'm afraid my sense is poor on this. However, for documentation or deciding what calculations to make, maybe a study activity I've been doing can be useful. I've been writing "Open Study Questions/Answers" like this one:

OSQ #1033: What is a formal definition of FST?

and I've been working out formal mathematical answers. I'm planning to do an update to the "Open Study Answer" (OSA) for F_ST and add more questions for F_IT, F_IS and probably heterozygosity and diversity.

Maybe it will be useful to lift parts of these answers into documentation in the future. If anyone would like to review them, provide feedback or collaborate on them please let me know.

I have not updated my OSA for F_ST, but I have an update in mind such that the T variables @petrelharp mentions correspond as such:

T_T ~ Var[A] T_S ~ E[Var(A|S)] T_I ~ E[Var(A|I)]

and for F_ST like this:

F_ST = Var[E(A|S)] / Var[A] F_IT = Var[E(A|I)] / Var[A] F_IS = Var[E(A|I)] / E[Var(A|S)] 1 - F_ST = E[Var(A|S)] / Var[A] 1 - F_IT = E[Var(A|I)] / Var[A] 1 - F_IS = E[Var(A|I)] / E[Var(A|S)]

where

A is a random vector of bits representing the allele states of sites on a random chromosome S is a random variable assigning chromosomes/gametes to sub-populations I is a random variable assigning chromosomes/gametes to individuals

and with a special choice of A it further follows that

Var[A] is gene diversity Var(A|S) is gene diversity per sub-population E[Var(A|S)] is 'expected heterozygosity' 2 Var(A|I) is heterozygosity per individual 2 E[Var(A|I)] is 'observed heterozygosity'

One of the many things I like about this formalism of F_ST is it corresponds exactly to the early "old-school" definitions of F_ST, like Wright's, before the decades when all kinds of variants were getting published. And it becomes exactly "variance (or diversity) explained by sub-populations status of chromosomes/gametes."

jeromekelleher commented 4 years ago

Lots of ink has been spilled over what Fst is/isn't/should be interpreted/etc, and our documentation isn't the place to settle these questions. I think the documentation should

There might a role for a stats tutorial of some sort (as part of our tskit tutorials site that we'll build "at some point"), where the differences between versions of stats like Fst are explored and explained via the tskit API, but I think we need to keep the core documentation focused on description.

gregorgorjanc commented 3 years ago

@castedo you might be interested in https://www.biorxiv.org/content/10.1101/083915v2 https://www.biorxiv.org/content/10.1101/083923v2 https://www.biorxiv.org/content/10.1101/653279v1

richard-Hobart commented 3 months ago

I have read all of the above, and have read the earlier papers by Nei and more recent ones (e.g. Bhatia et al.) and remain unsure of the version of Fst that is produced in tskit.

I can certainly follow the logic in Nei (1986): Fst = Dst/Ht and then redefined Fst' = Dst'/Ht = 1 - Hs/Ht'

Nei and Li (1979) is good for understanding Pi_x, Pi_y and Pi_xy (which tskit will call d(X), d(Y) and d(X,Y)

I would like to know how you then arrive at what is written in the tskit manual

Fst = 1 - 2 (d(X) + d(Y)) / (d(X) + 2 d(X, Y) + d(Y))

Is there a paper that explains the above formula?

petrelharp commented 3 months ago

The formula appears in Slatkin 1991: Screenshot from 2024-07-28 21-46-49 Now, f0 is "the probability of identity of two genes sampled from the same subpopulation". Let's interpret that to be the chance that if you pick one of the two pops at random and then pick two copies of the gene in that pop, that the two copies are the same. Similarly, let's say that f-bar is "pick a copy of the gene by picking a pop at random and then picking a gene copy from that pop; do that again; then f-bar is the probability the copies are the same". By these definitions, since for instance d(X) is the chance two copies of the gene from X differ,

1 - f0 = (d(X) + d(Y))/2

and (edited)

1 - f-bar = (d(X) + 2 * d(X,Y) + d(Y)) / 4

This gives the equation listed.

Note there is ambiguity in what "sampled at random" means. We've chosen a definition so that it doesn't depend on the sample sizes in the two populations, a desireable property.

sunyatin commented 2 months ago

Thank you all for this thread which was very useful to understand tskit implementation! Peter, just a confirmation: when you say

f-bar = (d(X) + 2 * d(X,Y) + d(Y)) / 4

shouldn't it be?

1 - f_bar = (d(X) + 2 * d(X,Y) + d(Y)) / 4

petrelharp commented 2 months ago

Hah, yes - thanks; I'll edit the above to correct that.

hyanwong commented 1 month ago

Just to note that https://www.biorxiv.org/content/10.1101/2024.09.24.614506v1 looks like a potentially useful new Fst reference.

hyanwong commented 4 days ago

Also https://www.biorxiv.org/content/10.1101/2024.10.28.620737v1: