jgx65 / hierfstat

the hierfstat package
24 stars 14 forks source link

How to Interpret Negative Fsts #64

Closed alexkrohn closed 1 year ago

alexkrohn commented 1 year ago

I am attempting to determine whether the pairwise FST observed between two populations is greater than one would expect by chance. One way I thought to do this would be to generate a "null distribution" of FST values, then compare the actual FST value to the null distribution. If the real value is higher than the 95% confidence interval of the null distribution, I could conclude the real value is larger than one would expect by chance. (If there is a simpler way to test this, I'm open to suggestions!)

For example, in myvcf I have 10 individuals, 5 from one population and 5 from another, in that order. To generate the null distribution and confidence interval, I do:

pop.vector <- rep(c("pop1",5), c("pop2",5))

list.of.population.vectors <- list()

# Create 100 random combinations of 5 individuals in pop1 and 5 individuals in pop2
for(i in 1:100){
list.of.population.vectors[[i]] <- sample(pop.vector)
}

# Calculate fst 100 times
fsts <- lapply(list.of.population.vectors, function(x) fst.dosage(myvcf, x)[3]) %>% unlist()

# Calculate the 95% confidence interval
t.test(fsts)

# Calculate the real FST
real.fst <- fst.dosage(myvcf, pop.vector)[3]

In my actual dataset, the real FST value is 0.002. However, the 95% confidence interval goes from -0.005 to -0.0007.

Biologically, I know that even in the real populations, there is very very little differentiation. However, I'm still curious how to interpret those negative FST values. Is it just sampling error with small population sizes and many alleles? (The dataset is ~30k SNPs from RADseq.) Should a negative FST value be interpreted as 0, or better as NA?

Thanks for your help!

jgx65 commented 1 year ago

Hi @alexkrohn, thanks for your question. If the parameter Fst is truly 0, then its estimates might be slightly negative. Your permutation test is sound, I would simply count the number of randomized Fst larger or equal to the observed to obtain the p -value, and/or calculate the 95% empirical CI (quantile(nfst,c(0.025,0.975)) rather than the parametric one :

#simulate 2 pops with Nm=100, 5 inds in each pop 30k SNPs
dat<-sim.genot(size=5,nbal=2,nbpop=2,nbloc=30000,N=10000,mig=0.01)
#transform data to dosage
dos<-biall2dos(dat[,-1])
#estimate true fst
(tfst<-fst.dosage(dos,rep(1:2,each=5))[3])
#empirical null fst distribution
nfst<-replicate(999,fst.dosage(dos,sample(rep(1:2,each=5)))[3])
#associated p-value
mean(c(nfst,tfst)>=tfst)
#empirical 95% ci of null dist for Fst
quantile(nfst,c(0.025,0.975))

#redo with Nm=1000
dat<-sim.genot(size=5,nbal=2,nbpop=2,nbloc=30000,N=10000,mig=0.1)
dos<-biall2dos(dat[,-1])
#estimate true fst
(tfst<-fst.dosage(dos,rep(1:2,each=5))[3])
#empirical null fst distribution
nfst<-replicate(999,fst.dosage(dos,sample(rep(1:2,each=5)))[3])
#associated p-value
mean(c(nfst,tfst)>=tfst)
#empirical 95% ci of null dist for Fst
quantile(nfst,c(0.025,0.975))
alexkrohn commented 1 year ago

Thanks for your help with this test. I appreciate the understanding that negatives should be treated as 0.