Closed juance100 closed 8 years ago
Another person flagged differences too. I checked the function quickly, and I can find no discrepancies between what the doc says and what the function actually does. Doc:
Let A and B be two populations of population sizes n_A and n_B,
with expected heterozygosity (averaged over loci) Hs(A) and Hs(B),
respectively. We denote Ht the expected heterozygosity of a
population pooling A and B. Then, the pairwise Fst between A and B
is computed as:
Fst(A,B) = \frac{(Ht - (n_A Hs(A) + n_B Hs(B))/(n_A + n_B) )}{Ht}
function actually computing Fst:
## function to compute one Fst ##
f1 <- function(pop1, pop2){ # pop1 and pop2 are genind obj. with a single pop each
n1 <- nrow(pop1@tab)
n2 <- nrow(pop2@tab)
temp <- repool(pop1,pop2)
b <- weighted.mean(Hs(temp), c(n1,n2)) # mean Hs is weighted for pop sizes
pop(temp) <- NULL
a <- Hs(temp)
return((a-b)/a)
}
a
is the expected H pooling all individuals (averaged over loci), and b
is the averaged expected H in the two 'populations', weighted for their respective sample sizes. Quickly checking, this looks compatible with eq 8 and 9 in http://www.pnas.org/content/70/12/3321.full.pdf
Some issues that I see with this code are: -Nei does not seem to weigh the populations by the sample size, the weighing he uses is simply 1/s where s is the number of populations. -Nei later noted that this way of calculating Fst has an implicit comparison of populations with themselves, giving it a downwards bias. This is especially notable when the number of populations is low, such as with pairwise comparisons. -The way Hs is calculated by adegenet does not use any correction for bias stemming from limited sample sizes.
To solve these issues, it would be best to use the equations from: Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York. These are actually also quite clearly explained in the Help-files from Fstat.
Thanks - very useful. Some comments / questions:
@jgx65 do you have another implementation of pairwise Fst in hierfstat?
Here some useful reading for this... Meirmans, P. G., & Hedrick, P. W. (2011). Assessing population structure: FST and related measures. Molecular Ecology Resources, 11(1), 5–18. http://doi.org/10.1111/j.1755-0998.2010.02927.x
Hierfstat has the "pp.fst" function, that I suppose uses the approach from Yang 1998. For the Nei approach, one could make use of the "basic.stats" function. Given a hierfstat object "dat" one could use something like this:
pairwise.nei <- function(dat){
fstmat <- array(0, c(20,20))
for(a in 2:max(dat[,1])){
for(b in 1:(a-1)){
subdat = dat[dat[,1] == a | dat[,1] == b,]
bs = basic.stats(subdat)
fstmat[a,b] = fstmat[b,a] = bs$overall[8]
}
}
return(as.dist(fstmat))
}
hierfstat::pp.fst estimates paiwise fst using weir and cockerham (1984) estimator (I need to add this to the help page), whereas adegenet::pairwise.fst computes nei's pairwise Fst (from help page). But pairwise.nei suggested by @mijter does not give the same as adegenet::pairwise.fst either. @thibautjombart, do you remember which equation you have been using for pairwise.fst? I see that you are using a weighted mean of Hs, but Nei's explicitely says no weighting (either for Hs or allele frequencies in fact)
ps: @thibautjombart , the equations used for hierfstat::basic.stats are given in the help page of the function
Nei explicitely recommended not weighting to avoid bias associated with unbalanced sampling sizes.
Indeed, and this is at the heart of the differences between estimators of FST devised by Nei on the one hand, and Weir and Cockerham on the other
After checking, adegenet::pairwise.fst computes pairwise FST from uncorrected HS (i.e., complement to one of the sum of squared allelic frequencies. hierfstat::basic.stats uses Nei's 1987 estimators of Hs, accounting for sample size and observed heterozygosity, in addition to using unweighted Hs. My recommendation would be to use pp.fst if on wants weir and cockerham estimator, and a wrapper like the code suggested by @mijter above if one wants Nei's pairwise.
It may be problematic to have both estimators in the same package.. @jgx65 what do you think, shall we just change pairwise.fst
into a wrapper for pp.fst
, with a message about the function being deprecated?
The suggestion by thibautjombart seems to be the simplest. An alternative might be keeping only one function (pp.fst) but giving the chance to choose the method.
I'd be keen to keep pairwise.fst
as deprecated for backward compatibility.
Since the issue is the calculation of Hs, one potential solution might be to add Hs and Ht functions passed as an argument.
For example:
## weighted mean of Heterozygosity ##
Hsw <- function(temp){
pop_sizes <- as.vector(table(pop(temp))
return(weighted.mean(Hs(temp), pop_sizes))
}
## function to compute one Fst ##
f1 <- function(pop1, pop2, HSFUN, HTFUN){ # pop1 and pop2 are genind obj. with a single pop each
temp <- repool(pop1,pop2)
b <- HSFUN(temp)
pop(temp) <- NULL
a <- HTFUN(temp)
return((a-b)/a)
}
f1(pop1, pop2, HSFUN = Hsw, HTFUN = Hs) # using weighted mean
f1(pop1, pop2, HSFUN = Hs, HTFUN = Hs) # using 1 - sum(pi^2)
This would also ensure that any other possible estimator for Hs can be dropped in without rewriting. An example of a use for this beyond Nei vs. Weir would be calculating Hs for clonal organisms using allele frequencies calculated with round-robin clone-censoring: http://onlinelibrary.wiley.com/doi/10.1111/j.1365-294X.2007.03535.x/abstract
I like this option, but there is a bit of a design issue. Functions passed as argument should all return the same type of output. Here, Hs
returns a vector of Hs values (one per pop), while Hsw
returns a single value.
A tweak would be:
pairwise.fst
Hs
, triggered by an argument correct=TRUE
by defaultpairwise.fst
Not as straightforward as it looks, the implementation for WC84 and Nei87 are a bit trickier than a simple HS function. But I like the idea of having a single function with several options. we could have a type argument in the call to the function, which can fork to wc84, nei87 or any other implementation. What do you think?
I find this a very elegant solution. It is also possible to implement Nei73, Nei87, and WC84 without writing any new code for the actual estimation as this can all be done using the old code, basic.stats and pp.fst.
I love this idea! Happy to help writing it natively for genind
too.
Are we aiming for a function doing solely pairwise Fst, or any Fst with an option for pairwise comparisons?
There is a function hierfstat::genet.dist
already providing 8 different genetic distances. Since WC84 and Nei87 when calculated pairwise are nothing else than another brand of genetic distances, what do you think of adding these 2 distances as options in genet.dist
, and leaving the old function hierfstat::pp.fst
as another way to access WC84?
genet.dist
now has 10 distances, the last two being Nei87 and WC84.
Really nice! Now I need to finish my function to have haplo2hierfstat for GBS data.
On Sep 11, 2015, at 09:07, Jerome Goudet notifications@github.com wrote:
genet.dist now has 10 distances, the last two being Nei87 and WC84.
— Reply to this email directly or view it on GitHub https://github.com/jgx65/hierfstat/issues/6#issuecomment-139540905.
Should I close this issue now, or is there still something that you'd like to be done @thibautjombart @juance100 ?
Please leave it open for now - when I get a second I'll either change pairwise.fst
into a wrapper for genet.dist
, or remove it altogether and amend adegenet and the doc accordingly.
I have noticed that the diagonal elements of the matrix produced by genet.dist() with the method "Ds" are not 0, and pairwise distances differ from those produced by dist.genpop() for method 1.
Thanks for the report, should be fixed. Can you check please? Thanks.
There are still some issues with the methods "WC84" and "Nei87". In particular, dist.genet() with "Nei87" gives several NaN results and the remaining estimates are different from those got with pairwise.fst. I also observed that neither "WC84" nor "Nei87" works properly with haploid data
@juance100 could you give specific examples? As for the differences with pairwise $F_{ST}$, it has been discussed in this thread, there are issues with the way pairwise.fst weight the different samples. Thanks.
Thanks for your answer. As already discussed in this thread, pairwise.fst(genind) should give the same results as genet.dist(dataframe, method="Nei87"), provided the data sets contain the same information. However see the following script:
data(nancycats) dfcats <- genind2df(nancycats) dfcats <- data.frame(sapply(dfcats, as.numeric)
Here the genetic information is the same, but dfcats is a dataframe prepared for hierfstat and nancycats is a genind object.
pairwise1 <- genet.dist(dfcats, method="Nei87") pairwise2 <- pairwise.fst(nancycats)
pairwise1 and pairwise2 should give the same results, but compare these partial results:
as.matrix(pairwise1)[11:16, 11:17] 11 12 13 14 15 16 17 11 0.0000 0.0803 0.1030 0.0452 0.0592 0.0645 NaN 12 0.0803 0.0000 0.1185 0.0794 0.1029 0.1185 NaN 13 0.1030 0.1185 0.0000 0.1005 0.1042 0.1239 0.1318 14 0.0452 0.0794 0.1005 0.0000 0.0569 0.0849 NaN 15 0.0592 0.1029 0.1042 0.0569 0.0000 0.0972 NaN 16 0.0645 0.1185 0.1239 0.0849 0.0972 0.0000 NaN
as.matrix(pairwise2)[11:16, 11:17] 11 12 13 14 15 16 17 11 0.00000000 0.05157358 0.06587454 0.03764621 0.04403060 0.04604060 0.043564731 12 0.05157358 0.00000000 0.08242185 0.06028268 0.07498297 0.08376789 0.003096283 13 0.06587454 0.08242185 0.00000000 0.06685364 0.07575865 0.08684894 0.061348922 14 0.03764621 0.06028268 0.06685364 0.00000000 0.04474257 0.05973898 0.051303403 15 0.04403060 0.07498297 0.07575865 0.04474257 0.00000000 0.07316692 0.051783883 16 0.04604060 0.08376789 0.08684894 0.05973898 0.07316692 0.00000000 0.073406746
At first I thought that the problem might be caused by missing data in population 17, however differences are also observed in all fst estimates.
Issue with NA fixed (basic.stats
was not trapping an Inf):
data(nancycats) dfcats <- genind2df(nancycats) dfcats <- data.frame(sapply(dfcats, as.numeric)) dist1<-genet.dist(dfcats,method="Nei87") dist2<-pairwise.fst(nancycats) as.matrix(dist2)[14:16,14:17]
14 15 16 17
14 0.00000000 0.04474257 0.05973898 NaN 15 0.04474257 0.00000000 0.07316692 NaN 16 0.05973898 0.07316692 0.00000000 NaN
as.matrix(dist1)[14:16,14:17]
14 15 16 17
14 0.0000 0.0569 0.0849 0.0910 15 0.0569 0.0000 0.0972 0.1001 16 0.0849 0.0972 0.0000 0.1535
(Funnily enough, on my computer, it is pairwise.fst
giving NAs now!). The issue indeed comes from one locus not typed in population 17.
For your second point (fst estimates not equal), this was discuss earlier in this thread. Basically, method "Nei87" corrects for sample size and number of samples, while paiwise.fst
does not.
With equal sample sizes, the relation is not one to one, but at least it is linear:
dat<-sim.genot(nbloc=10,nbpop=10) d1<-genet.dist(dat,method="Nei87") datgi<-df2genind(dat[,-1],pop=dat[,1],sep="")
d2<-pairwise.fst(datgi) plot(as.vector(d1),as.vector(d2)) abline(c(0,1))
A funny behavior of genet.dist with methods "Nei87" and "WC84" occurs (at list for haploid data) when the populations in the dataframe are not ordered. Here you have a partial view of a "PA" dataset, where populations are not ordered:
hierchil[1:5,1:5] Reg pob loc1 loc2 loc3 1 3 2 1 1 1 2 3 2 1 1 1 3 3 2 1 1 0 4 3 2 1 1 1 5 3 2 1 1 1
hierchil[135:140,1:5] Reg pob loc1 loc2 loc3 135 1 8 1 1 1 136 1 8 1 1 1 137 1 8 1 1 1 138 1 8 1 1 1 139 1 8 1 1 1 140 1 8 1 1 1
The results obtained with genet.dist where
genet.dist(hierchil[,-1], diploid=F, method="WC84") Error in as.matrix(pp.sigma.loc(unique(Pop)[i], unique(Pop)[j], dat, diploid, : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error in rep(1:nl, alploc) : argumento 'times' inválido genet.dist(hierchil[,-1], diploid=F, method="Nei87") 2 7 9 5 3 4 1 6 7 0.0828
9 NA 0.0734
5 NA 0.0882 0.0593
3 NA 0.2255 0.2412 0.2905
4 NA 0.0614 0.0818 0.1209 0.2433
1 NA 0.0798 0.0930 0.1753 0.2495 0.0834
6 NA 0.1408 0.0756 0.0831 0.1792 0.1200 0.1026
8 NA 0.1123 0.1023 0.1205 0.2333 0.0429 0.1020 0.1160
However, if the dataset is ordered the problem is solved:
genet.dist(hierchil[order(hierchil[,2]),-1], diploid=F, method="Nei87") 1 2 3 4 5 6 7 8 2 0.0828
3 0.1311 0.0734
4 0.1530 0.0882 0.0593
5 0.2017 0.2255 0.2412 0.2905
6 0.0893 0.0614 0.0818 0.1209 0.2433
7 0.1007 0.0798 0.0930 0.1753 0.2495 0.0834
8 0.1120 0.1408 0.0756 0.0831 0.1792 0.1200 0.1026
9 0.0683 0.1123 0.1023 0.1205 0.2333 0.0429 0.1020 0.1160 genet.dist(hierchil[order(hierchil[,2]),-1], diploid=F, method="WC84") 1 2 3 4 5 6 7 8 2 0.08379581
3 0.13066940 0.07399586
4 0.14935276 0.08789807 0.05844187
5 0.18338896 0.21793496 0.22910750 0.28572331
6 0.08927498 0.06332823 0.08213510 0.11904226 0.22555663
7 0.10088256 0.08233133 0.09346461 0.17284896 0.22989331 0.08339520
8 0.11223669 0.14411704 0.07603829 0.08138786 0.16235048 0.12001909 0.10263345
9 0.06773678 0.11349548 0.10229576 0.11964241 0.22182280 0.04286540 0.10197628 0.11597602
The behavior of pp.fst is the same as genet.dist with "WC84)"
Thanks for this. Quick fix is genet.dist(dat[order(dat[,1]),],diploid=F,method="WC84")
. On Github
Fine. The same fix applies to pp.fst(dat[order(dat[,1]),] diploid=F). Regards