jgx65 / hierfstat

the hierfstat package
24 stars 14 forks source link

pairwise.fst #6

Closed juance100 closed 7 years ago

juance100 commented 8 years ago
pairwise Fst estimates obtained with pairwise.fst() differ from those got with pp.fst()
> data(nancycats)
# convert nancycats to data.frame to be passed by pp.fst
> dfcats <- genind2df(nancycats)
# convert dfcats to numeric as required by pp.fst
> for(i in 2:10){dfcats[,i]<-as.numeric(as.character(dfcats[,i]))}
> dfcats[,1]<-as.numeric(dfcats[,1])
# estimate pairwise Fst using the functions pairwise.fst and pp.fst
> pairwise <- pairwise.fst(nancycats)
> pp <- pp.fst(dfcats)
# compare the estimates
> as.matrix(pairwise)[1:3,1:3]
           1         2          3
1 0.00000000 0.0801850 0.07140847
2 0.08018500 0.0000000 0.08200880
3 0.07140847 0.0820088 0.00000000
> pp$fst.pp[1:3,1:3]
     [,1]      [,2]       [,3]
[1,]  NaN 0.1307741 0.08459701
[2,]  NaN       NaN 0.13111707
[3,]  NaN       NaN        NaN

The estimates differ
thibautjombart commented 8 years ago

Another person flagged differences too. I checked the function quickly, and I can find no discrepancies between what the doc says and what the function actually does. Doc:

Let A and B be two populations of population sizes n_A and n_B,
     with expected heterozygosity (averaged over loci) Hs(A) and Hs(B),
     respectively. We denote Ht the expected heterozygosity of a
     population pooling A and B. Then, the pairwise Fst between A and B
     is computed as:
     Fst(A,B) = \frac{(Ht - (n_A Hs(A) + n_B Hs(B))/(n_A + n_B) )}{Ht}

function actually computing Fst:

  ## function to compute one Fst ##
    f1 <- function(pop1, pop2){ # pop1 and pop2 are genind obj. with a single pop each
        n1 <- nrow(pop1@tab)
        n2 <- nrow(pop2@tab)
        temp <- repool(pop1,pop2)
        b <- weighted.mean(Hs(temp), c(n1,n2)) # mean Hs is weighted for pop sizes
        pop(temp) <- NULL
        a <- Hs(temp)
        return((a-b)/a)
    }

a is the expected H pooling all individuals (averaged over loci), and b is the averaged expected H in the two 'populations', weighted for their respective sample sizes. Quickly checking, this looks compatible with eq 8 and 9 in http://www.pnas.org/content/70/12/3321.full.pdf

mijter commented 8 years ago

Some issues that I see with this code are: -Nei does not seem to weigh the populations by the sample size, the weighing he uses is simply 1/s where s is the number of populations. -Nei later noted that this way of calculating Fst has an implicit comparison of populations with themselves, giving it a downwards bias. This is especially notable when the number of populations is low, such as with pairwise comparisons. -The way Hs is calculated by adegenet does not use any correction for bias stemming from limited sample sizes.

To solve these issues, it would be best to use the equations from: Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York. These are actually also quite clearly explained in the Help-files from Fstat.

thibautjombart commented 8 years ago

Thanks - very useful. Some comments / questions:

@jgx65 do you have another implementation of pairwise Fst in hierfstat?

thierrygosselin commented 8 years ago

Here some useful reading for this... Meirmans, P. G., & Hedrick, P. W. (2011). Assessing population structure: FST and related measures. Molecular Ecology Resources, 11(1), 5–18. http://doi.org/10.1111/j.1755-0998.2010.02927.x

mijter commented 8 years ago

Hierfstat has the "pp.fst" function, that I suppose uses the approach from Yang 1998. For the Nei approach, one could make use of the "basic.stats" function. Given a hierfstat object "dat" one could use something like this:

pairwise.nei <- function(dat){
    fstmat <- array(0, c(20,20))
    for(a in 2:max(dat[,1])){
        for(b in 1:(a-1)){
            subdat = dat[dat[,1] == a | dat[,1] == b,]
            bs = basic.stats(subdat)
            fstmat[a,b] = fstmat[b,a] = bs$overall[8]
        }   
    }
    return(as.dist(fstmat))
}
jgx65 commented 8 years ago

hierfstat::pp.fst estimates paiwise fst using weir and cockerham (1984) estimator (I need to add this to the help page), whereas adegenet::pairwise.fst computes nei's pairwise Fst (from help page). But pairwise.nei suggested by @mijter does not give the same as adegenet::pairwise.fst either. @thibautjombart, do you remember which equation you have been using for pairwise.fst? I see that you are using a weighted mean of Hs, but Nei's explicitely says no weighting (either for Hs or allele frequencies in fact)

jgx65 commented 8 years ago

ps: @thibautjombart , the equations used for hierfstat::basic.stats are given in the help page of the function

juance100 commented 8 years ago

Nei explicitely recommended not weighting to avoid bias associated with unbalanced sampling sizes.

jgx65 commented 8 years ago

Indeed, and this is at the heart of the differences between estimators of FST devised by Nei on the one hand, and Weir and Cockerham on the other

jgx65 commented 8 years ago

After checking, adegenet::pairwise.fst computes pairwise FST from uncorrected HS (i.e., complement to one of the sum of squared allelic frequencies. hierfstat::basic.stats uses Nei's 1987 estimators of Hs, accounting for sample size and observed heterozygosity, in addition to using unweighted Hs. My recommendation would be to use pp.fst if on wants weir and cockerham estimator, and a wrapper like the code suggested by @mijter above if one wants Nei's pairwise.

thibautjombart commented 8 years ago

It may be problematic to have both estimators in the same package.. @jgx65 what do you think, shall we just change pairwise.fst into a wrapper for pp.fst, with a message about the function being deprecated?

juance100 commented 8 years ago

The suggestion by thibautjombart seems to be the simplest. An alternative might be keeping only one function (pp.fst) but giving the chance to choose the method.

thibautjombart commented 8 years ago

I'd be keen to keep pairwise.fst as deprecated for backward compatibility.

zkamvar commented 8 years ago

Since the issue is the calculation of Hs, one potential solution might be to add Hs and Ht functions passed as an argument.

For example:


## weighted mean of Heterozygosity ##
Hsw <- function(temp){
    pop_sizes <- as.vector(table(pop(temp))
    return(weighted.mean(Hs(temp), pop_sizes))
}

## function to compute one Fst ##
f1 <- function(pop1, pop2, HSFUN, HTFUN){ # pop1 and pop2 are genind obj. with a single pop each
    temp <- repool(pop1,pop2)
    b <- HSFUN(temp)
    pop(temp) <- NULL
    a <- HTFUN(temp)
    return((a-b)/a)
}

f1(pop1, pop2, HSFUN = Hsw, HTFUN = Hs) # using weighted mean
f1(pop1, pop2, HSFUN = Hs, HTFUN = Hs)  # using 1 - sum(pi^2)

This would also ensure that any other possible estimator for Hs can be dropped in without rewriting. An example of a use for this beyond Nei vs. Weir would be calculating Hs for clonal organisms using allele frequencies calculated with round-robin clone-censoring: http://onlinelibrary.wiley.com/doi/10.1111/j.1365-294X.2007.03535.x/abstract

thibautjombart commented 8 years ago

I like this option, but there is a bit of a design issue. Functions passed as argument should all return the same type of output. Here, Hs returns a vector of Hs values (one per pop), while Hsw returns a single value.

A tweak would be:

jgx65 commented 8 years ago

Not as straightforward as it looks, the implementation for WC84 and Nei87 are a bit trickier than a simple HS function. But I like the idea of having a single function with several options. we could have a type argument in the call to the function, which can fork to wc84, nei87 or any other implementation. What do you think?

mijter commented 8 years ago

I find this a very elegant solution. It is also possible to implement Nei73, Nei87, and WC84 without writing any new code for the actual estimation as this can all be done using the old code, basic.stats and pp.fst.

thibautjombart commented 8 years ago

I love this idea! Happy to help writing it natively for genind too. Are we aiming for a function doing solely pairwise Fst, or any Fst with an option for pairwise comparisons?

jgx65 commented 8 years ago

There is a function hierfstat::genet.dist already providing 8 different genetic distances. Since WC84 and Nei87 when calculated pairwise are nothing else than another brand of genetic distances, what do you think of adding these 2 distances as options in genet.dist, and leaving the old function hierfstat::pp.fst as another way to access WC84?

jgx65 commented 8 years ago

genet.dist now has 10 distances, the last two being Nei87 and WC84.

thierrygosselin commented 8 years ago

Really nice! Now I need to finish my function to have haplo2hierfstat for GBS data.

On Sep 11, 2015, at 09:07, Jerome Goudet notifications@github.com wrote:

genet.dist now has 10 distances, the last two being Nei87 and WC84.

— Reply to this email directly or view it on GitHub https://github.com/jgx65/hierfstat/issues/6#issuecomment-139540905.

jgx65 commented 8 years ago

Should I close this issue now, or is there still something that you'd like to be done @thibautjombart @juance100 ?

thibautjombart commented 8 years ago

Please leave it open for now - when I get a second I'll either change pairwise.fst into a wrapper for genet.dist, or remove it altogether and amend adegenet and the doc accordingly.

juance100 commented 8 years ago

I have noticed that the diagonal elements of the matrix produced by genet.dist() with the method "Ds" are not 0, and pairwise distances differ from those produced by dist.genpop() for method 1.

jgx65 commented 8 years ago

Thanks for the report, should be fixed. Can you check please? Thanks.

juance100 commented 8 years ago

There are still some issues with the methods "WC84" and "Nei87". In particular, dist.genet() with "Nei87" gives several NaN results and the remaining estimates are different from those got with pairwise.fst. I also observed that neither "WC84" nor "Nei87" works properly with haploid data

jgx65 commented 8 years ago

@juance100 could you give specific examples? As for the differences with pairwise $F_{ST}$, it has been discussed in this thread, there are issues with the way pairwise.fst weight the different samples. Thanks.

juance100 commented 8 years ago

Thanks for your answer. As already discussed in this thread, pairwise.fst(genind) should give the same results as genet.dist(dataframe, method="Nei87"), provided the data sets contain the same information. However see the following script:

data(nancycats) dfcats <- genind2df(nancycats) dfcats <- data.frame(sapply(dfcats, as.numeric)

Here the genetic information is the same, but dfcats is a dataframe prepared for hierfstat and nancycats is a genind object.

pairwise1 <- genet.dist(dfcats, method="Nei87") pairwise2 <- pairwise.fst(nancycats)

pairwise1 and pairwise2 should give the same results, but compare these partial results:

as.matrix(pairwise1)[11:16, 11:17] 11 12 13 14 15 16 17 11 0.0000 0.0803 0.1030 0.0452 0.0592 0.0645 NaN 12 0.0803 0.0000 0.1185 0.0794 0.1029 0.1185 NaN 13 0.1030 0.1185 0.0000 0.1005 0.1042 0.1239 0.1318 14 0.0452 0.0794 0.1005 0.0000 0.0569 0.0849 NaN 15 0.0592 0.1029 0.1042 0.0569 0.0000 0.0972 NaN 16 0.0645 0.1185 0.1239 0.0849 0.0972 0.0000 NaN

as.matrix(pairwise2)[11:16, 11:17] 11 12 13 14 15 16 17 11 0.00000000 0.05157358 0.06587454 0.03764621 0.04403060 0.04604060 0.043564731 12 0.05157358 0.00000000 0.08242185 0.06028268 0.07498297 0.08376789 0.003096283 13 0.06587454 0.08242185 0.00000000 0.06685364 0.07575865 0.08684894 0.061348922 14 0.03764621 0.06028268 0.06685364 0.00000000 0.04474257 0.05973898 0.051303403 15 0.04403060 0.07498297 0.07575865 0.04474257 0.00000000 0.07316692 0.051783883 16 0.04604060 0.08376789 0.08684894 0.05973898 0.07316692 0.00000000 0.073406746

At first I thought that the problem might be caused by missing data in population 17, however differences are also observed in all fst estimates.

jgx65 commented 8 years ago

Issue with NA fixed (basic.stats was not trapping an Inf):

data(nancycats) dfcats <- genind2df(nancycats) dfcats <- data.frame(sapply(dfcats, as.numeric)) dist1<-genet.dist(dfcats,method="Nei87") dist2<-pairwise.fst(nancycats) as.matrix(dist2)[14:16,14:17]

       14         15         16  17

14 0.00000000 0.04474257 0.05973898 NaN 15 0.04474257 0.00000000 0.07316692 NaN 16 0.05973898 0.07316692 0.00000000 NaN

as.matrix(dist1)[14:16,14:17]

   14     15     16     17

14 0.0000 0.0569 0.0849 0.0910 15 0.0569 0.0000 0.0972 0.1001 16 0.0849 0.0972 0.0000 0.1535

(Funnily enough, on my computer, it is pairwise.fstgiving NAs now!). The issue indeed comes from one locus not typed in population 17.

For your second point (fst estimates not equal), this was discuss earlier in this thread. Basically, method "Nei87" corrects for sample size and number of samples, while paiwise.fstdoes not.

With equal sample sizes, the relation is not one to one, but at least it is linear:

dat<-sim.genot(nbloc=10,nbpop=10) d1<-genet.dist(dat,method="Nei87") datgi<-df2genind(dat[,-1],pop=dat[,1],sep="")

d2<-pairwise.fst(datgi) plot(as.vector(d1),as.vector(d2)) abline(c(0,1))

juance100 commented 8 years ago

A funny behavior of genet.dist with methods "Nei87" and "WC84" occurs (at list for haploid data) when the populations in the dataframe are not ordered. Here you have a partial view of a "PA" dataset, where populations are not ordered:

hierchil[1:5,1:5] Reg pob loc1 loc2 loc3 1 3 2 1 1 1 2 3 2 1 1 1 3 3 2 1 1 0 4 3 2 1 1 1 5 3 2 1 1 1

hierchil[135:140,1:5] Reg pob loc1 loc2 loc3 135 1 8 1 1 1 136 1 8 1 1 1 137 1 8 1 1 1 138 1 8 1 1 1 139 1 8 1 1 1 140 1 8 1 1 1

The results obtained with genet.dist where

genet.dist(hierchil[,-1], diploid=F, method="WC84") Error in as.matrix(pp.sigma.loc(unique(Pop)[i], unique(Pop)[j], dat, diploid, : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error in rep(1:nl, alploc) : argumento 'times' inválido genet.dist(hierchil[,-1], diploid=F, method="Nei87") 2 7 9 5 3 4 1 6 7 0.0828
9 NA 0.0734
5 NA 0.0882 0.0593
3 NA 0.2255 0.2412 0.2905
4 NA 0.0614 0.0818 0.1209 0.2433
1 NA 0.0798 0.0930 0.1753 0.2495 0.0834
6 NA 0.1408 0.0756 0.0831 0.1792 0.1200 0.1026
8 NA 0.1123 0.1023 0.1205 0.2333 0.0429 0.1020 0.1160

However, if the dataset is ordered the problem is solved:

genet.dist(hierchil[order(hierchil[,2]),-1], diploid=F, method="Nei87") 1 2 3 4 5 6 7 8 2 0.0828
3 0.1311 0.0734
4 0.1530 0.0882 0.0593
5 0.2017 0.2255 0.2412 0.2905
6 0.0893 0.0614 0.0818 0.1209 0.2433
7 0.1007 0.0798 0.0930 0.1753 0.2495 0.0834
8 0.1120 0.1408 0.0756 0.0831 0.1792 0.1200 0.1026
9 0.0683 0.1123 0.1023 0.1205 0.2333 0.0429 0.1020 0.1160 genet.dist(hierchil[order(hierchil[,2]),-1], diploid=F, method="WC84") 1 2 3 4 5 6 7 8 2 0.08379581
3 0.13066940 0.07399586
4 0.14935276 0.08789807 0.05844187
5 0.18338896 0.21793496 0.22910750 0.28572331
6 0.08927498 0.06332823 0.08213510 0.11904226 0.22555663
7 0.10088256 0.08233133 0.09346461 0.17284896 0.22989331 0.08339520
8 0.11223669 0.14411704 0.07603829 0.08138786 0.16235048 0.12001909 0.10263345
9 0.06773678 0.11349548 0.10229576 0.11964241 0.22182280 0.04286540 0.10197628 0.11597602

The behavior of pp.fst is the same as genet.dist with "WC84)"

jgx65 commented 8 years ago

Thanks for this. Quick fix is genet.dist(dat[order(dat[,1]),],diploid=F,method="WC84"). On Github

juance100 commented 8 years ago

Fine. The same fix applies to pp.fst(dat[order(dat[,1]),] diploid=F). Regards