jgx65 / hierfstat

the hierfstat package
24 stars 14 forks source link

tests lead to g.star=0 on all permutations #28

Closed nehasavant closed 5 years ago

nehasavant commented 5 years ago

Hello -

I've run varcomp.glob() successfully on my SNP data with three levels (prov, pop, site), but I'm having trouble testing the different levels. I've tried all tests at nperm=100, but each g.star value is 0 for all permutations and the p-value = 1. I'm not sure what is going wrong.

varcomp.glob(levels=levels1, loci=loci, diploid = TRUE) test.within(loci, test=site, within = pop, nperm=100) test.between.within(data.frame(loci), within = prov, rand.unit = site, test.lev = pop, nperm=100) test.between(loci, rand.unit=pop, test=prov, nperm=100)

Any advice?

jgx65 commented 5 years ago

The syntax looks correct. Difficult to answer without a toy example, could you post an example? Do you get meaningful results with a subset of loci?

nehasavant commented 5 years ago

I ran the tests with a subset of 100 SNPs and it worked, I got plausible g-star values and a p-value. Thanks for the suggestion! Seems like my computer couldn't handle testing ~7k SNPs and 221 individuals.

Unfortunately, it seems that my processor/RAM combo can only handle testing 100 SNPs (anything higher and I get meaningless g-star/p-values), but the varcomp.glob f-stat coefficients at 100 SNPs are 0.02 to 0.05 off from the full dataset coefficients. Do you have suggestions for a minimum number of SNPs for accurate subsetting?

jgx65 commented 5 years ago

I cannot reproduce the behavior you described with simulated data. For instance with 1000 SNPs:

library(hierfstat)
dat<-sim.genot(size=100,nbloc=1000,nbal=2,nbpop=5,N=10000,mig=0.001,mut=1e-06) 

wc(dat) 
$`FST` [1] 0.02529191  
$FIS [1] -0.001185896  

t.b<-test.between(dat[,-1],test.lev=dat[,1],rand.unit=rep(1:50,each=10)) 
t.w<-test.within(dat[,-1],within=dat[,1],test.lev=rep(1:50,each=10)) 

str(t.w) 
List of 2  $ g.star: num [1:100] 69845 69784 69028 69895 70473 ...  $ p.val : num 0.53 

str(t.b) 
List of 2  $ g.star: num [1:100] 4939 5468 5380 5206 6102 ...  $ p.val : num 0.01

I need a reproducible example to investigate further. Could you send one?

jgx65 commented 5 years ago

@nehasavant any heads up / progress? Can I close this issue? Cheers

nehasavant commented 5 years ago

Hi - Sorry for the delay. I'm not sure how to send a reproducible example without sending you my actual files. One difference from your example dataset is that I have some missing data (~3.4%).

jgx65 commented 5 years ago

Indeed, with 7k SNPs and 3-4% missing data, it is very likely that all individuals have at least one missing genotype, and the core of the testing is through g.stats.glob, which filters out any individual with a missing genotype. As it is difficult /unclear how to impute properly missing genotypes, I recommend in this situation to use bootstrapping over loci (e.g. boot.vc). If the CI does not overlap with 0, you can consider the statistic to differ significantly from 0.

nehasavant commented 5 years ago

Thank you for this recommendation. The boot.vc worked well, even when using all ~7k SNPS:

boot_all_1000 <- boot.vc(levels=levels[, -1],loci=data2[, -1], diploid=TRUE, nboot=1000, quant=c(0.025,0.5,0.975))

Within boot_all_1000$ci, I found that for all pairings the CI does not overlap with 0. To confirm, this means that all statistics at all levels are significantly different from zero and therefore levels 1, 2 and 3 effect on genetic structure are all significant?

This is interesting since with the subset of 100 loci, I found that only level 3 (which was tested with test.within()) had a significant effect on genetic structure. I'm assuming more loci and permutations were needed.

jgx65 commented 5 years ago

Sounds correct. Difficult to say more without seeing the data. Best wishes