jgx65 / hierfstat

the hierfstat package
24 stars 14 forks source link

estimated Fis value outside of bootstrapped CIs #49

Closed casparbein closed 3 years ago

casparbein commented 3 years ago

Hello,

I have been using hierfstat to calculate Fis for several populations. But for many populations, the values fis.dosage() gives me lie outside the confidence intervals I get from boot.ppfis().

My code looks like this: test <- read.vcfR("~/40miss_all_filtOneSNP.vcf", verbose = TRUE) test <- vcfR2genind(test) pops <- c(rep(1, 4), rep(2, 10), rep(3, 2), rep(4,9), rep(5, 10), rep(6, 6), rep(7,4), rep(8,9), rep(9,3), rep(10,1), rep(11, 12)) test <- genind2hierfstat(test, pop = pops) boot_fis <- boot.ppfis(test, nboot=10000) dos <- fstat2dos(test, diploid = T) fis <- fis.dosage(dos, pop = pops)

The output in turn looks like this: Confidence intervals: $fis.ci ll hl 1 0.0713 0.0928 2 0.1122 0.1268 3 -0.5400 -0.4413 4 0.1215 0.1379 5 0.1059 0.1194 6 0.1076 0.1271 7 0.1167 0.1432 8 0.1298 0.1465 9 -0.2841 -0.1763 10 -Inf -Inf ## since there is only one ind in this population, this makes sense 11 0.0825 0.0955

values computed by fis.dosage: 1 0.07552508
2 0.10268142
3 0.06572209
4 0.10293569
5 0.09418226
6 0.10427239
7 0.16800170 8 0.11198415
9 -0.18332626
10 NaN
11 0.07371901
All 0.07156975

My dataset consists of around 23000 SNPs with quite a bit of missing data. Even if I calculate Fis for individual populations using for example basic.stats(test[c(15:17),]), the thus retained Fis value is still not within the CIs. Any idea what I can change to get meaningful CIs?

Your help is much appreciated, Cheers, Bernhard

jgx65 commented 3 years ago

Hi Bernhard,

It is difficult to answer your questions without seeing your data. As a first insight, you might check if fis.dosage is giving you the "same" as basic.stats .

bs<-basic.stats(test)
bs.fis<-1-colMeans(bs$Ho,na.rm=TRUE)/colMeans(bs$Hs,na.rm=TRUE)

range(bs.fis-fis[-12],na.rm=TRUE)

If it does, then the issue is with bootstrapping SNPs with many missing values. With 23k loci, i would suggest building a Welsh confidence interval (i.e, mean(Fis) +/- 2SE, where SE is standard error estimated over loci) which should be quite robust (although it might not have proper coverage if your loci are linked, but this is also a problem with bootstrapping loci).

Let me know if this works.

Cheers

casparbein commented 3 years ago

Hi Jerome,

thank you very much for your reply. I used the code you posted and unfortunately, the results from basic.stats and fis.dosage are slightly different. They are in the same range, might the difference be due to the way missing data is handled?

My code looks like this:

basic <- basic.stats(test1) basic.fis <- 1-colMeans(basic$Ho,na.rm=TRUE)/colMeans(basic$Hs,na.rm=TRUE) dos <- fstat2dos(test1, diploid = T) fis.dosage(dos, pop = test1$pop, matching = FALSE)

And these are the results:

basic.fis 1 0.08545589
2 0.11897349
3 0.04078267
4 0.13296026
5 0.11256335
6 0.12549499
7 0.17614069 8 0.13882509
9 -0.19559441
10 NaN
11 0.08918158

fis.dosage(dos, pop = test1$pop, matching = FALSE) 1 0.07552508
2 0.10268142
3 0.06572209
4 0.10293569
5 0.09418226
6 0.10427239
7 0.16800170
8 0.11198415
9 -0.18332626
10 NaN
11 0.07371901
All 0.07156975

Which of the calculations would you trust more? And regarding the SE, would this be the correct way to calculate it?

apply(basic$Fis, MARGIN = 2, FUN = sd, na.rm = TRUE)/sqrt(23509) ## 23509 is the number of loci

As a background, my data comes from ddRADseq, which is notorious for different amounts of missing data between individuals. I have a dataset with no missing data, which consists of about 2300 SNPs, but so far we wanted to try to include as much data as possible in calculating Fst and Fis. I will run the Fis-analysis with the 0%-missing dataset and let you know if the differences between fis.dosage and basic.stats persist.

Thank you again for your time and effort,

Cheers,

Bernhard

jgx65 commented 3 years ago

They will differ slightly (they are different estimators of Fis), the more so for small samples and larger amount of missing data. But if you plot one against the other, they are not far from the y=x line, with some noise, larger for smaller samples.

For SE, you should divide by the square root of the number of typed loci (apply(basic$Fis,2,function(x) sum(!is.na(x))))

casparbein commented 3 years ago

Update concerning 0% missing data: fis.dosage() and basic.stats() still lead to different results, but as you explained, they are close to the y=x line in a plot. Interestingly, bootstrapping works now, with intervals overlapping with values calculated by basic.stats(). These intervals overlap with 0 for a bunch of populations, a phenomenon not seen in the Fis values (+- 2E) calculated with 40 % missing data/23000 loci. We suspect that more data will lower the standard error and therefore, many intervals will not overlap with 0, but we are not sure if that is biologically reasonable. The null expectation is that populations are neither in- nor outbreeding and Fis is 0. In our case, the Fis values of 9 out of 10 populations are 2SEs away from 0, consequently significantly different from 0 when using 23000 loci.

My code to calculate the SE now looks like this: se <- apply(basic$Fis, MARGIN = 2, FUN = sd, na.rm = TRUE)/sqrt(apply(basic$Fis,2,function(x) sum(!is.na(x)))) (1-colMeans(basic$Ho,na.rm=TRUE)/colMeans(basic$Hs,na.rm=TRUE) ) + c(1,-1)*2*se

Would you rather suggest to use less loci and less missing data or do you think that the positive Fis values can be used in a possible publication? And do you think we can report confidence intervals calculated as stated above in a possible publication?

Thank you again very much for your insights, Cheers, Bernhard

jgx65 commented 3 years ago

The issue with building CIs is one of independence. I suspect some of your loci are in LD, and therefore cannot be considered as independent pieces of information. If they are mapped, and/or you can estimate LD, you may be able to do block bootstrap, ie, define a block size for which you estimate Fis and bootstrap these blocks. But this will have to be done manually, no code yet for this. Or you might "guestimate" the number of independent blocks, and divide SD by the square root of this guestimate

Positive values of Fis might arise because your samples do not belong to the same reproductive group (Walhund effect).

Hope this helps

casparbein commented 3 years ago

Thank you for your comments. As we did ddRADseq and assembly without a reference, we have no way of knowing the linkage of our RAD loci. Malinsky et al., 2018 for instance consider RAD loci to be unlinked in their software package fineRADstructure, although they urge for caution with underestimating linkage. For our project, we now plan to employ the CIs you suggested and will mention possible downsides along the way. Regarding posivite FIs, we are suspecting considerable gene flow between populations (high within group heterozygosity), so it is not totally unexpected to see positive values for FI, but we were still surprised that all populations showed values +-2SEs away from 0.

If you have any final suggestions, I'd appreciate them!

All the best,

Bernhard

jgx65 commented 3 years ago

perhaps one thing to do is to remove loci with more than 10% missing individuals, and individuals with more than 20% missing loci (the 10% and 20% are from personal experience, but you can tune this and see how Fis and CI change with these parameters. This is what I would recommend as a reviewer)

Best of luck for the paper, and, if you feel your questions have been answered, you can close this issue

Cheers