uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools

Do qpWave, qpAdm, and f5 have requirements on the minimum number of individuals for each population? #58

Open maruiqi0710 opened 7 months ago

maruiqi0710 commented 7 months ago

In James B's article (https://doi.org/10.1093/sysbio/syv023), it is mentioned that:

The f-statistics (f4 for a four-population clade) are analogous to the D-statistic in form, but use population allele frequencies to estimate the proportion of admixture/introgression (Reich et al. 2009, 2011; Patterson et al. 2012). D-statistics only require one sequence per taxon and are thus suitable for phylogenetic sampling, whereas f-statistics can only be used when robust population allele sampling data are available.

The calculations for qpWave, qpAdm, and f5 are based on f-statistics. Do these methods require a minimum number of individuals in each population? Is it acceptable if some populations have only one or a few samples?

uqrmaie1 commented 7 months ago

Good question.

Generally, more samples make allele frequency estimates more accurate, which leads to more accurate results (more power, and so lower p-values when a real signal is present). For f4, and for any tools based on f4-statistics (qpWave, qpAdm), there is no minimum sample requirement, and even single pseudohaploid/pseudodiploid samples (with no heterozygous genotypes) can be used to represent a population.

The D-statistic is the same as f4 divided by a factor that keeps it between -1 and 1 (see here), and I don't think that this difference makes the D-statistic preferable to f4 when fewer samples are available. But I'd be happy to be corrected!
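To make the relationship concrete, here is a minimal Python sketch that computes both statistics from the same per-SNP allele frequencies. The function names `f4` and `d_stat` and the toy frequencies are made up for illustration, and this is not how admixtools computes them internally. The denominator is the frequency-based normalization given in Patterson et al. 2012.

```python
def f4(a, b, c, d):
    """Mean over SNPs of (a - b) * (c - d), given per-SNP allele frequencies."""
    products = [(ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d)]
    return sum(products) / len(products)


def d_stat(a, b, c, d):
    """Same numerator as f4, divided by a normalizing sum over SNPs.

    Assumes at least one SNP has a non-zero denominator term.
    """
    num = sum((ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d))
    den = sum(
        (ai + bi - 2 * ai * bi) * (ci + di - 2 * ci * di)
        for ai, bi, ci, di in zip(a, b, c, d)
    )
    return num / den


# Toy allele frequencies for four populations at three SNPs (illustrative only).
pop_a = [0.1, 0.5, 0.9]
pop_b = [0.2, 0.4, 0.7]
pop_c = [0.3, 0.6, 0.8]
pop_d = [0.3, 0.2, 0.5]

f4_value = f4(pop_a, pop_b, pop_c, pop_d)
d_value = d_stat(pop_a, pop_b, pop_c, pop_d)
print(f4_value, d_value)
```

Both functions accept frequencies of exactly 0 or 1, which is what a single pseudohaploid sample provides. At every SNP the denominator term is at least as large as the absolute value of the numerator term, so the ratio cannot exceed 1 in magnitude, and the two statistics always share the same sign.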

If I had to guess what prompted the statement "f-statistics can only be used when robust population allele sampling data are available": For the f-statistics FST, f2, and f3, low sample size can not only lead to higher variance in the estimates, but also to bias. Most estimators of FST, f2, and f3 correct for this bias, so the estimates are unbiased even at low sample size. The bias correction requires at least two haploid chromosomes per population (or one diploid sample), and it can be off for inbred samples. So for some estimators of some f-statistics, it is correct to say that they can only be used when robust population allele sampling data are available.
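The f2 bias and its correction can be checked exactly for small samples. The sketch below assumes binomial sampling of `n1` and `n2` haploid chromosomes from populations with true allele frequencies `p1` and `p2`, and uses the common correction that subtracts `x * (1 - x) / (n - 1)` for each population. The function names are hypothetical, and this illustrates the idea, not the admixtools implementation.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of observing k derived alleles among n sampled chromosomes."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_f2(p1, p2, n1, n2, corrected):
    """Exact expectation of the f2 estimate at one SNP over all sample outcomes."""
    total = 0.0
    for k1 in range(n1 + 1):
        for k2 in range(n2 + 1):
            x1, x2 = k1 / n1, k2 / n2  # sample allele frequencies
            est = (x1 - x2) ** 2       # naive estimate, biased upward
            if corrected:
                # The correction divides by (n - 1), so it needs n >= 2 chromosomes.
                est -= x1 * (1 - x1) / (n1 - 1) + x2 * (1 - x2) / (n2 - 1)
            total += binom_pmf(k1, n1, p1) * binom_pmf(k2, n2, p2) * est
    return total


true_f2 = (0.3 - 0.6) ** 2
naive = expected_f2(0.3, 0.6, 2, 2, corrected=False)
unbiased = expected_f2(0.3, 0.6, 2, 2, corrected=True)
print(true_f2, naive, unbiased)
```

With two chromosomes per population, the naive expectation overshoots the true value while the corrected one matches it. Calling `expected_f2` with `n1=1` and `corrected=True` raises a `ZeroDivisionError`, which mirrors the requirement of at least two haploid chromosomes per population.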

However, f4 is unbiased out of the box, without any bias correction, and regardless of sample size. f4 can be biased for other reasons (non-random SNP ascertainment or missing data), but the same is true for the D-statistic.
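The unbiasedness of f4 at the smallest possible sample size can be verified by enumeration. This sketch assumes one haploid chromosome is drawn independently from each of four populations with true allele frequencies `a`, `b`, `c`, and `d`. The function name is hypothetical, and SNP ascertainment and missing data are ignored.

```python
from itertools import product


def expected_f4_single_haploid(a, b, c, d):
    """Exact expectation of (xa - xb) * (xc - xd) at one SNP.

    Each x is the 0/1 allele of one haploid chromosome from that population.
    """
    freqs = (a, b, c, d)
    total = 0.0
    for alleles in product((0, 1), repeat=4):  # all 16 possible outcomes
        prob = 1.0
        for allele, p in zip(alleles, freqs):
            prob *= p if allele == 1 else 1 - p
        xa, xb, xc, xd = alleles
        total += prob * (xa - xb) * (xc - xd)
    return total


a, b, c, d = 0.1, 0.4, 0.7, 0.2
expected = expected_f4_single_haploid(a, b, c, d)
true_f4 = (a - b) * (c - d)
print(expected, true_f4)
```

The two printed values agree up to floating-point error. Each sampled allele appears only linearly in the product and the draws are independent, so the expectation factorizes into the true frequency differences and no correction term is needed.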