Open maruiqi0710 opened 7 months ago
Good question.
Generally, more samples make allele frequency estimates more accurate, which leads to more precise results (smaller standard errors, more power to detect a real signal). For f4, and for any tools based on f4-statistics (qpWave, qpAdm), there is no minimum sample requirement, and even single pseudohaploid/pseudodiploid samples (with no heterozygous genotypes) can be used to represent a population.
The D-statistic is the same as f4 divided by a normalizing factor that keeps it between -1 and 1 (see here), and I don't think that this difference makes the D-statistic preferable to f4 when fewer samples are available. But I'd be happy to be corrected!
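To make the relationship concrete, here is a minimal sketch in Python with NumPy. It assumes the commonly used allele-frequency definitions (f4 as the mean over SNPs of (a - b)(c - d), and D as the same sum divided by the sum of (a + b - 2ab)(c + d - 2cd)). The function names and the toy frequencies are made up for illustration, and this is not the ADMIXTOOLS code.

```python
import numpy as np

def f4(a, b, c, d):
    """Mean over SNPs of (a - b) * (c - d); inputs are allele frequencies."""
    a, b, c, d = (np.asarray(x, dtype=float) for x in (a, b, c, d))
    return np.mean((a - b) * (c - d))

def d_stat(a, b, c, d):
    """Same numerator as f4, divided by a factor that bounds it in [-1, 1]."""
    a, b, c, d = (np.asarray(x, dtype=float) for x in (a, b, c, d))
    num = np.sum((a - b) * (c - d))
    den = np.sum((a + b - 2 * a * b) * (c + d - 2 * c * d))
    return num / den

# Toy allele frequencies for four populations at four SNPs (illustrative only).
# Frequencies of exactly 0 or 1 are what a single pseudohaploid sample gives.
a = np.array([0.1, 0.5, 1.0, 0.0])
b = np.array([0.2, 0.4, 0.0, 0.0])
c = np.array([0.9, 0.5, 1.0, 1.0])
d = np.array([0.3, 0.1, 0.0, 0.5])

f4_value = f4(a, b, c, d)
d_value = d_stat(a, b, c, d)
print(f4_value, d_value)
```

Both statistics share the same numerator, so they always have the same sign. Only the scale differs, which is why the normalization by itself would not be expected to help at low sample size.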
If I had to guess what prompted the statement "f-statistics can only be used when robust population allele sampling data are available": For the f-statistics FST, f2, and f3, low sample size can not only lead to higher variance in the estimates, but also to bias. Most estimators of FST, f2, and f3 correct for this bias, so the estimates are unbiased even at low sample size. The bias correction requires at least two haploid chromosomes per population (or one diploid sample), and it can be off for inbred samples. So for some estimators of some f-statistics, it is correct to say that they can only be used when robust population allele sampling data are available.
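As a rough illustration of that bias, here is a small simulation sketch in Python with NumPy. It assumes the standard correction that subtracts p(1 - p)/(n - 1) per population, where p is the sample allele frequency and n is the number of haploid chromosomes sampled. The function names and simulation settings are hypothetical and are not taken from any particular software.

```python
import numpy as np

def f2_naive(p1, p2):
    """Plug-in f2: mean squared difference of allele frequencies."""
    return np.mean((p1 - p2) ** 2)

def f2_unbiased(p1, n1, p2, n2):
    """f2 with sampling-variance correction; n1, n2 are haploid chromosome counts."""
    if n1 < 2 or n2 < 2:
        raise ValueError("bias correction needs at least two chromosomes per population")
    correction = p1 * (1 - p1) / (n1 - 1) + p2 * (1 - p2) / (n2 - 1)
    return np.mean((p1 - p2) ** 2 - correction)

rng = np.random.default_rng(0)
n_snps = 200_000

# Simulated "true" population frequencies for two closely related populations.
true1 = rng.uniform(0.05, 0.95, n_snps)
true2 = np.clip(true1 + rng.normal(0, 0.05, n_snps), 0, 1)
truth = f2_naive(true1, true2)

# One diploid sample per population = two independently sampled chromosomes.
n = 2
hat1 = rng.binomial(n, true1) / n
hat2 = rng.binomial(n, true2) / n

naive = f2_naive(hat1, hat2)
unbiased = f2_unbiased(hat1, n, hat2, n)
print(truth, naive, unbiased)
```

In this sketch the naive estimate is inflated well above the true value, while the corrected one lands close to it. The correction assumes the two chromosomes of a sample are independent draws, which is the assumption that inbreeding violates.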
However, f4 is unbiased out of the box, without any bias correction, and regardless of sample size. f4 can be biased for other reasons (non-random SNP ascertainment or missing data), but the same is true for the D-statistic.
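The reason is that the sampling errors in the four populations are independent, so they cancel in expectation in the product (a - b)(c - d). A minimal simulation sketch in Python with NumPy, using one pseudohaploid sample (a single sampled chromosome) per population, illustrates this. The population setup and all variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_snps = 500_000

# Simulated true frequencies: populations B and D share a drift term,
# so the true f4(A, B; C, D) is positive.
base = rng.uniform(0.2, 0.8, n_snps)
shared = rng.normal(0, 0.1, n_snps)
a_true = base
c_true = base
b_true = np.clip(base + shared, 0, 1)
d_true = b_true

truth = np.mean((a_true - b_true) * (c_true - d_true))

# One pseudohaploid sample per population: each frequency estimate is 0 or 1.
a_hat = rng.binomial(1, a_true).astype(float)
b_hat = rng.binomial(1, b_true).astype(float)
c_hat = rng.binomial(1, c_true).astype(float)
d_hat = rng.binomial(1, d_true).astype(float)

# No bias correction applied: plain plug-in f4.
estimate = np.mean((a_hat - b_hat) * (c_hat - d_hat))
print(truth, estimate)
```

With enough SNPs, the plug-in estimate from single chromosomes ends up close to the true value. It is much noisier per SNP than an estimate from many samples, but it is not systematically shifted.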
In James B's article (https://doi.org/10.1093/sysbio/syv023), it is mentioned that:
The calculations for qpWave, qpAdm, and f4 are based on f-statistics. Do these methods have a minimum requirement for the number of individuals in each population? Is it acceptable if some populations have only one or a few samples?