uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools
71 stars 14 forks source link

Outgroup f3 statistics not working with pseudohaploidized data. #44

Open Hjorvik opened 1 year ago

Hjorvik commented 1 year ago

Hi!

I'm running some Ougroup f3 comparisons on a dataset of modern and ancient data. Ancient data, being of varying quality and coverage, has been pseudohaploidized from the start. To match this, I also haploidized the modern genotypes of our reference samples. The issue is that when I run f3(out,pop1,pop2) with the completely pseudohaploid dataset all runs fail (nas in all comparisons). However, when I use the dataset with only ancient pseudohaploid samples it works.

A solution to run Outgroup f3 stats we found is to tun f4(out,pop1;out,pop2), and that works with the completely pseudohaploidized dataset, but I wanted to let you know this issue exist.

Best, Pedro.

uqrmaie1 commented 1 year ago

I suspect that the problem is that the first population (out) consists of only a single pseudohaploidized sample. This makes it impossible to get unbiased estimates of f3, which is why you get NAs.

f4(out,pop1;out,pop2) and f3(out;pop1,pop2) will be identical in the limit as the number of samples in each population increases. With small sample counts, unbiased estimates of f3 depend on a bias correction term for the first population. That term is proportional to sample size, and it can't be computed when one of the populations has only a single pseudohaploidized sample.

In addition to that, the default option for f3 when it is computed from genotype files, is that the base estimate is normalized by the heterozygosity of the first population. Just like the bias correction term, the heterozygosity can only be estimated if the first population has at least two haplotypes (one diploid sample or two pseudohaploid samples).

There are options in the f3 function to get around that limitation. The resulting f3 estimates may be biased, but that bias may be small, and sometimes the bias doesn't matter even if it is large. An example of a case where a large bias doesn't matter is qpGraph, where multiple f3-statistics with the same out population are compared to each other, and it's only the relative difference between f3-statistics that matters.

The heterozygosity normalization can be turned off with outgroupmode = TRUE. The bias correction term can effectively be set to 0 with apply_corr = FALSE. The bias correction term is positive and is normally subtracted from the uncorrected f3-estimate, so setting this to FALSE will result in an upward biased f3-estimate.

When both options are passed to the f3 function, it should give the same result as the f4 function. So f3('prefix', 'out', 'pop1', 'pop2', outgroupmode = TRUE, apply_corr = FALSE) should give the same result as f4('prefix', 'out', 'pop1', 'out', 'pop2').

Hjorvik commented 1 year ago

Thank you very much for the explanation. Knowing that f3() expects a diploid outgroup or more than one sample, we can plan our dataset accordingly. I save the commands, however, in case that's not feasible in the future.

Best, Pedro.