uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools

Distances between populations. #36

Open smd555 opened 1 year ago

smd555 commented 1 year ago

Good day! I have a question regarding Admixtools 2. How can I calculate a genetic distance between populations or individuals that is linear, roughly additive, and proportional to the absolute value of the allele frequency difference?

For example, take three populations a, b and c with pairwise distances Dab, Dbc and Dac. By linearity I mean that each of these distances does not exceed the sum of the other two, for example Dac <= Dab + Dbc. By proportionality I mean that if the absolute allele frequency difference between a and c is twice as large as that between a and b, then the distances should be in a 2:1 ratio.

How can this be achieved? If I understand correctly, Fst is non-linear, and so is the simple f2: f2(A,B) = (∑ (aj - bj)^2)/M. Maybe something like (∑ |aj - bj|)/M is needed, that is, absolute values instead of squares, as in the mean absolute deviation? Are there corresponding functions in Admixtools 2? Or am I wrong, and other formulas are used for this?

Best regards

uqrmaie1 commented 1 year ago

This is a bit late, but I think you raised an interesting question.

What you call linearity (Dac <= Dab + Dbc) is usually called the triangle inequality. (∑|aj-bj|)/M has that property regardless of how a, b, and c are related to each other. f2 as defined above is a mean squared difference, so it is only guaranteed to satisfy the triangle inequality after taking the square root, and Fst doesn't have the property in general.

f2 doesn't have the proportionality property as you define it, but it has another property which is perhaps more important, and which (∑|aj-bj| )/M doesn't satisfy: If a, b, and c are related like a -> b -> c or like a <- b -> c then f2(a, c) = f2(a, b) + f2(b, c).
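To make the contrast concrete, here is a minimal simulation sketch. It is not ADMIXTOOLS 2 code: the helper names `f2`, `mean_abs_diff` and `drift` are made up for illustration, and drift is modelled as independent zero-mean Gaussian noise on each allele frequency, which is a toy assumption (real drift variance depends on the allele frequency). Under that assumption, f2 along a -> b -> c should come out close to additive, while the mean absolute difference should come out clearly smaller than the sum of its two parts.

```python
import random

def f2(p, q):
    """Mean squared allele frequency difference (the simple f2 from the question)."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) / len(p)

def mean_abs_diff(p, q):
    """Mean absolute allele frequency difference, (sum |aj - bj|) / M."""
    return sum(abs(x - y) for x, y in zip(p, q)) / len(p)

def drift(freqs, sd, rng):
    """Toy drift model (assumption): add independent zero-mean noise per SNP."""
    return [x + rng.gauss(0.0, sd) for x in freqs]

rng = random.Random(1)   # fixed seed so the run is reproducible
M = 100_000              # number of simulated SNPs

# a -> b -> c: b drifts away from a, then c drifts away from b.
# Starting frequencies stay well inside (0, 1), so no clipping is applied.
a = [rng.uniform(0.3, 0.7) for _ in range(M)]
b = drift(a, 0.03, rng)
c = drift(b, 0.04, rng)

f2_ab, f2_bc, f2_ac = f2(a, b), f2(b, c), f2(a, c)
d_ab, d_bc, d_ac = mean_abs_diff(a, b), mean_abs_diff(b, c), mean_abs_diff(a, c)

print(f"f2(a,c) = {f2_ac:.6f}   f2(a,b) + f2(b,c) = {f2_ab + f2_bc:.6f}")
print(f"|d|(a,c) = {d_ac:.6f}   |d|(a,b) + |d|(b,c) = {d_ab + d_bc:.6f}")
```

The two f2 numbers should agree up to sampling noise, whereas the absolute-difference distance between a and c falls short of the sum, so it satisfies the triangle inequality but is not additive along the branch.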

This is because f2 is the variance of allele frequency differences between two populations, and the sum of the variances of two independent random variables (drift between a and b and drift between b and c) is equal to the variance of the sum of both (drift between a and c).
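The variance argument can be written out in one step. This is a sketch that reuses the question's notation (a_j, b_j, c_j are the allele frequencies at SNP j) and treats them as random variables, with f2 read as an expected squared difference:

```latex
\begin{aligned}
f_2(a,c) &= E\big[(a_j - c_j)^2\big]
          = E\big[\big((a_j - b_j) + (b_j - c_j)\big)^2\big] \\
         &= E\big[(a_j - b_j)^2\big] + E\big[(b_j - c_j)^2\big]
            + 2\,E\big[(a_j - b_j)(b_j - c_j)\big] \\
         &= f_2(a,b) + f_2(b,c) + 2\,E\big[(a_j - b_j)(b_j - c_j)\big]
\end{aligned}
```

When the two drift terms are uncorrelated and have mean zero, as for a -> b -> c or a <- b -> c, the cross term vanishes and f2 is additive. Absolute values have no such decomposition, which is why (∑|aj-bj|)/M is not additive along a branch.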

See "Ancient Admixture in Human History" (Patterson et al., 2012), section "Additivity of F2 along a tree branch", for more details.