uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools

D statistic estimation #38

Closed juanenciso14 closed 11 months ago

juanenciso14 commented 1 year ago

Hi,

I am trying to work out the right order in which to specify populations when calculating D statistics with the f4 function. The interpretation of the results depends on how D is defined in the package. The documentation of the function suggests that the numerator is equivalent to ABBA - BABA. However, in Patterson 2012 the numerator is defined as BABA - ABBA. Some of our results suggest that the program is in fact computing BABA - ABBA, contrary to what the f4 documentation says. Is this the case?

Thank you in advance!

uqrmaie1 commented 1 year ago

I didn't consider the distinction between ABBA - BABA and BABA - ABBA important when writing the documentation, since the sign can easily be flipped by changing the order of the populations. But I think you're right that it should read BABA - ABBA, and that is in fact what is being calculated. f4 is identical to the numerator of the D statistic: f4(a, b; c, d) = P(BABA) - P(ABBA). The denominator of the D statistic, P(BABA) + P(ABBA), is always positive, so f4 and D always have the same sign for the same order of populations.
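To make the relationship between f4 and D concrete, here is a minimal Python sketch. It is not the admixtools implementation, and the function names `abba_baba` and `f4_and_d` and the toy allele frequencies are made up for illustration. It assumes one chromosome is sampled per population and that each list holds the frequency of the same reference allele at each SNP. Under those assumptions, P(BABA) - P(ABBA) at a SNP reduces to (a - b)(c - d), and the two statistics differ only by a positive denominator.

```python
def abba_baba(a, b, c, d):
    """Per-SNP pattern probabilities for populations in the order (a, b, c, d).

    Each argument is the frequency of the same reference allele in one
    population (illustrative assumption: one chromosome drawn per population).
    BABA: a and c carry one allele, b and d carry the other.
    ABBA: a and d carry one allele, b and c carry the other.
    """
    p_baba = a * (1 - b) * c * (1 - d) + (1 - a) * b * (1 - c) * d
    p_abba = (1 - a) * b * c * (1 - d) + a * (1 - b) * (1 - c) * d
    return p_baba, p_abba


def f4_and_d(pop_a, pop_b, pop_c, pop_d):
    """Return (f4, D) for four equally long lists of allele frequencies."""
    numerator = 0.0    # sum over SNPs of P(BABA) - P(ABBA)
    denominator = 0.0  # sum over SNPs of P(BABA) + P(ABBA)
    for a, b, c, d in zip(pop_a, pop_b, pop_c, pop_d):
        p_baba, p_abba = abba_baba(a, b, c, d)
        numerator += p_baba - p_abba
        denominator += p_baba + p_abba
    f4 = numerator / len(pop_a)    # mean over SNPs
    d_stat = numerator / denominator
    return f4, d_stat


# Toy allele frequencies at four SNPs (made-up values, not real data).
pop_a = [0.1, 0.5, 0.9, 0.3]
pop_b = [0.2, 0.4, 0.7, 0.3]
pop_c = [0.1, 0.6, 0.8, 0.2]
pop_d = [0.3, 0.4, 0.6, 0.5]

f4_value, d_value = f4_and_d(pop_a, pop_b, pop_c, pop_d)
print("f4(a, b; c, d):", f4_value)
print("D(a, b; c, d): ", d_value)

# Swapping the first two populations flips the sign of both statistics.
f4_swapped, d_swapped = f4_and_d(pop_b, pop_a, pop_c, pop_d)
print("f4(b, a; c, d):", f4_swapped)
print("D(b, a; c, d): ", d_swapped)
```

Because the denominator is a sum of probabilities, it cannot change the sign, which is why reordering the populations affects f4 and D in the same way.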

juanenciso14 commented 11 months ago

Thank you for clarifying!

smallfishcui commented 5 months ago

But what does the result mean? A positive value of f4 means that P(BABA) - P(ABBA) is positive. Does that indicate excess gene flow between pop2 and pop3, or between pop1 and pop4? Sorry for the naive question. I have used admixtools before, but since there are quite a few changes between versions 1 and 2, it would be nice to verify that I'm understanding it correctly.

thanks, Cui

uqrmaie1 commented 5 months ago

I find it easiest to think of it as a correlation of allele frequency differences. A positive value of f4(a, b; c, d), and of D(a, b; c, d), means that the allele frequency differences a-b and c-d are positively correlated. In other words, a and c share some genetic drift with each other, relative to b and d. A negative f4(a, b; c, d) means that a and d share some genetic drift with each other, relative to b and c. If f4(a, b; c, d) is zero, then a and b form a clade relative to c and d.
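The three cases can be illustrated with a small Python sketch. This is a toy example under stated assumptions, not admixtools code: the helper names `f4` and `add_drift`, the baseline frequencies, and the drift vectors are all hypothetical. "Shared drift" is modelled simply as the same per-SNP shift applied to two populations.

```python
def f4(pop_a, pop_b, pop_c, pop_d):
    """Mean over SNPs of (a - b) * (c - d), given allele frequency lists."""
    products = [(a - b) * (c - d)
                for a, b, c, d in zip(pop_a, pop_b, pop_c, pop_d)]
    return sum(products) / len(products)


def add_drift(freqs, drift):
    """Apply a per-SNP frequency shift (a crude stand-in for genetic drift)."""
    return [p + s for p, s in zip(freqs, drift)]


# Made-up baseline frequencies and two drift vectors that are
# uncorrelated with each other across the four SNPs.
base = [0.3, 0.5, 0.4, 0.6]
drift_1 = [0.1, -0.1, 0.1, -0.1]
drift_2 = [0.1, 0.1, -0.1, -0.1]

# Case 1: a and c share drift relative to b and d, so f4 is positive.
f4_shared_ac = f4(add_drift(base, drift_1), base,
                  add_drift(base, drift_1), base)

# Case 2: a and d share drift relative to b and c, so f4 is negative.
f4_shared_ad = f4(add_drift(base, drift_1), base,
                  base, add_drift(base, drift_1))

# Case 3: a and c each have private, uncorrelated drift, so a-b and c-d
# are uncorrelated and f4 is zero up to floating-point error.
f4_no_shared = f4(add_drift(base, drift_1), base,
                  add_drift(base, drift_2), base)

print("a and c share drift:", f4_shared_ac)
print("a and d share drift:", f4_shared_ad)
print("no shared drift:    ", f4_no_shared)
```

With real data f4 is rarely exactly zero, so the third case is assessed in practice by comparing the estimate to its standard error rather than by checking for an exact zero.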