simonhmartin / genomics_general

General tools for genomic analyses.

Can fd determine the direction of gene flow? #74

Open kuangzhuoran opened 2 years ago

kuangzhuoran commented 2 years ago

Specifically, can fd show gene flow from P2 to P3 or from P3 to P2, rather than just between P2 and P3, using only Equation 6 of Martin et al. 2015? I don't understand this very well. While reading the literature, I found a paper in which the fd statistics had a clear direction and were even quantified. The authors only mention the calculation of fd in the Methods and say nothing more, and the equation they give is consistent with Martin et al. 2015, Equation 6.

ref: Genomic analysis of the domestication and post-Spanish conquest evolution of the llama and alpaca

[Images: Fig2a, Fig2b]

Fig. 2 Inferences of admixture proportion and time and demographic history. a Estimated introgression proportions from llama to alpaca by using Local Ancestry Inference (LAI) (left). The X-axis indicates the guanaco-ancestry proportions in each sequenced alpaca individuals (Y-axis; N=8). The right panel showed the introgression proportions by using ABBA-BABA (fd) method. The arrow showed the introgression direction from llama to alpaca.

simonhmartin commented 2 years ago

fd is not specifically designed to detect directionality. However, if gene flow is very polarised, then fd will tend to give an underestimate when the direction is incorrectly assumed. This is shown in Figure 2 of the MBE paper (bottom right panel). So in your case above, it seems that the direction in panel a is far more prevalent. Given your sampling design, you could of course use the Dfoil method. I don't think it's necessary to use their software to do this, because the logic of the method is simply to compute four different D statistics and compare their values, which can be done with other tools.

Jungal10 commented 2 years ago

Just following up here. Could you elaborate on how calculating multiple fd values can help decipher gene flow directionality? For example, an A-B-C trio can have an fd of 0.1, and a B-A-C trio can have an fd of 0.2. How does this help me determine the direction? Thanks

simonhmartin commented 2 years ago

fd is not able to give you directionality information if you have only three taxa, but if you have a fourth, you can get some insight. Say your tree is ((A,B),(C,D),O), and say only B and C are in contact such that you only suspect gene flow between these two, then you are in a position to compute fd in two arrangements:

Arrangement 1: P1 = A, P2 = B, P3 = C, outgroup = O
Arrangement 2: P1 = D, P2 = C, P3 = B, outgroup = O

fd implicitly assumes gene flow from P3 into P2. This means that Arrangement 1 assumes C->B and Arrangement 2 assumes B->C. If the true direction was C->B, then Arrangement 1 will give a good estimate of the true proportion of introgression, while Arrangement 2 will give an underestimate. This is because, in Arrangement 2, many of the derived alleles that flowed from C into B will also be present in D (which is P1 in that arrangement) due to its shared ancestry with C, so those sites do not show the ABBA pattern. This weakens the signal and reduces the estimate.

Based on this reasoning, I suggested that most gene flow has gone from llama into alpaca, rather than the other way around.

Two important assumptions in this reasoning are:

  1. The P1 population is always completely isolated from P3
  2. All gene flow occurred after the P1 and P2 split in both arrangements

If, in your case, guanaco has also received gene flow from alpaca (possibly via llama), or if it only split from llama very recently, and gene flow had occurred before this split, this would also give a lower alpaca->llama fd estimate even if gene flow was bidirectional.
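The two-arrangement comparison described above can be sketched in a few lines of Python. This is only a rough illustration, not the code used in this repository: it assumes you already have per-site derived allele frequencies for A, B, C and D, that the outgroup is fixed for the ancestral allele at every site, and it follows the fd definition in Martin et al. 2015 (the denominator replaces P2 and P3 with whichever of the two has the higher derived allele frequency at each site). The function names and the frequency values are made up for the example.

```python
import numpy as np


def abba_baba_sum(p1, p2, p3):
    """Sum of (ABBA - BABA) over sites, from derived allele frequencies.

    Assumption: the outgroup is fixed for the ancestral allele at every site.
    """
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    abba = (1 - p1) * p2 * p3
    baba = p1 * (1 - p2) * p3
    return float(np.sum(abba - baba))


def fd(p1, p2, p3):
    """fd for one arrangement; implicitly assumes gene flow from P3 into P2."""
    p2 = np.asarray(p2, dtype=float)
    p3 = np.asarray(p3, dtype=float)
    # Donor frequency: per site, the higher of the P2 and P3 frequencies
    pd = np.maximum(p2, p3)
    numerator = abba_baba_sum(p1, p2, p3)
    denominator = abba_baba_sum(p1, pd, pd)
    return numerator / denominator if denominator != 0 else float("nan")


# Arbitrary placeholder frequencies (one value per site), not real data
freqs = {
    "A": [0.0, 0.1, 0.0, 0.2],
    "B": [0.5, 0.6, 0.0, 0.3],
    "C": [1.0, 0.8, 0.2, 0.4],
    "D": [0.9, 0.7, 0.1, 0.4],
}

# Arrangement 1: P1 = A, P2 = B, P3 = C (assumes C->B)
fd_c_into_b = fd(freqs["A"], freqs["B"], freqs["C"])
# Arrangement 2: P1 = D, P2 = C, P3 = B (assumes B->C)
fd_b_into_c = fd(freqs["D"], freqs["C"], freqs["B"])

print("fd assuming C->B:", fd_c_into_b)
print("fd assuming B->C:", fd_b_into_c)
```

A clearly larger value in one arrangement would be consistent with gene flow mainly in that direction, but only under the two assumptions listed above. Note also that fd is only meaningful where D is positive, so windows with negative D would normally be excluded or set to zero before comparing.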

I hope this helps?

Simon

simonhmartin commented 2 years ago

A final additional comment. For estimating a single genome-wide level of introgression, I don't recommend fd, which is specifically designed for narrow windows. For a single genome-scale estimate, I recommend the standard f estimator, which can be computed using Dsuite, or if you want to keep track of every step, by following my tutorial here: https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome
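For readers who want to see the shape of that calculation, here is a minimal sketch of the standard f estimator from derived allele frequencies. It is not taken from Dsuite or from the tutorial scripts. It assumes the outgroup is fixed for the ancestral allele, and that the P3 samples have been split into two subsets, here called `p3a` and `p3b` (hypothetical names), so that the denominator represents the signal expected under complete replacement.

```python
import numpy as np


def abba_baba_sum(p1, p2, p3):
    """Sum of (ABBA - BABA) over sites, from derived allele frequencies.

    Assumption: the outgroup is fixed for the ancestral allele at every site.
    """
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    return float(np.sum((1 - p1) * p2 * p3 - p1 * (1 - p2) * p3))


def f_estimator(p1, p2, p3, p3a, p3b):
    """Genome-wide f: observed ABBA-BABA excess relative to the excess
    expected if P2 were entirely replaced by P3 (approximated by using
    one P3 subset in place of P2 and the other in place of P3)."""
    numerator = abba_baba_sum(p1, p2, p3)
    denominator = abba_baba_sum(p1, p3a, p3b)
    return numerator / denominator if denominator != 0 else float("nan")
```

Called on frequency arrays covering all sites in the genome, this returns a single admixture proportion estimate. Standard errors would still need something like a block jackknife, which is left out here.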

dcmain commented 10 months ago

Hi Simon I just wanted to follow up on your very coherent explanation given above. Is the idea that fd will give a better approximation of gene flow if your population arrangement matches the expectation of P3 to P2 also applicable to D? I have computed D and the f4 ratio for an arrangement of trios that exactly matches your example above i.e.

Arrangement 1: P1 = A, P2 = B, P3 = C, outgroup = O
Arrangement 2: P1 = D, P2 = C, P3 = B, outgroup = O

Would I be correct in assuming that if D is higher for Arrangement 1 than for Arrangement 2, then the direction of gene flow is C -> B? Or does D measure gene flow in both directions? I ask because, while you can compute f statistics in sliding windows to approximate the proportion of admixture more accurately, I am not sure how to summarize these kinds of statistics across the genome in a way that allows a more global assessment of introgression between taxa. Individual windows might show strong signals of introgression, but it's tricky to compare f statistics between multiple closely related taxa for individual windows and get a sense of genome-wide admixture. So if the goal is to get an idea of which taxa show signals of admixture, is the D statistic (together with p or Z) appropriate?

dcmain commented 10 months ago

I just realized that your second comment about the f estimator in Dsuite answered my question about estimating a single genome-wide level of introgression. So I guess my only question is, can the f4 ratio give you an idea of the direction of gene flow in the same way that you described above for fd?