millanek / Dsuite

Fast calculation of Patterson's D (ABBA-BABA) and the f4-ratio statistics across many populations/species
160 stars 26 forks source link

How deal with output of Dinvestigate #60

Open RezaFahi opened 1 year ago

RezaFahi commented 1 year ago

Hi I ran Dsuite Dinvestigate and I get the output. Now I want to define windows with high fd value (top 5%) as introgressed regions. how can I deal with fd values that its related D value is a negative?

Callithrix-omics commented 1 year ago

We have been discussion a similar question for fd_M for a dataset on which I am working. Looking at some publications (eg. doi:10.1093/molbev/msac008), I notice that only fd_M are reported (fig 5a from linked pub). My interpretation for fd_M and by extension fd is that that Dsuite only reports positive D values which is related to introgression between P2<>P3 in the ((P1,P2)P3,O) configuration. As a result, the focus is on positive fd_M to reflect the positive Dsuite values for P2<>P3. In my case I use f4 statistic as a cut off for fd_M outliers. Negative fd and fd_M aren't "useless," but we would expect them to be consistent with negative D values. So as a result of reporting positive D values by Dsuite, I am guessing negative fd and fd_M are ignored as in the case of the article I linked. If someone could confirm my interpretation I would appreciate that.

millanek commented 1 year ago

Hi everyone ...

Both f_d and f_dM were developed to look for candidate introgressed regions in cases where the genome-wide D statistic is significantly positive (suggesting gene-flow between P3 and P2). So, you know a priori that you are looking for positive values.

The variance in f_d and f_dM is due to Incomplete Lineage Sorting and also due to sampling noise (depending on sample size; e.g. uncertainty in allele frequency estimates). Because of this variance, it is true that if you run Dinvestigate on almost any dataset you will also get some negative values.

What do the negative values mean? Can we use them? For f_d, you definitely cannot use the negative values. The f_d statistic is asymmetric and only the positive values can be informative in any way. f_dM is symmetric, and so, in principle could be used. But what would the negative values mean? Are they simply due to random variance of the statistic, or are they due to local gene flow between P3 and P1? It is in principle possible that there would be introgression between P3 and P2, while at the same time some local introgressed regions between P3 and P1. But I think that we cannot tell.

For introgression between P3 and P2, we know from the genome-wide evidence that there is overall signal. Therefore, we can interpret the high peaks as candidate introgressed regions. We don't have any such corroborating evidence for P3 and P1, so I would be cautious.

Now, perhaps what I wrote is just a different way to say the same that @Callithrix-omics said. I think her interpretation is correct.

millanek commented 1 year ago

One more thought, with regards to the cutoffs for the positive f_d and f_dM. It is not clear what the best cutoff would be. Top 5% like @RezaFahi is an option. Using the genome-wide f4 statistic like @Callithrix-omics is another option - this will probably result in many outliers though ;). Both of these are rather arbitrary.

Recently I discussed this question with @feilchenfeldt. He came up with the idea to use the difference between the distribution of the positive values and the distribution of the negative values. If the genome-wide D statistic is positive, then you expect the positive f_dM values to be slightly higher in magnitude than the negative values. There will be longer tail of positive values - greater variance caused by the introgression. One could then use the negative value distribution to find the cutoffs for the positive values.

For example, the minimum negative value could be minus 0.5. And the 5th percentile of the negative values could be something like minus 0.43. Whereas the maximum positive value could be 0.7. And the 95th percentile something like 0.48. One could then use for example the magnitude of minimum negative value (0.5) as the cutoff for the candidate introgressed regions.

Now that I think of it, I think that to do this properly, one would have to split the distributions not at zero, but at the genome-wide mean. I.e. subtract the genome-wide mean from all values before doing the above.

Maybe we should add this info to the tutorial/documentation @feilchenfeldt ?

RezaFahi commented 1 year ago

One more thought, with regards to the cutoffs for the positive f_d and f_dM. It is not clear what the best cutoff would be. Top 5% like @RezaFahi is an option. Using the genome-wide f4 statistic like @Callithrix-omics is another option - this will probably result in many outliers though ;). Both of these are rather arbitrary.

Recently I discussed this question with @feilchenfeldt. He came up with the idea to use the difference between the distribution of the positive values and the distribution of the negative values. If the genome-wide D statistic is positive, then you expect the positive f_dM values to be slightly higher in magnitude than the negative values. There will be longer tail of positive values - greater variance caused by the introgression. One could then use the negative value distribution to find the cutoffs for the positive values.

For example, the minimum negative value could be minus 0.5. And the 5th percentile of the negative values could be something like minus 0.43. Whereas the maximum positive value could be 0.7. And the 95th percentile something like 0.48. One could then use for example the magnitude of minimum negative value (0.5) as the cutoff for the candidate introgressed regions.

Now that I think of it, I think that to do this properly, one would have to split the distributions not at zero, but at the genome-wide mean. I.e. subtract the genome-wide mean from all values before doing the above.

Maybe we should add this info to the tutorial/documentation @feilchenfeldt ?

Thanks a lot dear @Callithrix-omics and @millanek for your response to my issue

Dear @millanek I have asked in another issue ( Three questions about the interpretation of Dsuite results #59 ) about an issue regarding the interpretation of the results. I have read several articles that have worked with this program, but I have not found anything clear about it. I will be very grateful if you can help me in this matter

cyadrogarcia commented 1 year ago

Hello everyone: I am having some doubts and I would really appreciate if you could bear with me in this. I have ran Dsuite Dinvestigate using samples collected form a transect to investigate the introgression of P2 into P3 considering the topology ((P1,P2)P3,O). While all the D values are negative, while getting closer to the contact zone, they become less negative. I am interpreting this as indicative of a higher introgression when getting closer to this contact zone, but still not being enough to overcome the influence of P1( P1 and P3 are considered very close subspecies, almost kind of sisters). It is my interpretation correct?

Also I would like to identify windows higly introgressed by P2 i am not sure wheter I should choose those with extremely negative or extremely positive f_dM values?