simonhmartin / genomics_general

General tools for genomic analyses.
341 stars 93 forks source link

windows with fd > 1 #8

Open hditt33 opened 6 years ago

hditt33 commented 6 years ago

Hi Simon, I used your script to calculate fd statistics in sliding windows across chromosomes for my data. I used a window-size of 50kb, which resulted in 200-600 SNPs used for the analysis. I noticed that some fd values were greater than 1, which I found odd. I looked into the specific windows and I think this effect comes from sites that have SNPs only shared between P1 and P2 similar to this: P1 P2 P3 P4 0.8 0.7 0.0 0.0

These sites are not informative for ABBA/BABA statistics, but they are still used for calculating the denominator for the f-statistic, right? When you set P3=P2, because P2>P3, then this leads to the creation of "artificial" BABAs that only affect the denominator. Thus, the denominator can be smaller than the numerator and fd greater than 1. Because this is summed over the whole window, you can still get a positive D, so the fd value should be valid.

Did you experience this before? Or do you have SNP filtering criteria to avoid this? I am also worried that this could affect the fd estimate in other windows, where it is not greater than 1 but still biased upwards. What do you think?

Best, Hannes

simonhmartin commented 6 years ago

Hi Hannes, I am aware that fd can have strange behaviour and occasionally gives values > 1. In the original paper we said that this was not possible, but we were wrong. Usually, this happens in windows in which D <0, and I think the solution there is to ignore those windows or set fd=0, since there is no detectable excess of shared derived alleles between P2 and P3. However, fd > 1 can occur when D >= 0, and this tends to happen when the number of SNPs considered in the window is very small. I'm surprised that using 200 or more SNPs you still see fd > 1. The numerator and denominator are both computed by summing across all SNPs in a window, so this would require large numbers of sites that behave oddly in a single window, which my simulations suggested is very unlikely. For this reason, I previously have not seen any reason for concern about inflated fd values provided a good number of SNPs is used. But given your finding, this does seem like a realistic possibility. Is there reason to suspect something strange like a very low recombination rate in this species? One thing to try would be be to repeat the simulations as in our MBE paper and ask how often you get excessive fd values as you decrease the number of SNPs used per window. Cheers, Simon

hditt33 commented 6 years ago

Thank you for your quick answer! The species I am comparing are both self-compatible and mostly inbreeding plants, so recombination rates are probably quite low. How would this affect the fD estimates?

I looked a bit more into the issue with sites which have derived alleles only in P1 and P2 (BBAA-sites), which I described above. fhom estimates should not be affected by this sites, so I compared fhom and fd for all windows with D>0. fhom_fd

It looks like the correlation is fine until estimates reach 0.3. After this point both estimators can deviate quite strongly and fD is mostly higher than fhom. So I tested the correlation between (fd-fhom) and the relative amount of BBAA-sites in each window for all windows with fhom>0.3. The correlation is highly significant with r=0.77. fd-hom_bbaa

So I think in some cases where there is actual introgression happening, fd might overestimate the amount of introgression. What are your thoughts on this? Would there be a way to adjust fD in these cases, for example by excluding BBAA-sites?

Best, Hannes