simonhmartin / genomics_general

General tools for genomic analyses.
336 stars 92 forks source link

Extremely high fd output with ABBA-BABA script #25

Open BioGeorge opened 4 years ago

BioGeorge commented 4 years ago

Hi, When using ABBABABAwindows.py to calculate introgression between groups, I come across such a strange value as the figure shows. image

I know little about the ABBA-BABA statistics. But I've read the cited reference and learned about that the fd statitic will be underestimated compared to the D statistic. I wonder how did this high value happen? Is the value right?

The command I used is shown as below. python2.7 ABBABABAwindows.py -g Selected.gono -f haplo -o ABBABABAoutput.csv -w 50000 -P1 Pop1 -P2 Pop2 -P3 Pop3 -O OG -T 10 --minData0.5 --popsFile Poplist --writeFailedWindows --haploid Q,W,E,R,T,Y,U,I,O,P,A,S,F,G,J,H

Thanks.

simonhmartin commented 4 years ago

Hi, This can happen occasionally and results from some numerical weirdness when the number of SNPs is VERY small. As you can see in your figure, the number "sitesUsed" per window is very low in the really bad windows. sitesUsed is all the sites at which there was a biallelic SNP with genotype data for at least half of your samples per population (--minData 0.5). My rule of thumb is that you need at least 100 SNPs for a reliable f estimate. It's interesting that you have so few good sites per 50kb window. Perhaps this is reduced-representation sequencing data? I would recommend increasing the window size and then using a minimum cutoff of 100 sitesUsed per window. You can do this by setting -m 100 or just do the filtering afterwards.

BioGeorge commented 4 years ago

Thanks. I will have a try.