simonhmartin / genomics_general

General tools for genomic analyses.
322 stars 92 forks source link

ABBABABA.py encountered in double_scalars #81

Open Dkaaaaa opened 2 years ago

Dkaaaaa commented 2 years ago

Hello, simonh martin

When I run the following command ABBABABAwindows.py -w 200000 -s 5000 -g chr-level-output.geno.txt -o test.csv -f diplo -T 80 -P1 XXVIb -P2 XXVIe -P3 XXVIa_19 -O XXVIc --popsFile 34453b-group.txt --addWindowID -m 10 --writeFailedWindows

It returns the following warning: /data/ccy/software/genomics_general-master/genomics.py:1386: RuntimeWarning: invalid value encountered in double_scalars return f4(p1,p2,p3,p4).sum()1./f4(p1,pd,pd,p4).sum() /data/ccy/software/genomics_general-master/genomics.py:1370: RuntimeWarning: invalid value encountered in double_scalars return f4(p1,p2,p3,p4).sum()1./((1 - p1)p2p3(1-p4) + p1 (1-p2)p3(1-p4)).sum() /data/ccy/software/genomics_general-master/genomics.py:1414: RuntimeWarning: invalid value encountered in double_scalars return f4(p1,p2,p3,p4).sum()*1./denom.sum()

and my output has a lot of nan value in the column of sitesused

![Uploading image.png…]()

And I think may be too few snps in my setting windows, but it seems to not work after using a larger window size or a smaller --minSites sites , it's my data or command problems? and what kind of sites will ABBABABAwindows.py choose?

here is my input genotype file and popgroups file chr-level-output.geno.txt

34453b-group.txt

best wishes!

simonhmartin commented 2 years ago

Hi ChanyeeNicole, I've had a look at your data, and re-run your command. While there are some windows with no usable SNPs (see the sitesUsed column), the bigger issue is that even where there are sufficient numbers of SNPs, there is almost zero sequence divergence between your P1 and P2 populations. I figured this out by running popgenWindows.py. In all the windows with nan for the D and fd statistics, the sequence divergence between XXVIb and XXVIe is zero. The ABBA BABA test relies on the existence of sites where allele frequency differs between the P1 and P2 populations. If they are too similar, there is simply no signal with which to detect gene flow. The warning sign here is that, despite having several hundred usable SNPs in each window, the number of ABBA and BABA sites observed in most windows is zero. The warnings come from the fact that the script is trying to divide by the total number of observed ABBA and BABA, and it cannot divide by zero.

Another minor issue is that you listed your data as diplo format, but actually it looks haploid. diplo format encodes heterozygous sites using ambiguity codes, but I don't see any of those. It is possible to specify ploidy. I noticed that this was not too easy in the existing version of the script, so I actually just updated it to make it easier to specify ploidy. With the version I have just pushed to github, you can specify -f haplo --ploidy 1 but note this will ONLY work if you download the latest version of the script. Simon