simonhmartin / genomics_general

General tools for genomic analyses.
339 stars 93 forks source link

Suggestion for behavior of -m flag in popgenWindows.py #24

Open MafaldaSFerreira opened 4 years ago

MafaldaSFerreira commented 4 years ago

Hi Martin,

While running popgenwindows on an alignment with several species (I have a single individual representing each species) to calculate Dxy between species pairs, I noticed that if I have several species in the geno file, setting -m will exclude windows with less then "m" sites for the global alignment with all species, but not for the specific comparison of two species. For example, a given window might have information for 2000 sites if I take all individuals into account, but when comparing individuals A and B, in fact there might only be 1000 sites. If -m is set to 2000, this window of 1000 sites will not be excluded for the specific comparison of A and B. Also, "sites" in the output will be 2000, reflecting the information for the global alignment, so I can't filter out these windows a posteriori in the specific pairwise comparisons.

For testing purposes, I've been going around this by creating an intermediate geno file just with the two individuals I wish to compare and with no missing data, but the process of creating this intermediate file for each comparison and then running popgenwindows is quite time consuming and takes up a lot of space. Using the same example above, I've confirmed that indeed the Dxy calculations are taking into account the available sites between the comparison of A and B (i.e., using the example above, the script is dividing differences in the window by the 1000 informative sites between A and B, and not by 2000 sites that exist in the entire alignment). So, would it be possible to add "nan" for this particular window and pairwise comparison, for example? Or is there a way to output the number of sites per window for a specific species pair, so it would be possible to filter out specific windows a posteriori?

Thank you, Mafalda

simonhmartin commented 4 years ago

Hi Mafalda, This is a really good suggestion and something that I have thought about several times but never implemented. I will try get to it soon. Simon