simonhmartin / genomics_general

General tools for genomic analyses.
343 stars 93 forks source link

popgenWindows.py Dxy estimates are large with vcf with invariant sites #97

Closed MatteoSebastianelli closed 1 year ago

MatteoSebastianelli commented 1 year ago

Hi Simon,

Thank you for providing he community with this important tool! I just have a question, using your popgenWindows.py to calculate genome-wide Dxy (my vcf includes invariant sites), I noticed that the estimates I was getting are two orders of magnitude than they tipically are for birds. Is this python script getting per-SNP estimates rather than per-nucleotide?

Best, Matteo

simonhmartin commented 1 year ago

Hi Matteo, Are your values higher or lower than you expected (I think you left out a word)? People often see very high values using my script when invariant sites are excluded, but if yours includes invariants, there is no reason to expect a difference. We have found very similar numbers to those from pixy for example. Simon

MatteoSebastianelli commented 1 year ago

Hi Simon,

yes, I did leave out a word! I meant higher. The problem is that I am getting similar estimates when using both vcfs with and without invariant sites. I wonder whether filtering (e.g. max-missingness, maf...) might have something to do with this?

Matteo

simonhmartin commented 1 year ago

Hi Matteo, Sorry for the delay. This is very strange. If you send me a small subset of your input files (with and without invariants) I can try to diagnose what is happening. Simon

MatteoSebastianelli commented 1 year ago

Hi Martin,

Sure, I have uploaded the subsets here: https://drive.google.com/drive/folders/1S62j7N1gy7I8j211u49AKx0WXKXSjVzT?usp=sharing

Best, Matteo

simonhmartin commented 1 year ago

Hi Matteo, Your "all sites" file does not contain invariant sites.

#CHROM  POS ID  REF ALT
SUPER_25    24381   .   T   C
SUPER_25    104391  .   T   G
SUPER_25    104404  .   T   G
SUPER_25    107757  .   T   G

Note that each site has both a reference and alternate allele, and all the thousands of positions between these variants are not present.

MatteoSebastianelli commented 1 year ago

Hi Simon,

Sorry for the delay in the response, but I wanted to make sure I ran everything correctly. I ran the pipeline again and made sure the file actually contains invariant sites as well, and after running your script I now get the expected values of Dxy. Thank you for your help and sorry for bothering you with this!!

Matteo

simonhmartin commented 1 year ago

Hi Matteo, Great - thanks for confirming. All the best!