ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

Make Pixy into a library for importing into Python scripts #52

Closed vlrieg closed 2 years ago

vlrieg commented 2 years ago

I've been using Pixy from the command line to calculate summary statistics on my VCF files and it works great. However some of my data are from simulations generated in a Python script, so it would be fantastic if I could call Pixy functions from within the simulation script itself instead of having to print to VCF first and then read in those output files with yet another script to do analysis on the results.

I'm working on implementing this myself but it's been slow going :) I'm planning to have my simulated genotype data in array format as described by the tskit documentation here: https://tskit.dev/tutorials/getting_started.html#scikit-allel.

I know tskit has its own built-in summary statistics functions but I want to calculate my statistics with the same tools for both my simulated and observed datasets, and I haven't had success with trying to import my observed data into tskit format via tsinfer.

Thanks for considering this request!

ksamuk commented 2 years ago

Hi Val! I think you can basically achieve this by using scikit-allel. The very important step for calculating pi/dxy correctly is the use of an accessibility mask (a list of sites that have sufficient genotyping depth to be included). You'll want to double-check, but the diversity/divergence stats in scikit-allel should be unbiased (and identical to pixy) as long as you include invariant sites and/or a mask. We use custom-made functions for our summary stat calculations, which you could also import if you wanted (they are compatible with all the scikit-allel data structures). Hope that helps!

Edit: I should mention that this would not include a lot of the convenience/quality of life functions included with pixy (window sizes, all the input file validations, etc.), which is really the bulk of what pixy does.

vlrieg commented 2 years ago

That's super helpful to know, thanks!!! I have gotten scikit-allel functions working with this data already so I'll double check that it matches the pixy output.