ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License

Using pixy without an all sites VCF? #100

Open milesandersonmn opened 5 months ago

milesandersonmn commented 5 months ago

I have a fairly mundane need for pi and FST estimates that are in the ballpark but not necessarily as accurate as possible. We have a huge number of samples, and I don't have the time or resources to generate individual gVCFs for them. Can I use pixy on a standard VCF without calling all sites? Is there anything I should know when doing this?

If not, are there any tools you all might recommend as an alternative?

Thanks!

ksamuk commented 5 months ago

Hi Miles,

There isn't a quick way around the missing data issue for pi/dxy, I'm afraid. All tools, including pixy, will give you biased estimates in the absence of an all-sites VCF. Note that FST doesn't have the same issue, and any tool will work for that.
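To illustrate where the bias comes from, here is a minimal sketch, not pixy's implementation. It assumes biallelic sites with no missing genotypes, and all function names and numbers are made up for the example. With a variant-only VCF, the per-site differences can still be summed, but the number of sites to divide by is unknown. Dividing by the full window length treats every uncalled site as invariant and pulls pi downward.

```python
def site_pi(alt_count: int, n_chrom: int) -> float:
    """Unbiased expected pairwise difference at one biallelic site."""
    if n_chrom < 2:
        return 0.0
    ref_count = n_chrom - alt_count
    return 2 * alt_count * ref_count / (n_chrom * (n_chrom - 1))


def window_pi(variant_sites, n_sites: int) -> float:
    """Sum per-site diversity over variant sites, divide by the sites considered."""
    return sum(site_pi(alt, n) for alt, n in variant_sites) / n_sites


# Hypothetical window: three variant sites as (alt allele count, chromosomes sampled).
variants = [(3, 10), (5, 10), (1, 10)]
window_length = 1000   # every position in the window
callable_sites = 600   # positions that could actually be genotyped (assumed value)

naive_pi = window_pi(variants, window_length)      # missing sites counted as invariant
adjusted_pi = window_pi(variants, callable_sites)  # only sites with data counted

print(f"naive: {naive_pi:.6f}  adjusted: {adjusted_pi:.6f}")
```

The numerator is identical in both calls; only the denominator changes, which is why an all-sites VCF (or some other count of callable sites) is needed for pi and dxy.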

The only alternative to the true 'all-sites' workflow that I am aware of is to use mop (https://github.com/RILAB/mop) on your BAM files, and use those results to ballpark the denominators for the estimates.
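As a rough sketch of what "ballparking the denominators" could look like: assume the callable regions are available as merged, non-overlapping, BED-style intervals (0-based, half-open). Whether mop's output takes exactly this form is an assumption here and should be checked against its documentation. The function name and the coordinates below are hypothetical.

```python
def callable_bases(intervals, win_start: int, win_end: int) -> int:
    """Count bases of the window [win_start, win_end) covered by callable intervals.

    Assumes intervals are (start, end) pairs, 0-based half-open,
    on one chromosome, and do not overlap each other.
    """
    total = 0
    for start, end in intervals:
        overlap = min(end, win_end) - max(start, win_start)
        if overlap > 0:
            total += overlap
    return total


# Hypothetical callable regions on a single chromosome.
regions = [(0, 400), (500, 700), (950, 1200)]

first_window = callable_bases(regions, 0, 1000)      # 400 + 200 + 50
second_window = callable_bases(regions, 1000, 2000)  # 200

print(first_window, second_window)
```

The per-window count would then stand in for the number of sites when dividing the summed per-site differences. This stays an approximation: it does not account for genotypes that are missing in individual samples at otherwise callable sites.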

Sorry that I can't be of more help.

Kieran