ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

How should I compare my pi results from pixy with studies that use other approaches? #41

Closed tshalev closed 3 years ago

tshalev commented 3 years ago

I have been assessing nucleotide diversity in a species of conifer and recently was able to recall invariant sites to be used in pixy. I ran pixy using a number of different parameters, such as arbitrary windows, and using bed files for single-copy ortholog genes, primary transcripts, etc. My results end up being quite similar, but now I am unsure how to compare them to previously published results that do not use pixy.

For example, using a bed file of genes I get an average pi over all sites of 0.005. A study in Populus deltoides from 2017 (https://doi.org/10.1002/ece3.3466) found an average pi of 0.001 using PopGenome and considered this estimate to be "high nucleotide diversity". In my population, I am expecting diversity to be extremely low. How would you suggest I compare pixy's outcomes to other studies that possibly do not use invariant SNPs? The unfortunate reality is that most researchers do not archive or even call invariant SNPs, so rerunning the analysis of others becomes unfeasible.

I am aware that this does not constitute a fundamental "problem" with the software, but I am wondering whether you have thought about this issue and how comparisons could be made to older studies. Perhaps an approach to standardising results by length of the window being assessed?

Thanks!

ksamuk commented 3 years ago

Hi Tal,

Thanks for your question, it is a perfectly reasonable one! The similarity between pixy's estimates and estimators that assume missing sites are homozygous for the reference allele is a function of the amount of missing data. The more missing data there are, the more biased the estimates of pi and dxy become. In cases where missing data is very high (e.g. computing over windows using RADseq or similar type data), pixy will differ a great deal from other estimators.

If one wanted to standardize estimates post hoc, the key quantity would be the number of sites in the window of interest that had sufficient depth to call a genotype in the focal individuals (i.e. those that would have appeared as invariant sites in an all-sites VCFs). Silas Tittes made a really nice command-line program for computing this quantity from bam files https://github.com/RILAB/mop. You would need access to the original bam files to compute it, but that is unfortunately the only way I am aware of doing this!

I hope that helps. This is unfortunately a very common problem and there isn't really a great solution for it other than encouraging better practices!

tshalev commented 3 years ago

Thanks Kieran. That's what I was expecting I suppose :). Glad to see at least that more people are using pixy in publications over the last year. I will check out the Mop program.