ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

Understanding how missing data are taken into account in avg_pi #59

Closed gsgarlata closed 2 years ago

gsgarlata commented 2 years ago

Hi Kieran,

First of all...Thanks for this tool. It is very useful. I have some doubts on the results of my analysis. Looking at the output file ('test_pi.txt'), my understanding is that, for a given window, the average pi within population is given by the 'avg_pi' column. The 'avg_pi' is computed by doing the ratio between 'count_diffs' and 'count_comparisons'. 'count_comparisons' should contain the number of all possible pairwise comparisons between haploid genotypes, excluding missing data comparisons. However, looking at my output, this seems to not be the case (again assuming I have correctly interpreted the output).

Consider for instance the case of 'pop 8' for window_pos_1 at 140001 (you can find the full output and input files in attachment):

pop chromosome window_pos_1 window_pos_2 avg_pi no_sites count_diffs count_comparisons count_missing 8 1 140001 150000 0.001771255060728745 304 49 27664 28

avg_pi = count_diffs / count_comparisons = 49 / 27664 = 0.001762011

I know that seven diploid individuals are included in 'pop 8', which is equivalent to 14 haploid genotypes. Now if I compute all pairwise comparisons among the 14 haploid genotypes, I get (14 13) / 2 = 91 comparisons for one site. If I multiply 91 by the number of sites, I get: 91 304 = 27664, which is exactly the 'count_comparisons' value I have in the output. The way I computed it do not take into account 'missing data' comparisons, which in this case are at least 28. What I do not understand then is how 'avg_pi' takes into account missing data (by excluding them). Am I missing something?

Here is the command I used to run pixy, including all arguments:

input=chromo1.vcf.gz popfile=population_file_p22.txt wsize=10000 threads=10 prefix=test

python $path_pixy/pixy --stats pi --vcf $input --populations $popfile --window_size $wsize --n_cores $threads --output_folder $output_dir --output_prefix $prefix

chromo1.vcf.gz test_pi.txt population_file_p22.txt

Thank you a lot in advance,

Bests,

Gabriele Sgarlata

ksamuk commented 2 years ago

Hi Gabriele!

My pleasure, happy to hear pixy is useful in your work. I just checked and looks like there is a bug with how the 'missing' column is being calculated for pi (it should be multiplied by 13 in your case). Its not used directly in the pi calculation, and so it seems to have slipped our tests. I'll get that updated ASAP. If I remember correctly, we changed what this column represented at some point (used to be the total number of missing genotypes across all sites, I think) but it looks like we only updated it for dxy and not pi.

Re: the actual values:

This case where a subpopulation has totally missing data is a little odd because the "missing comparisons" are only counted for sites with data in other populations (i.e. we don't include missing comparisons that are not represented in the VCF at all), but the no_sites value is adjusted to exclude them. So the missing column is more of a measure of genotype sparseness at sites that are represented in the VCF rather than overall missingness (although that can be determined with the no_sites column).

Anyway hope that helps. I'll get the fix uploaded ASAP.

Cheers,

Kieran

ksamuk commented 2 years ago

Also, quick point of order, the pi value is indeed 49 / 27664, but that evaluates to 0.00177125506 (depending on sig digs, which is the same as the pixy output) and not 0.001762011.

gsgarlata commented 2 years ago

Hi Kieran,

Thanks a lot for having taken the time to clarify my doubts. Now, it is very clear. For your information, I am applying pixy on RAD-seq data, which may explain why there are a lot of missing data across the VCF file. Also, yes, you are right about the pi value (49 / 27664), I did this simple calculation in R, which I have set it with a lower number of sig digs.

Again..Thanks!

Bests,

Gabriele

ksamuk commented 2 years ago

Sure thing! And thanks for pointing this out and for making it easy to quickly find the issue. We actually designed pixy with RAD-seq in mind, so I'm glad to hear it is useful for you!

The update to the "count_missing" calculation is now up on Github (https://github.com/ksamuk/pixy/releases/tag/1.2.7.beta1), and the release with the fix will be on conda-forge in a day or so.

Cheers,

Kieran