ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
117 stars 14 forks source link

Single site calculation of dxy #40

Closed BaHole closed 3 years ago

BaHole commented 3 years ago

Hi , I want to calculate my dxy in single site with pixy ,but "--window_size" is required , so my command is "pixy --stats dxy pi fst --vcf hardfiltered.vcf.gz --window_size 1 --bypass_invariant_check 'yes' --populations pop.file --n_cores 12" . However , my result of dxy contains so much " NA " ,even some of avg_dxy shows 0.5 or 1 which is too high , so I want to know how to calculate single site calculation of dxy .Thanks a lot !

ksamuk commented 3 years ago

Hi there!

Can you send me a test VCF where I can see the behavior you are describing? Its sounds like you might just have sparse data, in which case pixy is behaving as expected. Dxy is usually NA if there aren't enough genotypes in both populations to calculate an estimate. Single site estimates of dxy = 0.5 or 1.0 are by no means unusual, particularly if you have sparse data (1 would just mean a fixed difference at that site). The variability of single-site estimates (particularly when you have few genotypes per site) is one reason why it can be helpful to average pi/dxy estimates over windows.

ksamuk commented 3 years ago

Thanks, can you also include your populations file?

ksamuk commented 3 years ago

I had a quick look at your VCF and it seems like (1) you haven't included invariant sites in your VCF, (2) there is a large amount of missing/low quality data, and (3) you haven't performed site-level filtration. Together, those three issues mean that your estimates with be both biased and noisy. So, the NAs you are seeing in the pixy site-level estimates are "real" in the sense that they reflect the actual patterns of missingness in your data.

The good news is that we've included full information on how to fix all these issues in the pixy manual:

For information on generating an all-sites VCF, see here https://pixy.readthedocs.io/en/latest/generating_invar/generating_invar.html

For information on how to perform basic site-level filtration see here: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html#generate-a-vcf-with-invariant-sites-and-perform-filtering

Given the extent of missing data in your dataset, all single-site estimates of diversity/divergence will be inherently noisy. I strongly recommend you use windowed estimates (this is what most published estimates are). In your case, it probably makes the most sense to look at diversity/divergence at the level of whole transcripts (you can use a BED file for this). At a minimum, I'd recommend a window size of 1000bp (note that this is still a fairly small window size). We included single-site estimates with the idea that they would be later aggregated into windows but the user. Single site estimation is also much slower than the windowed approaches.

Once you've had a look at the manual and decided what you want to do about window size, give pixy another shot! Post a new issue here if you run into problems.