ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

Why are the results so different when using pixy and vcftools #83

Closed Wennie-s closed 12 months ago

Wennie-s commented 1 year ago

Hi ksamuk, I am using pixy to calculate the nucleotide diversity, but the results are very strange. Let me explain my procedure: (1) First I used GATK 4 to get all sites (variants + inviriants) (2)Then I used VCTtools for site-level filtration: "--remove-indels --max-alleles 2 --minGQ 10 --minQ 30 --max-missing 0.9 --minDP 5" (3) Finally I used pixy to calculate the nucleotide diversity: "pixy --stats pi --vcf Chr10_1.vcf.gz --populations sample.txt --window_size 100000 --n_cores 16 --output_folder /wtmp/user003/Reseq/83samples/Pi/pixy/result --output_prefix Chr10_1"

However, I found that the nucleotide diversity calculated by pixy was too high, for example, for the same window and sam Chromosome, results from pixy is: image but the results from vcftools is (use only variants): image

I can understand that there might be a difference between the two softwares, but the difference is almost 50-80 times. I don't quite understand why this is the case. In fact, for our endangered conifers, the nucleotide diversity should not exceed 0.01.

Can you help me? Thank you!

Wennie-s commented 1 year ago

Sorry, there is a problem with the results of the above vcftools calculation, the real result is: image Can you help me? Thank you!

ksamuk commented 1 year ago

Hi there, pixy only works with VCFs containing invariant sites. The filtration you performed with VCFtools removes all invariant sites. Have a look at the documentation for how to preserve them during filtration.

Wennie-s commented 1 year ago

Thank you very much for your reply, I used vcftools to re-filter all-site files and re-calculated pixy: (1) /data/soft/vcftools/bin/vcftools --gzvcf Chr2_2.vcf.gz --remove-indels --max-missing 0.8 --min-meanDP 10 --recode --recode-INFO-all --stdout; (2) pixy --stats pi --vcf Chr2_2.filter.new.vcf.gz --populations /wtmp/user003/Reseq/83samples/Pi/pixy/sample.txt --window_size 100000 --n_cores 60. but the values are still high: pixy_result: image vcftools result: image Because this species is an endangered species and the population is small, I think the nucleotide diversity should not exceed 0.1, so I am confused~

Wennie-s commented 1 year ago

Because this species is an endangered species and the population is small, I think the nucleotide diversity should not exceed 0.01, so I am confused~

ksamuk commented 1 year ago

Hi Wennie,

pixy and VCFtoos will differ in their estimates of pi based on the amount of missing data. This difference was, in fact, the entire motivation behind creating pixy, and is fully explored in our paper. I encourage you to read it, so you might understand what pixy does differently. The fact that VCFtools and pixy differ in their estimates is always expected and not indicative of a problem.

Regarding "I think the nucleotide diversity should not exceed 0.01, so I am confused", I might ask: why do you expect that specific number? Assuming you've formatted your data correctly, the pi estimates from your current data are not biased in any particular way (again, this was the whole point of pixy). Note that they are quite close to 0.01. Nucleotide diversity varies across the genome, and between samples, so the idea that they are "wrong" because they exceed some value in a particular window seems very strange to me.

I can't help you interpret your results, but if you have a specific reason why you believe pixy is producing incorrect results, please let me know.

Kieran