ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks

Dxy values interpretation? #92

Closed MarinaSci closed 9 months ago

MarinaSci commented 9 months ago

Dear Kieran,

Thank you for pixy! I work with a mix of pooled and individual data: the individual data come from worms, and the pooled data come from a large population of eggs found in poo samples. I am interested in analysing mitochondrial genomes and nuclear repeat data from both. I use grenedalf to calculate Dxy/Pi from the pooled data, and I have started using pixy (Dxy/Pi) to process the individuals. I have some populations where n=1 (I know, not ideal), so unfortunately I could not calculate pi for them.

For sample pairs, I am trying to better understand the Dxy output from pixy. In terms of interpretation, I assumed that Dxy follows the same principle as Fst (low Fst = low genetic differentiation). I ran a PCA (based on allele frequencies), Dxy and Fst for some individual data, and I am not sure how to interpret the Dxy values because they disagree with Fst and the PCA: I get low Dxy values but higher Fst values between populations, and the PCA plots show distinct clustering. For populations that should be very diverse, I get very low Dxy (output attached), which would indicate that they are 'mixing'.

I tried filtering for DP (> 10) and GQ (> 30), as you suggest in the paper, and I get the same results. How do I interpret a value of Dxy = 0.0647012529439746 between China and Honduras, for example, when the corresponding Fst is 0.7393705489876253? Could it be because I am analysing mitochondrial genomes? Are there any other assumptions behind Dxy as calculated by pixy? Should I filter the VCF further before running pixy?

I am attaching a subset of the VCF, the populations file and sharing the command I use. Any help is greatly appreciated, thank you!!

All the best, Marina

Command for pixy: pixy --stats pi fst dxy --vcf TT_bcftools_for_pixy_NOMINIMUMALLELEFREQ_mtDNA_genes_invds.recode.vcf.gz --chromosomes 'NC_017750_Trichuris_trichiura_mitochondrion_complete_genome' --populations TT_INDVs.pops --window_size 20000 --n_cores 4 --bypass_invariant_check 'yes' --fst_type 'hudson' --output_prefix TT_indvs_pixy_output_hudson

Files to reproduce result: TT_bcftools_for_pixy_NOMINIMUMALLELEFREQ_mtDNA_genes_invds_n5000.recode.vcf.gz

TT_INDVs.pops.txt

TT_indvs_pixy_output_hudson_fst.txt TT_indvs_pixy_output_hudson_dxy.txt

ksamuk commented 9 months ago

Hi Marina,

Unfortunately, I'm not able to help with the biological interpretation of your data. I will say that mitochondria tend to have different evolutionary dynamics and diversity than the nuclear genome, and that low sample sizes can result in high variance in all estimates of pi/dxy/FST.
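
As a rough illustration of why a low Dxy and a high FST are not necessarily in conflict, here is a minimal sketch. It assumes the usual Hudson-style relationship, in which FST is one minus the ratio of mean within-population diversity to Dxy. The function names are hypothetical and this is not pixy's code; pixy's windowed estimates are built from per-site sums, and pi is undefined for populations with n=1, so the implied value below is only indicative.

```python
# Sketch only: assumes FST ~= 1 - mean(pi_within) / dxy (Hudson-style).
# Function names are illustrative and are not part of pixy.

def hudson_fst(pi_x, pi_y, dxy):
    """Relative differentiation from two within-population pi values and Dxy."""
    if dxy <= 0:
        raise ValueError("dxy must be positive")
    return 1.0 - ((pi_x + pi_y) / 2.0) / dxy


def implied_mean_pi(dxy, fst):
    """Mean within-population pi implied by a given Dxy and Hudson-style FST."""
    return dxy * (1.0 - fst)


# Values quoted in the question (China vs Honduras).
dxy = 0.0647012529439746
fst = 0.7393705489876253

mean_pi = implied_mean_pi(dxy, fst)
print(f"implied mean within-population pi: {mean_pi:.4f}")

# Plugging the implied pi back in recovers the reported FST.
print(f"round-trip FST: {hudson_fst(mean_pi, mean_pi, dxy):.4f}")
```

Under that assumption, the two reported values together imply a mean within-population diversity of roughly 0.017, about a quarter of Dxy. Dxy is an absolute measure (average differences per site between populations), whereas FST is relative to within-population diversity, so FST can be high even when Dxy is numerically small.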

If you have any issues running or installing pixy, or believe you've discovered an error, please reopen the issue.

All the best,

Kieran