ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

"the provided VCF appears to contain no invariant sites" #89

Closed MafaldaSFerreira closed 9 months ago

MafaldaSFerreira commented 10 months ago

I am running pixy with per chromosome vcf files containing variant and invariant sites. However, for some of the files, pixy reports that there are no invariant sites and fails with the following error:

Checking for invariant sites...Exception: [pixy] ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'.

I am running pixy with the following command (this is an example for chromosome 10):

pixy --stats pi fst dxy --populations population_files/clusters_v01.txt --vcf chromosomes_maf0.01/herring_sentieon_125ind_230722_filter_setGT_miss0.2.vcf.minDP4.0maxDP3.0avg_miss0.2.maf0.01.ALLSites.chr10.vcf.gz --window_size 20000 --n_cores 2 --output_folder results/clusters_v01_maf0.01_20231011 --output_prefix clusters_v01.chr10.maf0.01.20kb.popgenpixy.out

I do have invariant sites in the vcf files. Here is a small subset of such sites for chromosome 10:

bcftools view --max-ac 2:nref herring_sentieon_125ind_230722_filter_setGT_miss0.2.vcf.minDP4.0maxDP3.0avg_miss0.2.maf0.01.ALLSites.chr10.vcf.gz | bcftools query -f '%CHROM\t%REF\t%ALT\t[%GT\t ]\n' | head -n1000 > chr10.maf0.01.1000.invariant.txt

And if I run bcftools stats on the vcf of chromosome 10, it says the total number of no-Alts is 17184629, whereas the number of SNPs is 492509 (also attaching the output of bcftools stats).

The only thing I could think of is that in this particular vcf the variant sites start before the invariant sites... Would that trigger the warning?

chr10.maf0.01.1000.invariant.txt herring_sentieon_125ind_230722_filter_setGT_miss0.2.vcf.minDP4.0maxDP3.0avg_miss0.2.maf0.01.ALLSites.chr10.stats.txt

ksamuk commented 10 months ago

Hi There!

The check for invariant sites is fairly conservative, so if you are 100% sure that you have invariant sites correctly represented in your vcf, you can bypass the check by including --bypass_invariant_check 'yes'.

MafaldaSFerreira commented 10 months ago

I did run with —bypass_invariant_check “yes” but the results I got for the dxy scan were quite strange. Blocks of very high dxy interspersed with blocks of dxy 0 across a chromosome… Which don’t look like anything I have seen before for other datasets I have worked with in the species. Maybe there is something funky in our filtering this time, and I will double check that first and come back if I have further questions.

Thanks, Mafalda

On 20 Oct 2023, at 15:19, Kieran Samuk @.***> wrote:

Hi There!

The check for invariant sites is fairly conservative, so if you are 100% sure that you have invariant sites correctly represented in your vcf, you can bypass the check by including --bypass_invariant_check 'yes'.

— Reply to this email directly, view it on GitHub https://github.com/ksamuk/pixy/issues/89#issuecomment-1772729169, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGVCG6YCVQHECGVHWHGX6CLYAJ275AVCNFSM6AAAAAA53RYFQWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZSG4ZDSMJWHE. You are receiving this because you authored the thread.