Closed vlrieg closed 3 years ago
Hi Valerie! That check is actually saying there are no invariant sites in your file, which seems to be correct. In real datasets, invariant sites are necessary to get accurate estimates of pi (our paper explores this in more detail). However, in the case of simulated data (and really only in that case!), you can safely assume that anything missing is indeed invariant. So for your purposes, it's totally fine to re-run with the --bypass_invariant_check flag, which will give accurate estimates for your simulated data.
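To illustrate why the invariant sites matter, here is a toy sketch in Python. It is not pixy's actual code: the helper names (`site_counts`, `pi`) and the genotype lists are made up for this example, and it assumes haploid calls with `None` standing in for a missing genotype. It computes pi as total pairwise differences divided by total pairwise comparisons, once with the invariant sites included and once with only the variant sites.

```python
from itertools import combinations

def site_counts(genotypes):
    """Return (differences, comparisons) for one site of haploid calls.

    Missing calls are represented as None and are skipped.
    """
    called = [g for g in genotypes if g is not None]
    pairs = list(combinations(called, 2))
    diffs = sum(1 for a, b in pairs if a != b)
    return diffs, len(pairs)

def pi(sites):
    """Sum differences and comparisons over all sites, then take the ratio."""
    total_diffs = 0
    total_comps = 0
    for genotypes in sites:
        diffs, comps = site_counts(genotypes)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps if total_comps else float("nan")

# Hypothetical data: 2 variant sites (one with a missing call) and 8 invariant sites.
variant_sites = [[0, 1, 0, 0], [1, 1, 0, None]]
invariant_sites = [[0, 0, 0, 0] for _ in range(8)]

pi_all_sites = pi(variant_sites + invariant_sites)
pi_variant_only = pi(variant_sites)

print(f"pi with invariant sites:    {pi_all_sites:.4f}")
print(f"pi with variant sites only: {pi_variant_only:.4f}")
```

In this toy case, dropping the invariant sites shrinks the denominator and inflates the per-site estimate, which is the situation the check is guarding against.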
Ok clearly it was long past time for me to step away from my computer and take a break :flushed: "But there are lots of variant positions in this vcf file..." :woman_facepalming:
Looks like tskit's ts.diversity() function is working differently than Pixy in calculating Pi somehow. Gonna have to look into that more before I start comparing with my real data. I really appreciate that Pixy makes it so easy to work with my haploid genome data (from P. vivax)! I've tried the "haploid mode" version of VCFtools but Pixy is giving me results more like what I expected to see based on working with my own scripts for calculating pi in vivax.
Cheers, Val
Hi, Ksamuk. I am stuck on the same problem. With my own dataset, generated by GATK4 and filtered to remove non-biallelic sites, I ran pixy like this:
vcf=biallelic.vcf.gz
pop=pixy_ID_pop.txt
set NUMEXPR_MAX_THREADS=68
pixy --stats pi fst dxy \
--vcf $vcf \
--populations $pop \
--window_size 10000 \
--bypass_invariant_check no \
--n_cores 8
Then it showed these errors:
Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)
Exception: [pixy] ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'.
I did not get any results.
The version is 1.2.7.beta1.
I wonder whether I need to change this to '--bypass_invariant_check yes'.
Thanks! Xia
Hi Xia!
You'll want to confirm that you have invariant sites prior to any filtering, as GATK does not emit these by default. If they are indeed in your VCF, filtering out non-biallelic sites will usually also remove invariant sites. You can see one way to get around this issue here: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html#generate-a-vcf-with-invariant-sites-and-perform-filtering (see under heading "Optional: Population genetic filters").
Running with '--bypass_invariant_check yes' is only recommended for simulated data, and will result in incorrect estimates of pi and dxy in the absence of invariant sites. Hope that helps!
Kieran
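One rough way to confirm whether a VCF has any invariant sites is to count records whose ALT column is ".", which is the pattern the error message refers to. The sketch below is a standalone helper, not part of pixy: the function name `count_sites` is made up, and it assumes a tab-separated VCF (plain or gzip/bgzip compressed) where invariant sites are written with ALT = ".".

```python
import gzip
import sys

def count_sites(vcf_path):
    """Return (invariant, variant) record counts for a VCF file.

    Assumption: invariant sites are the records whose ALT column is ".".
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    invariant = 0
    variant = 0
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # not a complete VCF record
            if fields[4] == ".":  # column 5 is ALT
                invariant += 1
            else:
                variant += 1
    return invariant, variant

if __name__ == "__main__" and len(sys.argv) > 1:
    n_invariant, n_variant = count_sites(sys.argv[1])
    print(f"invariant sites: {n_invariant}")
    print(f"variant sites:   {n_variant}")
```

Running it on the VCF before and after each filtering step should show at which step the invariant sites disappear, if they were emitted in the first place.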
Hi, Kieran!
I tried to get invariant sites by adding '-all-sites' in GATK GenotypeGVCFs, and didn't filter the output, then combined those two VCFs. I tried different VCFs, but I still got the same error:
Exception: [pixy] ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'.
Does this mean there are no invariant sites? Can I just use --bypass_invariant_check 'yes'?
Any suggestions will be appreciated.
Xia
Describe the bug I generated a number of VCF files from simulations done with msprime/tskit, but Pixy doesn't recognize that there are variants in the file. As far as I can tell, it isn't an issue with the formatting of the VCF file (tab separation seems ok in BBEdit). This same command works fine on "real" VCFs generated by GATK, so it's not an installation problem. I also tried adding
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
to the header of my simulated VCF, with the same result.
The error message from the command line
$ pixy --stats pi --vcf 0-simple-africa.vcf.gz --population sim_pop.txt --window_size 1000 --output_prefix test-out
OS information MacOS Big Sur 11.2.3
Sample files