ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

No invariant sites found in simulated VCF file #28

Closed vlrieg closed 3 years ago

vlrieg commented 3 years ago

Describe the bug I generated a number of VCF files from simulations done with msprime/tskit, but Pixy doesn't recognize that there variants in the file. As far as I can tell, it isn't an issue with the formatting of the VCF file (tab separation seems ok in BBedit). This same command works fine on "real" VCFs generated by GATK, so it's not an installation problem. I also tried adding ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> to the header of my simulated VCF with the same result.

The error message from the command line $ pixy --stats pi --vcf 0-simple-africa.vcf.gz --population sim_pop.txt --window_size 1000 --output_prefix test-out

[pixy] pixy 1.0.4.beta1
[pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters...
[pixy] Checking write access...OK
[pixy] Checking CPU configuration...OK
[pixy] Checking for invariant sites...Exception: [pixy] ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'.

OS information MacOS Big Sur 11.2.3

Sample files

##fileformat=VCFv4.2
##source=tskit 0.3.5
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=50000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  tsk_0   tsk_1   tsk_2   tsk_3   tsk_4   tsk_5   tsk_6   tsk_7   tsk_8   tsk_9   tsk_10  tsk_11  tsk_12  tsk_13  tsk_14  tsk_15  tsk_16  tsk_17  tsk_18  tsk_19  tsk_20  tsk_21  tsk_22  tsk_23  tsk_24  tsk_25  tsk_26  tsk_27  tsk_28  tsk_29  tsk_30  tsk_31  tsk_32  tsk_33  tsk_34  tsk_35  tsk_36  tsk_37  tsk_38  tsk_39  tsk_40  tsk_41  tsk_42  tsk_43  tsk_44  tsk_45
1   409 .   A   C   .   PASS    .   GT  1   0   0   1   1   0   0   1   1   0   1   0   0   0   0   1   0   1   1   0   0   1   1   1   1   0   1   0   0   1   1   1   0   0   0   0   1   1   1   1   0   1   0   0   0   0
1   510 .   A   C   .   PASS    .   GT  0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1   599 .   A   G   .   PASS    .   GT  0   1   1   1   0   1   0   1   0   1   0   0   0   0   1   0   0   0   0   1   1   0   0   0   0   1   0   1   1   0   0   0   1   1   0   1   0   0   0   0   0   0   1   1   1   1
1   617 .   C   G   .   PASS    .   GT  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1   675 .   T   C   .   PASS    .   GT  0   1   1   1   0   1   0   1   0   1   0   1   0   1   1   0   1   0   0   1   1   0   0   0   0   1   0   1   1   0   0   0   1   1   1   1   0   0   0   0   0   0   1   1   1   1
1   885 .   A   T   .   PASS    .   GT  0   0   0   0   1   0   1   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   1   0   0   0   0
ksamuk commented 3 years ago

Hi Valerie! That check is actually saying there are no invariant sites in your file, which seems to be correct. In real datasets, invariant sites are necessary to get accurate estimates of pi (our paper explores this in more detail). However, in the case of simulated data (and really only in that case!), you can safely assume that anything missing is indeed invariant. So for your purposes, its totally fine to re-run with the --bypass_invariant_check flag, which will give accurate estimates for your simulated data.

vlrieg commented 3 years ago

Ok clearly it was long past time for me to step away from my computer and take a break :flushed: "But there are lots of variant positions in this vcf file..." :woman_facepalming:

Looks like tskit's ts.diversity() function is working differently than Pixy in calculating Pi somehow. Gonna have to look into that more before I start comparing with my real data. I really appreciate that Pixy makes it so easy to work with my haploid genome data (from P. vivax)! I've tried the "haploid mode" version of VCFtools but Pixy is giving me results more like what I expected to see based on working with my own scripts for calculating pi in vivax.

Cheers, Val

XiaXiaTianTian commented 1 year ago

Hi Valerie! That check is actually saying there are no invariant sites in your file, which seems to be correct. In real datasets, invariant sites are necessary to get accurate estimates of pi (our paper explores this in more detail). However, in the case of simulated data (and really only in that case!), you can safely assume that anything missing is indeed invariant. So for your purposes, its totally fine to re-run with the --bypass_invariant_check flag, which will give accurate estimates for your simulated data.

Hi, Ksamuk. I am stuck in the same problem. In my own dataset generated by GATK4 and filtered those sites that are non-biallelic, I run like this vcf=biallelic.vcf.gz pop=pixy_ID_pop.txt set NUMEXPR_MAX_THREADS=68 pixy --stats pi fst dxy \ --vcf $vcf \ --populations $pop \ --window_size 10000 \ --bypass_invariant_check no \ --n_cores 8 Then it showed Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)Exception: [pixy] ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'. And did not get any result. The version is 1.2.7.beta1. I wondered whether I need to change to '--bypass_invariant_check yes'.

Thanks! Xia

ksamuk commented 1 year ago

Hi Xia!

You'll want to confirm that you have invariant sites prior to any filtering, as GATK does not emit this by default. If they are indeed in your VCF, filtering out non-biallelic sites will usually also remove invariant sites. You can see one way to get around this issue here: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html#generate-a-vcf-with-invariant-sites-and-perform-filtering (see under heading "Optional: Population genetic filters").

Running with '--bypass_invariant_check yes' is only recommended for simulated data, and will result in incorrect estimates of pi and dxy in the absence of invariant sites. Hope that helps!

Kieran

XiaXiaTianTian commented 1 year ago

Hi, Kieran! I tried to get invariant sites by adding '-all-sites' in GATK GenotypeGVCFs, and didn't filter it, then combined those two vcf. I tried different vcf. But I still got the same error Exception: [pixy] ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'. Does this mean there are not any invariant sites? Can I just use --bypass_invariant_check 'yes' ? Any suggestions will be appreciated.

Xia