Closed davidecarlson closed 3 years ago
Hi Dave,
Thanks for your message. That is indeed a strange problem. I just went back and double checked, and have not encountered this issue in any of the GATK-generated datasets we've been working with. Invariant sites tend to have (much) lower QUAL than variant sites, but I am seeing plenty of invariant sites with missing genotypes that still have QUAL scores. I could be missing something, but looks like the only case where there is not a QUAL score is if ALL the genotypes are missing at a site.
Would you be willing to share some info about your upstream GATK pipeline? Is it possible that there was some pre-filtering done to the VCF that dropped those QUAL scores? I know VCFtools and other tools will sometimes recalculate site-level scores unless you specify otherwise.
Let me know,
Kieran
Hi Kieran, Thanks for your response! My pipeline does have a few steps after the joint genotyping. It's a snakemake pipeline. Here are the relevant rules:
# jointly genotyping all of the samples together
rule genotype_rnd2:
input:
gvcf=config['results_loc']+"/gvcf/all.recal.g.vcf.gz",
ref=REF
output:
vcf=config['results_loc']+"/vcfs/all.recal.vcf.gz"
log:
config['results_loc']+"/logs/jointgenotype/genotype.rnd2.log"
params:
java_opts="-Xmx12G",
extra="--include-non-variant-sites true"
wrapper:
"0.50.4/bio/gatk/genotypegvcfs"
# apply some of the standard filters based partly on the BROAD recommended values
rule filter_rnd2:
input:
vcf=config['results_loc']+"/vcfs/all.recal.vcf.gz",
ref=REF
output:
vcf=config['results_loc']+"/vcfs/all.recal.filtered.vcf.gz"
log:
config['results_loc']+"/logs/jointgenotype/filter.rnd2.log"
params:
java_opts="-Xmx12G",
filter1="QD_filter",
filter1exp="QD < 3.0",
filter2="FS_filter",
filter2exp=" FS > 50.0",
filter3="SOR_filter",
filter3exp="SOR > 3.0",
filter4="MQ_filter",
filter4exp="MQ < 40.0",
filter5="MQRS_filter",
filter5exp="MQRankSum < -8.0",
filter6="RPRS_filter",
filter6exp="ReadPosRankSum < -8.0",
shell:
"gatk --java-options {params.java_opts} VariantFiltration -R {input.ref} -V {input.vcf} -O {output.vcf} "
"--filter-name '{params.filter1}' --filter-expression '{params.filter1exp}' "
"--filter-name '{params.filter2}' --filter-expression '{params.filter2exp}' "
"--filter-name '{params.filter3}' --filter-expression '{params.filter3exp}' "
"--filter-name '{params.filter4}' --filter-expression '{params.filter4exp}' "
"--filter-name '{params.filter5}' --filter-expression '{params.filter5exp}' "
"--filter-name '{params.filter6}' --filter-expression '{params.filter6exp}' 2> {log}"
# select sites where there is at least one genotype call
MAX_MISSING = len(SAMPLES) - 1
rule select_rnd2:
input:
vcf=config['results_loc']+"/vcfs/all.recal.filtered.vcf.gz",
ref=REF
output:
vcf=config['results_loc']+"/vcfs/all.recal.filtered.min1geno.vcf.gz"
log:
config['results_loc']+"/logs/selectvariants/select.rnd2.log"
params:
java_opts="-Xmx12G",
missing=MAX_MISSING
shell:
"gatk --java-options {params.java_opts} SelectVariants -R {input.ref} -V {input.vcf} --max-nocall-number {params.missing} -O {output.vcf} 2> {log}"
I just checked, and the VCF file that is produced directly from the gatk GenotypeGVCFs step does produce invariant sites that are not missing all genotypes that don't have any QUAL scores. So I don't believe that the issue is coming from downstream filtering.
Here are a few more example sites from the "all.recal.vcf.gz" file produced above:
HiC_scaffold_1 455456 . G . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455457 . A . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:57 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455458 . G . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:35 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455459 . G . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455460 . A . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455461 . C . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455462 . T . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455463 . C . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455464 . A . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
HiC_scaffold_1 455465 . C . . . DP=257 GT:DP:RGQ 0/0:50:99 0/0:25:75 ./.:0:0 0/0:1:3 0/0:25:74 ./.:0:0 ./.:0:0 0/0:29:87 ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45 ./.:0:0 ./.:0:0 0/0:17:51 0/0:24:72 0/0:22:66 ./.:0:0 0/0:9:27 0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27 0/0:25:75 ./.:0:0
In any case, I've already gotten around this by separately filtering the invariant and variant sites, and then concatenating the two data sets back together before running pixy. That solved the problem for me, and I got much more sensible results. If you don't feel like worrying about this, I certainly don't mind. Best, Dave
Awesome, thanks for the details! There really doesn't seem to be anything out of the ordinary in your pipeline, so I am not sure what's up. The only thing that I can think would be connected is the --max-nocall-number
flag in SelectVariants, but I'm not 100% sure how that one behaves. Anyway, I'm going to add a note in the docs about this, along with your solution, just in case someone else runs into it (I'll leave this issue open until that is done). Thanks again!
I don't think --max-nocall-number
is responsible as I'm seeing the same problem with the VCF produced directly from GenotypeGVCFs, without any subsequent filtering or variant selection. But in any case, thanks for taking a look!
Dave
Ah OK, right, I missed that in your last message. Out of curiosity, when you say current version of GATK, do you mean 4.2.0.0? Seems like this might have been a recent(ish) change.
Yep. v4.2.0.0 of GATK is the one I used.
Great, will have a look, thanks again for reporting this.
I've added a note about this strange issue to the docs for posterity: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html
Thanks for your help!
Deary pixy developers, Thanks for this great tool! I wanted to just offer a comment on the site-level filtration guide.
I noticed that when I followed the site filtering guide, the vast majority of my invariant sites were dropped from the VCF. This had the effect of drastically inflating the estimates of pi. Here is the filtering command I ran (mostly following the guide):
I included invariant sites in my jointly genotyped data set with the current version of GATK GenotypeGVCFs tool using the
--include-non-variant-sites
flag.I believe what is happening is that GATK GenotypeGVCFs only assigns QUAL scores to invariant sites if there are no missing genotypes. Otherwise, the variant site receives no QUAL score. For example, here is an invariant site that does have a QUAL score:
HiC_scaffold_1 937017 . G . 181.42 PASS DP=75;MLEAC=.;MLEAF=. GT:DP:RGQ 0/0:15:45 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:6:18 0/0:0:0 0/0:0:0 0/0:0:0 0/0:4:12 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:2:6 0/0:0:0 0/0:2:6 0/0:3:9 0/0:1:3 0/0:0:0 0/0:0:0 0/0:3:9 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:1:3 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0
And here is one that doesn't:
HiC_scaffold_1 567032 . T . . PASS DP=100 GT:DP:RGQ 0/0:18:54 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:11:33 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:00/0:23:69 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:6:18 ./.:0:0 0/0:1:3 0/0:2:6 0/0:6:18 0/0:7:21 0/0:5:15 0/0:5:15 ./.:0:0 0/0:1:3 0/0:3:9 0/0:1:3 0/0:2:6 0/0:4:12 0/0:4:12 ./.:0:0 ./.:0:0 0/0:1:3
Thus, when I applied the site filters to my VCF file, it dropped any invariant sites that had any missing genotypes, and I was left with a VCF file that had comparatively fewer invariant compared to variant sites. And this threw off my pi calculations (among other things).
I think the simplest solution is to just filter the variant and invariant sites separately, similarly to what is suggested in the guide for the optional population genetic filters.
But since this is not currently what the guide is recommending, I just wanted to let you know about this.
Feel free to let me know if you have any questions, or if I am missing something! Thanks! Dave