ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
117 stars 14 forks source link

Filtering on site quality removes many invariant sites in a VCF from GATK #37

Closed davidecarlson closed 3 years ago

davidecarlson commented 3 years ago

Deary pixy developers, Thanks for this great tool! I wanted to just offer a comment on the site-level filtration guide.

I noticed that when I followed the site filtering guide, the vast majority of my invariant sites were dropped from the VCF. This had the effect of drastically inflating the estimates of pi. Here is the filtering command I ran (mostly following the guide):

#!/usr/bin/env bash

# Apply site filters to VCF

MISSING=0.8
QUAL=30
MINDP=5
MAXDP=100

vcftools --gzvcf myvariants.vcf.gz \
--remove-indels \
--remove-filtered-all \
--minQ $QUAL \
--max-missing $MISSING \
--min-meanDP $MINDP \
--max-meanDP $MAXDP \
--recode --stdout | gzip -c > myvariants.filtered.vcf.gz

I included invariant sites in my jointly genotyped data set with the current version of GATK GenotypeGVCFs tool using the --include-non-variant-sites flag.

I believe what is happening is that GATK GenotypeGVCFs only assigns QUAL scores to invariant sites if there are no missing genotypes. Otherwise, the variant site receives no QUAL score. For example, here is an invariant site that does have a QUAL score:

HiC_scaffold_1 937017 . G . 181.42 PASS DP=75;MLEAC=.;MLEAF=. GT:DP:RGQ 0/0:15:45 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:6:18 0/0:0:0 0/0:0:0 0/0:0:0 0/0:4:12 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:2:6 0/0:0:0 0/0:2:6 0/0:3:9 0/0:1:3 0/0:0:0 0/0:0:0 0/0:3:9 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 0/0:1:3 0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0

And here is one that doesn't:

HiC_scaffold_1 567032 . T . . PASS DP=100 GT:DP:RGQ 0/0:18:54 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:11:33 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:00/0:23:69 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:6:18 ./.:0:0 0/0:1:3 0/0:2:6 0/0:6:18 0/0:7:21 0/0:5:15 0/0:5:15 ./.:0:0 0/0:1:3 0/0:3:9 0/0:1:3 0/0:2:6 0/0:4:12 0/0:4:12 ./.:0:0 ./.:0:0 0/0:1:3

Thus, when I applied the site filters to my VCF file, it dropped any invariant sites that had any missing genotypes, and I was left with a VCF file that had comparatively fewer invariant compared to variant sites. And this threw off my pi calculations (among other things).

I think the simplest solution is to just filter the variant and invariant sites separately, similarly to what is suggested in the guide for the optional population genetic filters.

But since this is not currently what the guide is recommending, I just wanted to let you know about this.

Feel free to let me know if you have any questions, or if I am missing something! Thanks! Dave

ksamuk commented 3 years ago

Hi Dave,

Thanks for your message. That is indeed a strange problem. I just went back and double checked, and have not encountered this issue in any of the GATK-generated datasets we've been working with. Invariant sites tend to have (much) lower QUAL than variant sites, but I am seeing plenty of invariant sites with missing genotypes that still have QUAL scores. I could be missing something, but looks like the only case where there is not a QUAL score is if ALL the genotypes are missing at a site.

Would you be willing to share some info about your upstream GATK pipeline? Is it possible that there was some pre-filtering done to the VCF that dropped those QUAL scores? I know VCFtools and other tools will sometimes recalculate site-level scores unless you specify otherwise.

Let me know,

Kieran

davidecarlson commented 3 years ago

Hi Kieran, Thanks for your response! My pipeline does have a few steps after the joint genotyping. It's a snakemake pipeline. Here are the relevant rules:

# jointly genotyping all of the samples together
rule genotype_rnd2:
        input:
            gvcf=config['results_loc']+"/gvcf/all.recal.g.vcf.gz",
            ref=REF
        output:
            vcf=config['results_loc']+"/vcfs/all.recal.vcf.gz"
        log:
            config['results_loc']+"/logs/jointgenotype/genotype.rnd2.log"
        params:
            java_opts="-Xmx12G",
            extra="--include-non-variant-sites true"
        wrapper:
            "0.50.4/bio/gatk/genotypegvcfs"

# apply some of the standard filters based partly on the BROAD recommended values
rule filter_rnd2:
        input:
            vcf=config['results_loc']+"/vcfs/all.recal.vcf.gz",
            ref=REF
        output:
            vcf=config['results_loc']+"/vcfs/all.recal.filtered.vcf.gz"
        log:
            config['results_loc']+"/logs/jointgenotype/filter.rnd2.log"
        params:
            java_opts="-Xmx12G",
            filter1="QD_filter",
            filter1exp="QD < 3.0",
            filter2="FS_filter",
            filter2exp=" FS > 50.0",
            filter3="SOR_filter",
            filter3exp="SOR > 3.0",
            filter4="MQ_filter",
            filter4exp="MQ < 40.0",
            filter5="MQRS_filter",
            filter5exp="MQRankSum < -8.0",
            filter6="RPRS_filter",
            filter6exp="ReadPosRankSum < -8.0",
        shell:
            "gatk --java-options {params.java_opts} VariantFiltration -R {input.ref} -V {input.vcf} -O {output.vcf} "
            "--filter-name '{params.filter1}' --filter-expression '{params.filter1exp}' "
            "--filter-name '{params.filter2}' --filter-expression '{params.filter2exp}' "
            "--filter-name '{params.filter3}' --filter-expression '{params.filter3exp}' "
            "--filter-name '{params.filter4}' --filter-expression '{params.filter4exp}' "
            "--filter-name '{params.filter5}' --filter-expression '{params.filter5exp}' "
            "--filter-name '{params.filter6}' --filter-expression '{params.filter6exp}' 2> {log}"

# select sites where there is at least one genotype call
MAX_MISSING = len(SAMPLES) - 1

rule select_rnd2:
        input:
            vcf=config['results_loc']+"/vcfs/all.recal.filtered.vcf.gz",
            ref=REF
        output:
            vcf=config['results_loc']+"/vcfs/all.recal.filtered.min1geno.vcf.gz"
        log:
            config['results_loc']+"/logs/selectvariants/select.rnd2.log"
        params:
            java_opts="-Xmx12G",
            missing=MAX_MISSING
        shell:
                "gatk --java-options {params.java_opts} SelectVariants -R {input.ref} -V {input.vcf} --max-nocall-number {params.missing} -O {output.vcf}  2> {log}"

I just checked, and the VCF file that is produced directly from the gatk GenotypeGVCFs step does produce invariant sites that are not missing all genotypes that don't have any QUAL scores. So I don't believe that the issue is coming from downstream filtering.

Here are a few more example sites from the "all.recal.vcf.gz" file produced above:

HiC_scaffold_1  455456  .       G       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455457  .       A       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:57       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455458  .       G       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:35       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455459  .       G       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455460  .       A       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455461  .       C       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455462  .       T       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455463  .       C       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455464  .       A       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0
HiC_scaffold_1  455465  .       C       .       .       .       DP=257  GT:DP:RGQ       0/0:50:99       0/0:25:75       ./.:0:0 0/0:1:3 0/0:25:74       ./.:0:0 ./.:0:0 0/0:29:87       ./.:0:0 ./.:0:0 0/0:5:15 0/0:15:45       ./.:0:0 ./.:0:0 0/0:17:51       0/0:24:72       0/0:22:66       ./.:0:0 0/0:9:27        0/0:1:3 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0  ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:9:27        0/0:25:75       ./.:0:0

In any case, I've already gotten around this by separately filtering the invariant and variant sites, and then concatenating the two data sets back together before running pixy. That solved the problem for me, and I got much more sensible results. If you don't feel like worrying about this, I certainly don't mind. Best, Dave

ksamuk commented 3 years ago

Awesome, thanks for the details! There really doesn't seem to be anything out of the ordinary in your pipeline, so I am not sure what's up. The only thing that I can think would be connected is the --max-nocall-number flag in SelectVariants, but I'm not 100% sure how that one behaves. Anyway, I'm going to add a note in the docs about this, along with your solution, just in case someone else runs into it (I'll leave this issue open until that is done). Thanks again!

davidecarlson commented 3 years ago

I don't think --max-nocall-number is responsible as I'm seeing the same problem with the VCF produced directly from GenotypeGVCFs, without any subsequent filtering or variant selection. But in any case, thanks for taking a look! Dave

ksamuk commented 3 years ago

Ah OK, right, I missed that in your last message. Out of curiosity, when you say current version of GATK, do you mean 4.2.0.0? Seems like this might have been a recent(ish) change.

davidecarlson commented 3 years ago

Yep. v4.2.0.0 of GATK is the one I used.

ksamuk commented 3 years ago

Great, will have a look, thanks again for reporting this.

ksamuk commented 3 years ago

I've added a note about this strange issue to the docs for posterity: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html

Thanks for your help!