Adding VQSR to the QC pipeline

This issue is primarily about pulling required documentation together for this issue in one place.

Goal

To perform filtering through variant quality score recalibration to ensure high quality variants

Background

Variant quality score recalibration occurs in two steps:

Build a recalibration model to score variant quality
Apply score cutoff to filter variants based on the previously generated recalibration model.

Definitions

A VQSLOD (variant quality score log-odds) is a multi-dimensional score for a variant.
The tranche sensitivity threshold determines the maximum VQSLOD score for which $THRESHOLD percentage of variants in the truth set are included
A tranche is a partition of variants (buckets) with each bucket having an increasing minimum sensitivity threshold.

Theory

Build a recalibration model to score variant quality:
- Takes an overlap of the training/truth resource sets and of your callset.
- Use the characteristic of the true positives within your callset to build a ~~random forest~~ gaussian mixture model to find good variants within the callset.
- Use this statistical model to for each variant calculate a VQSLOD score, and add this into the INFO field.
Apply score cutoff to filter variants based on the previously generated recalibration model.

Random forest model

Edit: this section was added on March 1st

As Peter pointed out below, the current GATK variant recalibration (in 4.1) builds a Gaussian mixture model, the gnomAD team use a random forest model (what is a random forest model).

The Random Forest model the gnomAD team developed is available in GitHub: broadinstitute/gnomad_methods:variant_qc/random_forest.py, and API docs.

This gnomad_methods RF pair (train_rf and apply_rf_model) would likely replace the previous pair of gatk VariantRecalibration and ApplyVQSR. Here's a paper that discussed the effectiveness of RFs: ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest, however there was internal conversation that suggested that RFs may be worse at detecting low quality variants than original gaussian models.

Other notes

As variant recalibration isn't native to Hail, from the matrix table, we need to export a sites-only (to reduce size) VCF to perform this external analysis (sites-only to reduce space.

SNPs and indels must be recalibrated in separate runs.

We'll use the allele-specific version of the recalibrator.

Points to clarify

Do we need the model reporting plots (we need Rscript in the path if so)?
Should we discard the variants that do NOT pass the quality score recalibration?
What tranche sensitivity threshold should we use, IE 99.0%?

Analysis

We're taking at least portions of an existing WDL workflow that runs these steps.

Note, for performance reasons the SNPsVariantRecalibrator is ran twice, once to create the model, and then again in parallel (with GatherTranches -> ApplyVQSR run later). I don't see this documented in the docs for GATK 4.1.9 VariantRecalibrator, but there is documentation about it in the original java code:

This argument is intended to be used in a more complicated VQSR scheme meant for very large WGS callsets that require a prohibitive amount of memory for classic VQSR. Given that training data is downsampled once it exceeds --max-num-training-data, reading in additional data to build the model only serves to consume resources. However, with this argument the output recal file will also be downsampled. The recommended VQSR procedure when using this argument is to run VariantRecalibrator once with sampling and designate an --output-model file. Then VariantRecalibrator can be run a second time scattered using the -scatterTranches argument and that file as an --input-model. The scattered recal files can be gathered with the GatherVcfs tool and the scattered tranches can be gathered with the GatherTranches tool.

Source: https://github.com/broadinstitute/gatk/blob/ae06fb7345dd8bdef5e5dc0364416e81ab3622ec/src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/VariantRecalibrator.java#L280-L296

Params:

--output-model
sample-every-Nth-variant
output-tranches-for-scatter / -scatterTranches
maybe more...

Sources

https://gatk.broadinstitute.org/hc/en-us/articles/360035531612
VariantRecalibrator: https://gatk.broadinstitute.org/hc/en-us/articles/360036510892-VariantRecalibrator
ApplyVQSR: https://gatk.broadinstitute.org/hc/en-us/articles/360037056912-ApplyVQSR
A guide to GATK4 best practice pipeline performance and optimization: https://www.ibm.com/downloads/cas/ZJQD0QAL
Genome Aggregation Database (gnomAD): https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/

populationgenomics / joint-calling