Closed illusional closed 3 years ago
This is great Michael! Just to note, there's a mention of the RF classifier used for gnomAD that we spoke about last week in https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/, and also in the gnomAD paper's supplementary information.
This issue is primarily about pulling the required documentation together in one place.
Goal
Filter variants using variant quality score recalibration (VQSR) so that only high-quality variants are retained.
Background
Variant quality score recalibration occurs in two steps:
Definitions
$THRESHOLD percentage of variants in the truth set are included
Theory
Build a recalibration model to score variant quality:
random forest or Gaussian mixture model to find good variants within the callset.
Apply a score cutoff to filter variants based on the previously generated recalibration model.
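To make the second step concrete, here is a minimal sketch of choosing a score cutoff so that $THRESHOLD percent of truth-set variants are retained, then applying it. This is an illustration only, not GATK's implementation; the function names and toy scores are hypothetical.

```python
import math


def cutoff_for_truth_sensitivity(truth_scores, threshold_pct):
    """Return the score cutoff at which at least threshold_pct percent
    of the truth-set variants have a score >= cutoff."""
    if not truth_scores:
        raise ValueError("truth_scores must not be empty")
    ranked = sorted(truth_scores, reverse=True)  # best score first
    n_keep = max(1, math.ceil(len(ranked) * threshold_pct / 100))
    return ranked[n_keep - 1]


def apply_cutoff(scores_by_variant, cutoff):
    """Label each variant PASS if its score reaches the cutoff, else FAIL."""
    return {
        variant: ("PASS" if score >= cutoff else "FAIL")
        for variant, score in scores_by_variant.items()
    }


# Toy scores that a recalibration model might assign (hypothetical values).
truth_scores = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, 1.0, 0.5, -1.0, -2.0]
cutoff = cutoff_for_truth_sensitivity(truth_scores, 90)
labels = apply_cutoff({"var_a": 4.2, "var_b": -1.5}, cutoff)
```

A higher threshold keeps more of the truth set but lowers the cutoff, so more low-scoring variants in the callset pass as well.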
Random forest model
Edit: this section was added on March 1st
As Peter pointed out below, the current GATK variant recalibration (in 4.1) builds a Gaussian mixture model, whereas the gnomAD team uses a random forest model (what is a random forest model).
The random forest model the gnomAD team developed is available on GitHub at broadinstitute/gnomad_methods:variant_qc/random_forest.py, along with API docs.
This gnomad_methods RF pair (train_rf and apply_rf_model) would likely replace the previous pair of gatk VariantRecalibrator and ApplyVQSR. Here's a paper that discusses the effectiveness of RFs: ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest. However, an internal conversation suggested that RFs may be worse at detecting low-quality variants than the original Gaussian models.
Other notes
As variant recalibration isn't native to Hail, we need to export a sites-only VCF from the matrix table to perform this external analysis (sites-only to reduce size).
SNPs and indels must be recalibrated in separate runs.
We'll use the allele-specific version of the recalibrator.
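As a rough illustration of the first two notes, the sketch below shows what "sites-only" means at the VCF text level (dropping FORMAT and per-sample columns) and a simplified SNP/indel split. It uses only the standard library and is not the Hail export itself; the function names are hypothetical, and the type rule assumes biallelic records.

```python
def to_sites_only(vcf_lines):
    """Yield VCF lines with the FORMAT and per-sample columns removed."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            yield line
        else:
            # Keep the 8 fixed columns: CHROM POS ID REF ALT QUAL FILTER INFO.
            yield "\t".join(line.split("\t")[:8])


def variant_type(ref, alt):
    """Simplified classification for a biallelic record (an assumption)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"  # e.g. multi-nucleotide substitutions


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=10\tGT\t0/1",
]
sites_only = list(to_sites_only(example))
```

Dropping the genotype columns is what makes the exported file small, since its width no longer grows with the number of samples.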
Points to clarify
Analysis
We're taking at least portions of an existing WDL workflow that runs these steps.
Note: for performance reasons, SNPsVariantRecalibrator is run twice, once to create the model and then again in parallel (with GatherTranches -> ApplyVQSR run later). I don't see this documented in the docs for GATK 4.1.9 VariantRecalibrator, but there is documentation about it in the original Java code:
Source: https://github.com/broadinstitute/gatk/blob/ae06fb7345dd8bdef5e5dc0364416e81ab3622ec/src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/VariantRecalibrator.java#L280-L296
Params:
--output-model
--sample-every-Nth-variant
--output-tranches-for-scatter / -scatterTranches
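The two-pass pattern described above can be sketched as a pair of command builders: one downsampled run that writes the model, then one run per shard that reuses it. This is a sketch under assumptions, not the WDL's actual commands: the file names and function names are hypothetical, `--input-model` is not mentioned in this issue and is assumed here as the way a scattered run picks up the saved model, and the `--resource` and `-an` arguments that a real run needs are omitted.

```python
def create_model_cmd(sites_vcf, model_out, recal_out, tranches_out, downsample_factor):
    """First pass: fit the model on every Nth variant and save it."""
    return [
        "gatk", "VariantRecalibrator",
        "-V", sites_vcf,
        "-mode", "SNP",
        "--sample-every-Nth-variant", str(downsample_factor),
        "--output-model", model_out,
        "-O", recal_out,
        "--tranches-file", tranches_out,
        # Training resources and annotations are omitted in this sketch.
    ]


def scattered_cmd(shard_vcf, model_in, recal_out, tranches_out):
    """Second pass: score one shard with the saved model."""
    return [
        "gatk", "VariantRecalibrator",
        "-V", shard_vcf,
        "-mode", "SNP",
        "--input-model", model_in,  # assumption: not named in the issue text
        "--output-tranches-for-scatter",
        "-O", recal_out,
        "--tranches-file", tranches_out,
    ]


# Hypothetical shard names; the real workflow derives these from its scatter.
shards = [f"shard_{i}.vcf.gz" for i in range(3)]
model_cmd = create_model_cmd(
    "sites.vcf.gz", "snps.model.report", "snps.model.recal", "snps.model.tranches", 10
)
shard_cmds = [
    scattered_cmd(s, "snps.model.report", f"{s}.recal", f"{s}.tranches")
    for s in shards
]
```

The per-shard tranches files would then be combined by GatherTranches before ApplyVQSR runs, as noted above.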
Sources