populationgenomics / joint-calling

Sample and variant QC, based on https://github.com/broadinstitute/gnomad_qc
MIT License
1 stars 2 forks source link

Adding VQSR to the QC pipeline #6

Closed illusional closed 3 years ago

illusional commented 3 years ago

This issue is primarily about pulling required documentation together for this issue in one place.

Goal

To perform filtering through variant quality score recalibration to ensure high quality variants

Background

Variant quality score recalibration occurs in two steps:

  1. Build a recalibration model to score variant quality
  2. Apply score cutoff to filter variants based on the previously generated recalibration model.

Definitions

Theory

  1. Build a recalibration model to score variant quality:

    • Takes an overlap of the training/truth resource sets and of your callset.
    • Use the characteristic of the true positives within your callset to build a random forest gaussian mixture model to find good variants within the callset.
    • Use this statistical model to for each variant calculate a VQSLOD score, and add this into the INFO field.
  2. Apply score cutoff to filter variants based on the previously generated recalibration model.

Random forest model

Edit: this section was added on March 1st

As Peter pointed out below, the current GATK variant recalibration (in 4.1) builds a Gaussian mixture model, the gnomAD team use a random forest model (what is a random forest model).

The Random Forest model the gnomAD team developed is available in GitHub: broadinstitute/gnomad_methods:variant_qc/random_forest.py, and API docs.

This gnomad_methods RF pair (train_rf and apply_rf_model) would likely replace the previous pair of gatk VariantRecalibration and ApplyVQSR. Here's a paper that discussed the effectiveness of RFs: ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest, however there was internal conversation that suggested that RFs may be worse at detecting low quality variants than original gaussian models.

Other notes

As variant recalibration isn't native to Hail, from the matrix table, we need to export a sites-only (to reduce size) VCF to perform this external analysis (sites-only to reduce space.

SNPs and indels must be recalibrated in separate runs.

We'll use the allele-specific version of the recalibrator.

Points to clarify

Analysis

We're taking at least portions of an existing WDL workflow that runs these steps.

image

Note, for performance reasons the SNPsVariantRecalibrator is ran twice, once to create the model, and then again in parallel (with GatherTranches -> ApplyVQSR run later). I don't see this documented in the docs for GATK 4.1.9 VariantRecalibrator, but there is documentation about it in the original java code:

This argument is intended to be used in a more complicated VQSR scheme meant for very large WGS callsets that require a prohibitive amount of memory for classic VQSR. Given that training data is downsampled once it exceeds --max-num-training-data, reading in additional data to build the model only serves to consume resources. However, with this argument the output recal file will also be downsampled. The recommended VQSR procedure when using this argument is to run VariantRecalibrator once with sampling and designate an --output-model file. Then VariantRecalibrator can be run a second time scattered using the -scatterTranches argument and that file as an --input-model. The scattered recal files can be gathered with the GatherVcfs tool and the scattered tranches can be gathered with the GatherTranches tool.

Source: https://github.com/broadinstitute/gatk/blob/ae06fb7345dd8bdef5e5dc0364416e81ab3622ec/src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/VariantRecalibrator.java#L280-L296

Params:

Sources

pdiakumis commented 3 years ago

This is great Michael! Just to note, there's a mention of the RF classifier used for gnomAD that we spoke about last week in https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/, and also in the gnomAD paper's supplementary information.