jmonlong / sveval

Functions to compare a SV call sets against a truth set.
MIT License
26 stars 6 forks source link

sveval

Build Status codecov GitHub release (latest by date) Docker Repository on Quay

Functions to compare a SV call sets against a truth set. This package uses mostly overlap-based metrics, although for insertions it can align the inserted sequences to match variants. It uses:

  1. Installation
  2. Usage
    1. Quickstart
    2. Genotype evaluation
    3. Wiggle room in simple repeats
    4. Evaluation per size or per region
    5. Precision-recall curve comparing multiple methods
    6. Misc functions for frequency annotation, SV clustering, VCF parsing, repeat annotation in R.
    7. Snakemake pipeline
  3. Methods
  4. Docker
  5. Interactive exploration of the results
    1. Interactive exploration of FPs/FNs/TPs
    2. Interactive exploration of SVs in a variation graph
  6. Reading a VCF with SVs
  7. BND and translocations

Installation

Install using the Bioconductor installer:

BiocManager::install('jmonlong/sveval')

You might need to install devtools, remotes and BiocManager first:

install.packages(c('devtools', 'remotes', 'BiocManager'))

Usage

Quickstart

A "command line" option is to run the wrapper function used by the Snakemake pipeline section. For example:

Rscript -e  "sveval::wrapper('sveval')" -calls calls.vcf.gz -truth truth.vcf.gz -sample SAMPNAME -region conf.bed -simprep simpleRepeat.bed.gz -output sveval_out

More information in the wrapper command-line section.

Otherwise, run interactively in R (or in a script):

library(sveval)
eval.o = svevalOl('calls.vcf', 'truth.vcf')
eval.o$eval # data.frame with results using all variants
plot_prcurve(eval.o$curve)
eval.o$INS$fp # a GRanges object with false-positives insertions

Outputs a list with a data.frame with TP, FP, TN, precision, recall and F1 for all variants and for each SV type, and a another data.frame with the results using increasing quality thresholds to make a precision-recall curve.

Some of the most important other parameters:

See full list of parameters in the manual or by typing ?svevalOl in R.

Genotype evaluation

By default sveval doesn't take the genotype into account, it's more a "calling" evaluation than a "genotyping" evaluation. To compare genotype, the evaluation can be performed separately for heterozygous and homozygous variants. Before doing that it sometimes helps to merge very similar hets into homs. To a lower extent, it also helps to stitch fragmented hets before trying to merge them into homs. When comparing genotypes we'd rather match variants 1-to-1 instead of using the cumulative coverage metrics. It's not about testing if a SV was called in the region but to make sure the actual genotype in the region is correct. This penalizes "over-genotyping": genotyping the near-duplicate variants multiple times. The relevant parameters in svevalOl are:

Hence, the recommended command for genotype evaluation:

eval.o = svevalOl('calls.vcf', 'truth.vcf', geno.eval=TRUE, method="bipartite", stitch.hets=TRUE, merge.hets=TRUE)

Wiggle room in simple repeats

To help match SVs within simple repeats, we can provide information about annotated simple repeats in the genome. sveval will allow more wiggle room, i.e. for a SV to "move" along a simple repeat segment. It helps matching similar simple repeat variants that are just positioned differently. For example, a similar deletion might be called at the beginning of the annotated repeat but positioned at the end in the truthset. And because simple repeats are not perfect and the SVs exactly the same, left-aligning might not handle these cases.

A GRanges object can be provided to svevalOl with simprep=.

We've prepared BED files for GRCh38 and GRCh37, see docs for more details.

Evaluation per size or per region

The evaluation can already be narrowed down to a size range and a set of regions (min.size=/max.size=/bed.regions= in svevalOl). It's also interesting to split the result into (many) different SV classes without having to rerun svevalOl. Two functions are provided to take the output of svevalOl and compute the evaluation metrics per size class or per region: plot_persize and plot_perregion. Of note, there is no exploration of the calls' quality like for PR curves. The new metrics are computed from the set of TP/FP/FN as defined by the "best" run (maximum F1 score).

eval.o = svevalOl('calls.vcf', 'truth.vcf')
plot_persize(eval.o)
regs = GRanges(...)
plot_perregion(eval.o, regs)

Both functions return a list of ggplot2 graphs. If using plot=FALSE they will return a data.frame.

Precision-recall curve comparing multiple methods

eval.1 = svevalOl('calls1.vcf', 'truth.vcf')
eval.2 = svevalOl('calls2.vcf', 'truth.vcf')
plot_prcurve(list(eval.1$curve, eval.2$curve), labels=c('method1', 'method2'))

Or if the results were written in files:

plot_prcurve(c('methods1-prcurve.tsv', 'methods2-prcurve.tsv'), labels=c('method1', 'method2'))

Miscellaneous functions

Frequency annotation

Assuming that we have a SV catalog with a field with frequency estimates, we can overlap called SVs and annotate them with the maximum frequency of overlapping SVs in the catalog.

For example:

freqAnnotate('calls.vcf', 'gnomad.vcf', out.vcf='calls.withFreq.vcf')

Variant clustering

SVs can be clustered (and merged) into sites using the same matching logic as when comparing a callset to a truthset. In practice, SV sites might be easier to analyze when comparing SVs across multiple samples. For example, we might want to know that multiple samples carry an insertion, even if the insertions are slightly different due to internal SNVs.

The relevant function in the sveval package is clusterSVs. See the manual for more details on the arguments, and the example below on how to read, cluster and count alleles in SV sites.

VCF parsing

The sveval package has some functions to efficiently read SVs from a (potentially large) VCF. Rather than reading the whole VCF with existing packages and filtering/parsing SVs, it does it while reading the file which helpe keeping the memory/time low. See the manual for the following functions:

For example, starting from a VCF with SVs for multiple samples:

## read VCF
svs.gr = readSVvcf.multisamps("input.vcf.gz")

## cluster similar SVs into SV sites
sv.sites = clusterSVs(svs.gr)

## make matrix of allele counts for each sample and each SV site
ac.mat = countAlleles("input.vcf.gz", sv.sites)

Repeat masker annotation

rmskAnnotate is a helper function to run RepeatMasker on the REF/ALT sequences from a VCF. See the manual for more details.

Wrapper command-line

To use sveval as a command line and avoid interactive usage of R, see the wrapper function used by the Snakemake pipeline documented below. Briefly, to compare a call set to a truthset, run something like:

Rscript -e  "sveval::wrapper('sveval')" -calls calls.vcf.gz -truth truth.vcf.gz -sample SAMPNAME -region conf.bed -simprep simpleRepeat.bed.gz -output sveval_out

This command will output three files:

  1. A sveval_out-prcurve.tsv TSV file with the evaluation metrics for each quality threshold. Useful to plot a precision-recall curve. The first rows (qual=0) correspond to using no quality threshold, i.e. the metrics when using all SV calls.
  2. A sveval_out-persize.tsv TSV file with the evaluation metrics for SVs grouped into a few size classes.
  3. A sveval_out-sveval.RData file with the results as a R object that can be loaded in R, for example to make graphs or explore the results interactively.

The arguments are detailed by running Rscript -e "sveval::wrapper('sveval')":

Evaluate SV calls against a truthset
    -c, -calls  input VCF with the SV calls
    -t, -truth  input VCF with the truth SVs
    -s, -sample sample name
    -o, -output output prefix for Rdata and TSV files
    -e, -eval   Optional. evaluation type: 'call' for absence/presence, 'geno' if genotypes have to match. Default: call
    -r, -region Optional. regions of interest. NA means whole genome. Default: NA
    -p, -simprep    Optional. simple repeat track. NA means disabled. Default: NA
    -i, -inversion  Optional. look for inversions? Default: FALSE
    -m, -minol  Optional. minimum overlap to match variants. Default:0.5

Of note the Snakemake pipeline also uses the mergetsvs subcommand to merge evaluations from different runs (e.g. tools, call/genotype, regions). Run Rscript -e "sveval::wrapper(mergetsvs)" for more information.

Snakemake pipeline

In practice, VCF should be normalized beforehand (e.g. using bcftools), and we might want to evaluate multiple methods, across different regions set (e.g. whole-genome, confident regions, non-repeat regions), and for both calling and genotyping performance. To streamline this process, we use a Snakemake pipeline. This pipeline will perform the normalization, run all the evaluation, and summarize the results in TSV files and graphs. We show an example on how to evaluate calls against the GIAB SV truthset.

See the snakemake folder.

Methods

SV presence

To evaluate the calling performance, i.e. that the presence of a SV is correctly predicted no matter the exact genotype, we use a coverage-based approach. In brief we ask how much of a variant is "covered" by variants in the other set. In contrast to a simple reciprocal overlap, this approach is robust to call fragmentation.

The default criteria implemented are:

SV genotype

When evaluating exact genotypes, heterozygous and homozygous variants are overlapped separately (with the same criteria as described above). For each genotype, the overlap relationship is then used to build a bipartite graph. Each call variant is matched with a truth variant using bipartite clustering. All variants matched are considered true positives, and the rest errors.

To reduce the effect of fragmented calls, we can stitch variants that are extremely close into one longer variant. Similarly, if two heterozygous variants are extremely similar, they can be merged into one homozygous variant.

Docker

A docker image of R with this package installed is available at quay.io/jmonlong/sveval (e.g. quay.io/jmonlong/sveval:v2.3.0). More information about starting the container and using sveval or the snakemake pipeline at snakemake/README.md#start-the-docker-container.

Interactive exploration of the results

Interactive exploration of FPs/FNs/TPs

Using the explore_eval_svs function, an app starts in the web browser. False positives (FPs), false negatives (FNs), or true positives (TPs), can be quickly displayed together with the other SVs nearby. This is useful to double-check that SVs were matched properly and that FPs/FNs are really FPs/FNs. There is also a link to open the current region on the UCSC Genome Browser to help get a feel for the genomic regions, esp. its repeat content.

Interactive exploration of SVs in a variation graph

Using the ivg_sv function and a xg graph (vg must be installed):

The new version includes a linear representation of the variants in the region:

Reading a VCF with SVs

Reading a VCF in R can consume a lot of memory if it contains millions of variants, large variants with explicit sequence information, and in general rich INFO fields. Only a subset of this information is used by sveval in the end: all the SNV/indels are filtered, just the few typical SV INFO fields are used, and sometimes we don't want (or can't) use sequence information. To speed up the reading of large VCF, a C parser was implemented to only keep SVs and the relevant information.

The readSVvcf function is a wrapper to this parser and can return either a data.frame, a GRanges, or a VCF object as define by the VariantAnnotation package from Bioconductor.

The main parameters specify what information to keep and which variants to filter out right away:

BND and translocations

There are different ways of representing translocations (or "junctions") in VCF. The VCF parser here will recognize BND if the other genomic location is specified with either:

SVs with a SVTYPE of BND or TRA won't be filtered by the size filter (min.sv.size parameter).