lganel / SVScore

Prioritize structural variants based on CADD scores
MIT License

SVScore

SVScore is a VCF annotation tool which scores structural variants by predicted pathogenicity based on SNP-based CADD scores. For each variant, SVScore first defines important genomic intervals based on the variant type, breakend confidence intervals, and overlapping exon/intron annotations. It then applies an operation to each interval to aggregate the CADD scores in that interval into an interval score. The variant's score for a given operation is defined as the maximum of all interval scores calculated using that operation. SVScore is based on hg19/GRCh37.
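The aggregation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SVScore's Perl implementation: the interval contents are made-up numbers, and the function name `score_variant` is hypothetical.

```python
from statistics import mean

# Hypothetical per-interval CADD scores for one variant (made-up numbers).
intervals = {
    "LEFT": [12.0, 8.5, 20.1],
    "RIGHT": [3.2, 4.4],
    "SPAN": [12.0, 8.5, 20.1, 30.0, 3.2, 4.4],
}

# Two of the operations named in the usage message.
operations = {"max": max, "mean": mean}


def score_variant(intervals, operations):
    """Apply each operation to each interval, then take the max per operation."""
    result = {}
    for op_name, op in operations.items():
        per_interval = {name: op(scores) for name, scores in intervals.items()}
        result[op_name] = {
            "intervals": per_interval,
            # The variant's score for an operation is its largest interval score.
            "overall": max(per_interval.values()),
        }
    return result


scores = score_variant(intervals, operations)
```

Here `scores["max"]["overall"]` comes from the SPAN interval, while `scores["mean"]["overall"]` comes from the LEFT interval, showing that different operations can be driven by different intervals.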

For more information, please see our paper in Bioinformatics: https://doi.org/10.1093/bioinformatics/btw789

Usage

usage: ./svscore.pl [-dv] [-o op] [-e exonfile] [-f intronfile] [-c caddfile] -i vcf
    -i        Input VCF file. May be bgzip compressed (ending in .vcf.gz). Use "-i stdin" if using standard input
    -d        Debug mode, keeps intermediate and supporting files, displays progress
    -v        Verbose mode - show all calculated scores (left/right/span/ltrunc/rtrunc, as appropriate)
    -o        Comma-separated list of operations to perform on CADD score intervals (must be some combination of sum, max, mean, meanweighted, top\d, and top\dweighted - defaults to top10weighted)
    -e        Points to exon BED file (refGene.exons.bed)
    -f        Points to intron BED file (refGene.introns.bed)
    -c        Points to whole_genome_SNVs.tsv.gz (defaults to current directory)
    -s        Specifies version of svtools to be used (defaults to version installed under name "svtools")
    -t        Length threshold, in bp, above which SVs receive an automatic score of 100 (1,000,000)

    --help    Display this message
    --version Display version

First Time Setup

After downloading SVScore, there are a few steps to follow before it is ready to use.

  1. Test SVScore using sh tests/test.sh path/to/whole_genome_SNVs.tsv.gz
  2. Generate annotation files. For more on this, see Annotation Files.
  3. SVScore assumes the user's version of perl is installed in the default directory (/usr/bin/perl). If this is not the case, the first line of all .pl files should be changed to reflect the correct perl installation directory.

Annotation Files

Output

SVScore outputs a VCF file with scores added to the INFO field of each variant. The VCF header is also updated to describe the scores that were added. Each score field has the format SVSCORE[op](_[interval]), where [op] is the operation used to calculate the score (see Operations) and [interval] is the interval over which the score was calculated. The interval is one of: left breakend, right breakend, span (for DEL/DUP), left truncation (for INV/DEL/INS variants which seem to truncate a transcript on the left side; the interval runs from the most likely base of the left breakend to the end of the transcript), or right truncation. Scores with no interval listed (such as SVSCOREMAX=) are the maximum over all intervals for that operation.
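As a rough illustration of the field format, the following Python sketch pulls SVSCORE annotations out of an INFO string. It assumes the interval suffixes appear as LEFT, RIGHT, SPAN, LTRUNC, and RTRUNC (the names used in the usage message), and the example INFO string and its values are invented; `parse_svscores` is a hypothetical helper, not part of SVScore.

```python
import re

# Matches SVSCORE[op] with an optional _[interval] suffix.
# Assumption: op and interval names are written in upper case.
SVSCORE_KEY = re.compile(r"^SVSCORE([A-Z0-9]+)(?:_([A-Z]+))?$")


def parse_svscores(info):
    """Return {(op, interval): score}; interval is None for the overall score."""
    scores = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        match = SVSCORE_KEY.match(key)
        if not sep or match is None:
            continue  # skip flag fields and non-SVScore keys
        scores[(match.group(1), match.group(2))] = float(value)
    return scores


# Invented INFO string for illustration only.
info = "SVTYPE=DEL;END=12345;SVSCOREMAX=25.3;SVSCOREMAX_LEFT=10.1;SVSCOREMAX_SPAN=25.3"
parsed = parse_svscores(info)
```

The key `("MAX", None)` holds the overall score for the max operation, and keys such as `("MAX", "LEFT")` hold the per-interval scores.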

Intervals

For each variant, scores are calculated over a number of intervals which varies by SV type. The intervals chosen for each SV type are described in Supported SV types and intervals.

Truncation intervals are defined for each transcript which seems to be truncated by a variant. The interval extends from the most likely base of the furthest upstream breakend (LEFT for transcripts on the + strand, RIGHT for those on the - strand) to the end of the transcript. Each truncation score is the maximum over all transcripts truncated by a variant.
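One way to picture the truncation interval is the sketch below. It rests on an assumption the text does not spell out: for a transcript on the - strand, "the end of the transcript" is taken to be its lowest genomic coordinate, so the interval runs from the transcript start coordinate up to the RIGHT breakend. The function name and coordinates are hypothetical.

```python
def truncation_interval(strand, left_pos, right_pos, tx_start, tx_end):
    """Return (start, end) of the truncation interval in genomic coordinates.

    left_pos / right_pos: most likely base of the LEFT / RIGHT breakend.
    tx_start / tx_end: lowest / highest genomic coordinate of the transcript.
    """
    if strand == "+":
        # Upstream breakend is LEFT; transcription ends at tx_end.
        return (left_pos, tx_end)
    if strand == "-":
        # Upstream breakend is RIGHT; assumed to end at tx_start.
        return (tx_start, right_pos)
    raise ValueError(f"unknown strand: {strand!r}")


plus_interval = truncation_interval("+", 150, 400, 100, 500)
minus_interval = truncation_interval("-", 150, 400, 100, 500)
```

The truncation score for a variant would then be the maximum of the scores computed over these intervals, one per truncated transcript.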

Supported SV types and intervals

SV type  LEFT  RIGHT  SPAN  LTRUNC  RTRUNC
DEL      X     X      X     X       X
DUP      X     X      X     -       -
INV      X     X      -     X       X
BND      X     X      -     -       -
INS      X     X      -     X       X
CNV      X     X      X     -       -
MEI      X     X      -     X       X

To function correctly, SVScore requires that POS=END and CIPOS=CIEND for INS variants.
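A quick pre-flight check for the INS requirement might look like the following Python sketch. It assumes the INFO field has already been split into a dictionary of strings, and `ins_meets_requirements` is a hypothetical helper rather than anything SVScore provides.

```python
def parse_ci(value):
    """Turn a confidence interval string such as "-5,5" into a tuple of ints."""
    return tuple(int(x) for x in value.split(","))


def ins_meets_requirements(pos, info):
    """Check POS=END and CIPOS=CIEND for an INS record.

    pos: the POS column as an int; info: dict of INFO keys to string values.
    """
    try:
        same_position = int(info["END"]) == pos
        same_ci = parse_ci(info["CIPOS"]) == parse_ci(info["CIEND"])
    except KeyError:
        return False  # a required INFO key is absent
    return same_position and same_ci


ok = ins_meets_requirements(1000, {"END": "1000", "CIPOS": "-5,5", "CIEND": "-5,5"})
bad_end = ins_meets_requirements(1000, {"END": "1001", "CIPOS": "-5,5", "CIEND": "-5,5"})
no_ciend = ins_meets_requirements(1000, {"END": "1000", "CIPOS": "-5,5"})
```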

LTRUNC and RTRUNC scores are only calculated when a breakend overlaps an exon, or overlaps an intron which is not also touched by the opposite breakend.

Operations

-o specifies the operation(s) used to calculate SVScores. These operations are applied to each interval of the SV (see Supported SV types and intervals). This option takes an arbitrary-length, case-insensitive, comma-separated list of operations drawn from sum, max, mean, meanweighted, top\d, and top\dweighted (see Usage); the default is top10weighted.

For weighted operations, if PRPOS is not found in the header, SVScore issues a warning and calculates unweighted means instead. If PRPOS or PREND is missing from a variant but is present in the header, that variant will receive a score of -1 for all weighted operations.
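The fallback rules for weighted operations can be sketched as follows. This assumes that PRPOS/PREND supply one probability per candidate position and that these probabilities are used directly as weights; the text here does not confirm the exact weighting, so treat the arithmetic as illustrative. The function name and inputs are hypothetical.

```python
import warnings

MISSING_SCORE = -1  # sentinel described above for variants lacking PRPOS/PREND


def meanweighted(scores, probs, header_has_pr):
    """Weighted mean of scores, with the two fallbacks described above.

    probs is None when the variant itself lacks the probability field.
    """
    if not header_has_pr:
        # Header lacks PRPOS: warn and fall back to the unweighted mean.
        warnings.warn("PRPOS not in header; using unweighted mean")
        return sum(scores) / len(scores)
    if probs is None:
        # Header declares the field but this variant does not carry it.
        return MISSING_SCORE
    return sum(s * p for s, p in zip(scores, probs)) / sum(probs)


scores = [10.0, 20.0, 30.0]
probs = [0.5, 0.25, 0.25]
weighted = meanweighted(scores, probs, header_has_pr=True)
missing = meanweighted(scores, None, header_has_pr=True)
```

With these made-up inputs the weighted mean is pulled toward the first position, which carries half of the probability mass.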

For SPAN/LTRUNC/RTRUNC, these operations are applied to the scores of the bases in the interval. For LEFT/RIGHT intervals, the operations are applied to scores assigned to each possible breakpoint, which are calculated by taking the average of the 2 flanking bases (one on either side of the possible breakpoint).
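The breakpoint scoring for LEFT/RIGHT intervals amounts to averaging adjacent pairs of per-base scores. A minimal Python sketch, using invented scores and a hypothetical function name:

```python
def breakpoint_scores(base_scores):
    """Score each possible breakpoint as the mean of its two flanking bases.

    A run of n bases has n - 1 possible breakpoints between adjacent bases.
    """
    return [(a + b) / 2 for a, b in zip(base_scores, base_scores[1:])]


# Invented per-base CADD scores across a breakend confidence interval.
bases = [10.0, 20.0, 40.0, 40.0]
points = breakpoint_scores(bases)
```

The chosen operation (for example max or mean) would then be applied to `points` rather than to `bases`.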

Dependencies

The following must be in your path to use SVScore: svtools, vcfanno, tabix

Troubleshooting

Notes

The -s option should not be provided if svtools is present in the user's path as "svtools". This option should only be used if svtools is installed as "svtools-XXX", where XXX is the version number.

If an input VCF file already has SVSCORE annotations in the INFO column, new annotations will overwrite old ones.

Input VCF files may be gzipped, but gzipped files must end with .gz. Uncompressed input files should not end with this suffix. Annotation files may be gzipped or unzipped. SVScore will zip/unzip files as necessary using bgzip and zcat.

For multiline variants, the primary mate is considered the left breakend and the secondary mate is considered the right breakend.

If only one mate line of a multiline variant is present in the VCF file, left and right breakend scores are still calculated, as well as one truncation score if applicable (whether it is the left or right truncation score depends on whether the line describes a primary or secondary mate). There must be a CIEND interval in the INFO field for this to happen.

Variants with type DEL, DUP, or CNV which are over 1 Mb in length are automatically given a score of 100. The threshold can be changed with the -t option.
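The length cutoff can be expressed as a small check. This sketch assumes that variant length is END minus POS, which is not stated here, and the function and constant names are hypothetical.

```python
AUTO_SCORE = 100
DEFAULT_THRESHOLD = 1_000_000  # bp, the default given for -t in the usage message
LENGTH_CAPPED_TYPES = {"DEL", "DUP", "CNV"}


def automatic_score(svtype, pos, end, threshold=DEFAULT_THRESHOLD):
    """Return 100 for long DEL/DUP/CNV variants, otherwise None.

    Assumption: length is measured as end - pos.
    """
    if svtype in LENGTH_CAPPED_TYPES and end - pos > threshold:
        return AUTO_SCORE
    return None  # caller falls through to normal interval scoring


long_del = automatic_score("DEL", 1000, 2_001_001)
long_inv = automatic_score("INV", 1000, 2_001_001)
short_dup = automatic_score("DUP", 1000, 5000)
```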