A flexible framework for rapid genome analysis and interpretation
C Chiang, R M Layer, G G Faust, M R Lindberg, D B Rose, E P Garrison, G T Marth, A R Quinlan, and I M Hall. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Meth (2015). doi:10.1038/nmeth.3505.
http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3505.html
Install
git clone --recursive https://github.com/hall-lab/speedseq
cd speedseq
make
Run the example script
cd example
./run_speedseq
This should produce the following files:
As a template for installation on other systems, we have provided the exact commands for a full installation of SpeedSeq and GEMINI on a blank [Amazon Linux](Amazon Linux AMI 2014.09.2 (HVM)) box. These commands encompass all of the installation steps outlined below.
System paths to SpeedSeq's component software are specified in the speedseq.config file, which should reside in the same directory as the SpeedSeq executable (for alternate locations use the -K flag). Upon installation, SpeedSeq attempts to automatically generate this file, but manual editing may be necessary.
The core components enable standard functionality outlined in Quick start.
Compilation requires g++ and the standard C and C++ development libraries. Additionally, cmake is required for building the BamTools API within FreeBayes and LUMPY.
git clone --recursive https://github.com/hall-lab/speedseq
cd speedseq
make
Essential SpeedSeq components can be installed with make
, which produces a log file (install.log) that details the compilation status.
The installation is modular, and its units can be built separately with make align
, make var
, make somatic
, make sv
, and make realign
. This allows installation of only the desired components, eliminating extraneous dependencies. It further allows rebuilding of previously failed components.
If any components already exist on the system or fail to install, their paths can be manually specified by editing speedseq.config.
Optional components enable advanced features such as variant annotation and read-depth analysis.
curl -OL https://github.com/Ensembl/ensembl-tools/archive/release/76.zip
unzip 76.zip
perl ensembl-tools-release-76/scripts/variant_effect_predictor/INSTALL.pl \
-c $SPEEDSEQ_DIR/annotations/vep_cache \
-a ac -s homo_sapiens -y GRCh37
cp ensembl-tools-release-76/scripts/variant_effect_predictor/variant_effect_predictor.pl $SPEEDSEQ_DIR/bin
cp -r Bio $SPEEDSEQ_DIR/bin
# Update the VEP and VEP_CACHE_DIR variables in speedseq.config to point to
# $SPEEDSEQ_DIR/bin/variant_effect_predictor.pl and $SPEEDSEQ_DIR/annotations/vep_cache
CNVnator requires the ROOT package as a prerequiste (https://root.cern.ch/drupal/)
Install the ROOT package
curl -OL ftp://root.cern.ch/root/root_v5.34.20.source.tar.gz
tar -zxvf root_v5.34.20.source.tar.gz
cd root
./configure --prefix=$PWD
make
Source thisroot.sh
source /pathto/root/bin/thisroot.sh
Compile CNVnator from the SpeedSeq directory
cd $SPEEDSEQ_DIR
make cnvnator
Before running SpeedSeq, you'll need to add the following line to speedseq.config or your .bashrc file. (Substitute the actual path to thisroot.sh on your system)
source /pathto/root/bin/thisroot.sh
Please refer to the CNVnator repository for details on installing CNVnator.
We recommend using the GRCh37 human genome for SpeedSeq, available here:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.fai
The genome FASTA file should be unzipped and indexed with BWA before running SpeedSeq.
For human genome alignment using the GRCh37 build, we recommend using the annotations/ceph18.b37.include.2014-01-15.bed windows to parallelize variant calling (speedseq var
and speedseq somatic
). This BED file excludes 15.6 Mb of the non-gapped genome where the coverage in the CEPH1463 pedigree was greater than twice the mode coverage plus 3 standard deviations. We believe these extremely high depth regions that we excluded are areas of misassembly in the GRCh37 human reference genome in which variant calling is time-consuming and error-prone.
Additionally, the regions in annotations/ceph18.b37.include.2014-01-15.bed are variable-width windows which each contain approximately the same coverage depth in the CEPH1463 pedigree, and sorted from highest to lowest depth. This ensures that the parallelization of Freebayes uses approximately the same amount of time per region.
The regions in annotations/ceph18.b37.exclude.2014-01-15.bed represent the complement of the regions in annotations/ceph18.b37.include.2014-01-15.bed.
In the speedseq sv
module, we recommend excluding the genomic regions in the annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed BED file. These regions represent the complement of those in annotations/ceph18.b37.include.2014-01-15.bed as well as the mitochondrial chromosome.
SpeedSeq is a modular framework with four components:
These modules operate independently of each other and produce universal output formats that are compatible with external tools. SpeedSeq modules can also run on BAM alignments that were produced outside of the SpeedSeq framework. . However, structural variant detection on BAM files generated outside of SpeedSeq will be slower due to two unique features of speedseq align
. First, our alignment uses SAMBLASTER to automatically extract split and discordant reads for SV detection. While the speedseq sv
module will internally extract split and discordant reads from regular BAM files, it takes much longer due to obligate name-sorting of the BAM file. Secondly, structural variant genotyping is much faster on BAM files processed by SAMBLASTER due to the addition of mate CIGAR and mate mapping quality tags. In the absence of these tags, SVTyper must jump to each read’s mate position in the BAM file, which greatly increases run time.
speedseq align
converts paired-end FASTQ sequences to a duplicate-marked, sorted, indexed BAM file that can be processed with other SpeedSeq modules.
Internally, speedseq align
runs the following steps to produce three output BAM files:
usage: speedseq align [options] <reference.fa> <in1.fq> [in2.fq]
reference.fa genome reference fasta file (required)
in1.fq paired-end fastq file. if -p flag is used then expected to be
an interleaved paired-end fastq file, and in2.fq may be omitted.
(may be gzipped) (required)
in2.fq paired-end fastq file. (may be gzipped) (required)
-o STR output prefix [default: in1.fq]
-R read group header line such as "@RG\tID:id\tSM:samplename\tLB:lib" (required)
-p first fastq file consists of interleaved paired-end sequences
-t INT number of threads to use [default: 1]
-T DIR temp directory [./outprefix.XXXXXXXXXXXX]
-I FLOAT[,FLOAT[,INT[,INT]]]
specify the mean, standard deviation (10% of the mean if absent), max
(4 sigma from the mean if absent) and min of the insert size distribution.
FR orientation only. [inferred]
-i include duplicates in splitters and discordants
(default: exclude duplicates)
-c INT maximum number of split alignments for a read to be
included in splitter file [default: 2]
-m INT minimum non-overlapping base pairs between two alignments
for a read to be included in splitter file [default: 20]
-M amount of memory in GB to be used for sorting [default: 20]
-K FILE path to speedseq.config file (default: same directory as speedseq)
-v verbose
-h show help message
speedseq align
produces three sorted, indexed BAM files (plus their corresponding .bai index files):
outprefix.bam
speedseq var
, speedseq somatic
, and speedseq sv
.outprefix.splitters.bam
-S
flag input to speedseq sv
. This file excludes duplicate reads by default, but they will be included if the -i
flag is specified as a speedseq align
command line parameter.outprefix.discordants.bam
-D
flag input to speedseq sv
. This file excludes duplicate reads by default, but they will be included if the -i
flag is specified as a speedseq align
command line parameter.speedseq var
runs FreeBayes on one or more BAM files.
usage: speedseq var [options] <reference.fa> <input1.bam> [input2.bam [...]]
reference.fa genome reference fasta file
input.bam BAM file(s) to call variants on. Must have readgroup information,
and the SM readgroup tags will be the VCF column headers
-o STR output prefix [default: input1.bam]
-w FILE BED file of windowed genomic intervals. For human genomes,
we recommend using the annotations/ceph18.b37.include.2014-01-15.bed
(see Annotations)
-q FLOAT minimum variant QUAL score to output [1]
-t INT number of threads to use [default: 1]
-T DIR temp directory [./outprefix.XXXXXXXXXXXX]
-A annotate the vcf with VEP
-K FILE path to speedseq.config file [default: same directory as speedseq]
-v verbose
-h show help message
speedseq var
produces a single indexed VCF file that is optionally annotated with VEP.
outprefix.vcf.gz
speedseq somatic
runs FreeBayes on a tumor/normal pair of BAM files
usage: speedseq somatic [options] <reference.fa> <normal.bam> <tumor.bam>
reference.fa genome reference fasta file
normal.bam germline BAM file(s) (comma separated BAMs from multiple libraries).
Must have readgroup information, and the SM readgroup tag will
be the VCF column header
tumor.bam tumor BAM file(s) (comma separated BAMs for multiple libraries).
Must have readgroup information, and the SM readgroup tag will
be the VCF column header
-o STR output prefix [default: tumor.bam]
-w FILE BED file of windowed genomic intervals. For human genomes,
we recommend using the annotations/ceph18.b37.include.2014-01-15.bed
(see Annotations)
-t INT number of threads to use [default: 1]
-F FLOAT require at least this fraction of observations supporting
an alternate allele within a single individual in order
to evaluate the position [0.05]
-C INT require at least this count of observations supporting
an alternate allele within a single individual in order
to evaluate the position [2]
-S FLOAT minimum somatic score (SSC) for PASS [18]
-q FLOAT minimum QUAL score to output non-passing somatic variants [1e-5]
-T DIR temp directory [./outprefix.XXXXXXXXXXXX]
-A annotate the vcf with VEP
-K FILE path to speedseq.config file (default: same directory as speedseq)
-v verbose
-h show help message
speedseq somatic
produces a single indexed VCF file that is optionally annotated with VEP.
outprefix.vcf.gz
speedseq sv
runs LUMPY on one or more BAM files, with optional breakend genotyping by SVTyper, and optional read-depth analysis by CNVnator.
-B FILE full BAM file(s) (comma separated) (required)
example: -B in1.bam,in2.bam,in3.bam
-S FILE split reads BAM file(s) (comma separated, order same as in -B)
(auto-generated if absent)
example: -S in1.splitters.bam,in2.splitters.bam,in3.splitters.bam
-D FILE discordant reads BAM file(s) (comma separated, order same as in -B)
(auto-generated if absent)
example: -D in1.discordants.bam,in2.discordants.bam,in3.discordants.bam
-R FILE indexed reference genome fasta file (required)
-o STR output prefix [in1.bam]
-t INT threads [1]
-x FILE BED file to exclude
-g genotype SV breakends with svtyper
-d calculate read-depth with CNVnator
-A annotate the vcf with VEP
-P output LUMPY probability curves in VCF
-m INT minimum sample weight for a call [default: 4]
-r FLOAT trim threshold [0]
-T DIR temp directory [./outprefix.XXXXXXXXXXXX]
-k keep temporary files
-K FILE path to speedseq.config file (default: same directory as speedseq)
-v verbose
-h show help message
speedseq sv
produces a bgzipped, indexed VCF file.
outprefix.sv.vcf.gz
speedseq realign
allows alignment from one or more BAM files, rather than FASTQ inputs. It automatically parses read group information from the BAM header to mark duplicates by library.
usage: speedseq realign [options] <reference.fa> <in1.bam> [in2.bam [...]]
reference.fa genome reference fasta file (indexed with bwa)
in.bam BAM file(s) (must contain read group tags)
-o STR output prefix [in.realign]
-I FLOAT[,FLOAT[,INT[,INT]]]
specify the mean, standard deviation (10% of the mean if absent), max
(4 sigma from the mean if absent) and min of the insert size distribution.
FR orientation only. [inferred]
-n rename reads for smaller file size
-t INT threads [1]
-T DIR temp directory [./output_prefix.XXXXXXXXXXXX]
-i include duplicates in splitters and discordants
(default: exclude duplicates)
-c INT maximum number of split alignments for a read to be
included in splitter file [default: 2]
-m INT minimum non-overlapping base pairs between two alignments
for a read to be included in splitter file [default: 20]
-M amount of memory in GB to be used for sorting [default: 20]
-K FILE path to speedseq.config file (default: same directory as speedseq)
-v verbose
-h show help message
speedseq realign
output is identical to that produced by speedseq align
.
Use speedseq align
to produce a sorted, duplicate-marked, BAM alignment from paired-end fastq data.
speedseq align \
-o NA12878 \
-R "@RG\tID:NA12878.S1\tSM:NA12878\tLB:lib1" \
human_g1k_v37.fasta \
NA12878.1.fq.gz \
NA12878.2.fq.gz
Note: if using an interleaved paired-end fastq file, use the -p
flag
speedseq align \
-o NA12878 \
-p \
-R "@RG\tID:NA12878.S1\tSM:NA12878\tLB:lib1" \
human_g1k_v37.fasta \
NA12878.interleaved.fq.gz
Use speedseq var
to call SNVs and indels on a single sample.
speedseq var \
-o NA12878 \
-w annotations/ceph18.b37.include.2014-01-15.bed \
human_g1k_v37.fasta \
NA12878.bam
Use speedseq sv
to call structural variants. The optional -g
and -d
flags perform breakend genotyping and read-depth calculation respectively
speedseq sv \
-o NA12878 \
-x annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed \
-g \
-d \
-B NA12878.bam \
-D NA12878.discordants.bam \
-S NA12878.splitters.bam
Use speedseq align
to produce a sorted, duplicate-marked, BAM alignment of each library.
speedseq align -o NA12878_S1 -R "@RG\tID:NA12878.S1\tSM:NA12878\tLB:lib1" \
human_g1k_v37.fasta \
NA12878.S1.1.fq.gz \
NA12878.S1.2.fq.gz
speedseq align -o NA12878_S2 -R "@RG\tID:NA12878.S2\tSM:NA12878\tLB:lib2" \
human_g1k_v37.fasta \
NA12878.S2.1.fq.gz \
NA12878.S2.2.fq.gz
speedseq align -o NA12878_S3 -R "@RG\tID:NA12878.S3\tSM:NA12878\tLB:lib3" \
human_g1k_v37.fasta \
NA12878.S3.1.fq.gz \
NA12878.S3.2.fq.gz
Merge the samples
sambamba merge NA12878_merged.bam NA12878_S1.bam NA12878_S2.bam NA12878_S3.bam
sambamba index NA12878_merged.bam
Use speedseq var
to call SNVs and indels.
speedseq var \
-o NA12878 \
-w annotations/ceph18.b37.include.2014-01-15.bed \
human_g1k_v37.fasta \
NA12878_merged.bam
Use speedseq sv
to call structural variants.
speedseq -sv \
-o NA12878 \
-x annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed \
-B NA12878_merged.bam \
-S NA12878_merged.splitters.bam \
-D NA12878_merged.discordants.bam \
-R human_g1k_v37.fasta
Use speedseq align
to produce sorted, duplicate-marked, BAM alignments for each sample.
speedseq align \
-o NA12877 \
-R "@RG\tID:NA12877.S1\tSM:NA12877\tLB:lib1" \
human_g1k_v37.fasta \
NA12877.1.fq.gz \
NA12877.2.fq.gz
speedseq align \
-o NA12878 \
-R "@RG\tID:NA12878.S1\tSM:NA12878\tLB:lib2" \
human_g1k_v37.fasta \
NA12878.1.fq.gz \
NA12878.2.fq.gz
speedseq align \
-o NA12879 \
-R "@RG\tID:NA12879.S1\tSM:NA12879\tLB:lib3" \
human_g1k_v37.fasta \
NA12879.1.fq.gz \
NA12879.2.fq.gz
Use speedseq var
to call SNVs and indels on multiple samples.
speedseq var \
-o cephtrio \
-w annotations/ceph18.b37.include.2014-01-15.bed \
human_g1k_v37.fasta \
NA12877.bam \
NA12878.bam \
NA12879.bam
Use speedseq sv
to call structural variants on multiple samples.
speedseq sv \
-o cephtrio \
-x annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed \
-B NA12877.bam,NA12878.bam,NA12879.bam \
-D NA12877.discordants.bam,NA12878.discordants.bam,NA12879.discordants.bam \
-S NA12877.splitters.bam,NA12878.splitters.bam,NA12879.splitters.bam
Use speedseq align
to produce sorted, duplicate-marked, BAM alignments for the tumor/normal pair
speedseq align \
-o TCGA-B6-A0I6.normal \
-p \
-R "@RG\tID:TCGA-B6-A0I6-10A-01D-A128-09\tSM:TCGA-B6-A0I6-10A-01D-A128-09\tLB:lib1" \
human_g1k_v37.fasta \
TCGA-B6-A0I6-10A-01D-A128-09.interleaved.fq.gz
speedseq align \
-o TCGA-B6-A0I6.tumor \
-p \
-R "@RG\tID:TCGA-B6-A0I6-10A-01D-A128-09\tSM:TCGA-B6-A0I6-10A-01D-A128-09\tLB:lib1" \
human_g1k_v37.fasta \
TCGA-B6-A0I6-01A-11D-A128-09.interleaved.fq.gz
Use speedseq somatic
to call SNVs and indels on the tumor/normal pair.
speedseq somatic \
-o TCGA-B6-A0I6 \
-w annotations/ceph18.b37.include.2014-01-15.bed \
-F 0.05 \
-q 1 \
human_g1k_v37.fasta \
TCGA-B6-A0I6.normal.bam \
TCGA-B6-A0I6.tumor.bam
Use speedseq sv
to call structural variants on the tumor/normal pair.
speedseq sv \
-o TCGA-B6-A0I6 \
-x annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed \
-B TCGA-B6-A0I6.normal.bam,TCGA-B6-A0I6.tumor.bam \
-D TCGA-B6-A0I6.normal.discordants.bam,TCGA-B6-A0I6.tumor.discordants.bam \
-S TCGA-B6-A0I6.normal.splitters.bam,TCGA-B6-A0I6.tumor.splitters.bam \
-R human_g1k_v37.fasta
Use speedseq align
to produce sorted, duplicate-marked, BAM alignments (as shown above).
Use speedseq var
to call SNVs and indels on multiple samples with high sensitivity.
speedseq var \
-o trio \
-w annotations/ceph18.b37.include.2014-01-15.bed \
-q 0.01 \
human_g1k_v37.fasta \
mother.bam \
father.bam \
child.bam
Filter variants for de novo status using GEMINI's built-in analysis tools.
SpeedSeq is available as a public AMI (Amazon Machine Image) on the Amazon Elastic Compute Cloud (EC2).
Log in to the Amazon AWS console
From the EC2 Dashboard, set your region to N. Virginia and click "Launch Instance"
Choose, "Community AMIs" in the sidebar and search for SpeedSeq in the dialog box.
Select hardware specifications. For deep whole genomes, we recommend c3.8xlarge (32 vCPUs, 60 GB RAM). For testing purposes any machine with 16 GB RAM is sufficient.
Add storage for the data. Note that the SpeedSeq footprint is ~ 26 GB (including the reference genome and GEMINI).
Launch the instance and log in.
ssh -i mykey.pem ec2-user@ec2-54-173-62-218.compute-1.amazonaws.com
Run the SpeedSeq test script.
./run_speedseq.sh
This should produce the following files:
speedseq var
or speedseq somatic
. However, you should not use the excluded regions, as these were designed for WGS data. SpeedSeq cannot detect SVs from exome data.If you encounter errors or strange behavior from SpeedSeq, please report them to the issues page with the following information:
Installation failure with error: "No targets specified and no makefile found."
Ensure that SpeedSeq was cloned with the
--recursive
flag
Installation failure while compiling FreeBayes or LUMPY (in the var and sv modules respectively)
These two components use BamTools, which requires CMake for compilation. Ensure that CMake is installed on your system
Installation reports, "WARNING: CNVnator not compiled because the ROOT package is not installed. Please see the README for instructions on manually installing ROOT."
This indicates that the ROOT package has not been installed, or the $ROOTSYS variable has not been set. See the CNVnator repository for details.
Runtime error: "ImportError: No module named argparse"
Ensure you are running Python 2.7 or later.
CNVnator fails to run during speedseq sv
This may be due to a problem with the ROOT installation. Try configuring the ROOT package without the --prefix flag. Then run
./configure make
Add /pathto/root/bin/thisroot.sh to the speedseq.config file
Python errors
Python errors commonly result from incompatibilities with older versions of Pysam. SpeedSeq runs on Pysam versions 0.8.0 and newer.