BURST (formerly known as embalmer) is an optimal, high-speed pairwise sequence aligner specialized in aligning many NGS short reads against large reference databases.
As next-generation DNA sequencing data emerges faster than computational power can keep up, approximate heuristic solutions to the fundamental DNA alignment/mapping problem are increasingly used. Paradoxically, the more data we have, the less accurate the alignment algorithms used to analyze it seem to become. Algorithms that achieve the perfect sensitivity and specificity attainable under mismatch constraints have been neglected in favor of techniques promising speedier alignment at the cost of absolute alignment quality (under some metrics of precision/accuracy/sensitivity/recall). BURST returns to the roots of provably optimal alignment algorithms, reinvigorating them with speedups as high as millions-fold without sacrificing any alignment quality whatsoever in the default operating modes.
BURST is a truly, mathematically optimal high-throughput end-to-end short-read DNA aligner. It supports:
BURST does not currently implement the following, although all these are planned in future releases:
See the Releases page for precompiled binaries for a variety of systems with no dependencies. Basically, just download one of the files on the releases page appropriate for your system (Windows, Linux, or Mac) and run it on the command line. If on Windows, pick an ".exe" version; if on macOS, pick a ".mac" version; if on Linux, pick a ".linux" version. If the default version (burst.exe, burst.mac, burst.linux) doesn't work, try the corresponding version with ".older" in the name, and if that still doesn't work, try the one with ".buzzard" in the name. Please let me know if you can't get the program to run on your system.
burst -r myRefs.fasta -q myQueries.fasta -o myAlignments.b6
Create database
burst -r MyDB.fasta -a MyDB.acx -o MyDB.edx -d DNA -s
Search
The default search mode, CAPITALIST, reports the smallest set of references necessary to explain all tied hits:
burst -q myQueries.fasta -a MyDB.acx -r MyDB.edx -o output.txt
Note that burst can also report LCA taxonomy for each query sequence if taxonomy is provided with -b MyDB.tax (a tab-delimited taxonomy file where the first column contains the entire sequence header for each sequence in the original fasta file, and the second column contains semicolon-separated taxonomy). In this case the command above becomes:
burst -q myQueries.fasta -a MyDB.acx -r MyDB.edx -b MyDB.tax -o output.txt
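The taxonomy file layout described above, and the general idea of a lowest common ancestor (LCA), can be sketched in Python. This is only an illustration under stated assumptions: the function names, the sample headers, and the sample lineages are made up, and the LCA here is a plain longest-shared-prefix, which is not necessarily how BURST computes it.

```python
import io

def read_taxonomy(handle):
    """Map each full sequence header (column 1) to its taxonomy levels (column 2)."""
    taxonomy = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        header, lineage = line.split("\t", 1)
        taxonomy[header] = [level.strip() for level in lineage.split(";")]
    return taxonomy

def lowest_common_ancestor(lineages):
    """Return the longest prefix of levels shared by every lineage (a plain LCA)."""
    shared = []
    for levels in zip(*lineages):
        if len(set(levels)) != 1:
            break
        shared.append(levels[0])
    return shared

# Hypothetical two-line taxonomy file; headers and lineages are invented.
example = io.StringIO(
    "seqA some description\tk__Bacteria;p__Firmicutes;c__Bacilli\n"
    "seqB\tk__Bacteria;p__Firmicutes;c__Clostridia\n"
)
tax = read_taxonomy(example)
lca = lowest_common_ancestor([tax["seqA some description"], tax["seqB"]])
print(";".join(lca))
```

Note that the whole header, spaces included, is used as the key, matching the requirement that the first column contain the entire sequence header.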
BEST mode (report first best hit):
burst -q myQueries.fasta -a MyDB.acx -r MyDB.edx -m BEST -b MyDB.tax -o output.txt
ALLPATHS mode (larger output file; report all ties for best hit for every query sequence):
burst -q myQueries.fasta -a MyDB.acx -r MyDB.edx -m ALLPATHS -b MyDB.tax -o output.txt
Generate a database and accelerator:
burst -r MyDB.fasta -d DNA 320 -o MyDB.edx -a MyDB.acx -s 1 -i 0.97
Add -f when building to enable fingerprint clustering. If you have insufficient memory to make a database using -d DNA, consider using -dp 2 or higher (partitions ease memory use), or use the non-compressive database mode -d QUICK.
Search using the database and accelerator:
burst -r MyDB.edx -a MyDB.acx -q MyQueries.fasta -o myAlignments.b6
Other alignment modes, taxonomy parsing, tie-reporting, etc:
Output alignments are stored in the resulting .b6 file. This is a tab-delimited text file in BLAST-6 column format, except that columns 11 and 12 instead report the total edit distance (the total number of differences between query and reference) and whether the query is an exact duplicate of the query above it (1 if so), respectively. If taxonomy is assigned (-m CAPITALIST -b taxonomy.txt), that particular read's taxonomy (interpolated if CAPITALIST) is reported in column 13.
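A minimal sketch of reading such a file in Python follows. It assumes the first three columns follow the usual BLAST-6 order (query, reference, percent identity) and that the edit distance is written as an integer; the BurstHit name, its field names, and the sample line are all invented for illustration.

```python
import io
from typing import NamedTuple, Optional

class BurstHit(NamedTuple):
    query: str
    reference: str
    identity: float          # column 3, percent identity
    edit_distance: int       # column 11 (assumed to be an integer)
    is_duplicate: bool       # column 12, "1" if duplicate of the query above
    taxonomy: Optional[str]  # column 13, present only when taxonomy is assigned

def read_b6(handle):
    """Yield one BurstHit per non-empty tab-delimited line."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        yield BurstHit(
            query=fields[0],
            reference=fields[1],
            identity=float(fields[2]),
            edit_distance=int(fields[10]),
            is_duplicate=(fields[11] == "1"),
            taxonomy=fields[12] if len(fields) > 12 else None,
        )

# Made-up example line, not real BURST output.
sample = io.StringIO(
    "read1\trefA\t98.0\t100\t2\t0\t1\t100\t5\t104\t2\t0\tk__Bacteria;p__Firmicutes\n"
)
hits = list(read_b6(sample))
print(hits[0].edit_distance, hits[0].is_duplicate, hits[0].taxonomy)
```

Splitting on tabs directly, rather than using a CSV reader with quoting, avoids mangling headers that happen to contain quote characters.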
To find the latest version of BURST, see How above.
Please contact Gabe Al-Ghalith or Dan Knights (I'm sure you can find our contact info!)
I downloaded the program for my system but it won't run! Says "Permission denied" or "command not found": If on Linux or Mac, you may have to run the command "chmod +x" on the program first, and then run the program inside of the directory that contains it using a dot and slash before the name (for example, on Linux: "./burst.linux" if the file "burst.linux" is within the current working directory of the terminal). Another solution is to add the directory containing the program to the system PATH. This technique may vary by operating system and terminal type.
All queries align with 100% identity, many to the same strange reference sequence: Uh oh, looks like your database contains long series of "N"s (ambiguous bases). All ambiguities are resolved according to IUPAC standards: for example, the nucleotide "K" matches "Y" but not "M", although "M" matches "Y". As a consequence, N matches anything perfectly. Although this opens up exciting new possibilities for leveraging ambiguity in aligning to SNP-aware databases, pseudo-clusters, and more, a stretch of 300 N's present in some poorly-curated databases will match any length-300 query perfectly. Disable this behavior by passing -n or --npenalize, which will force N's to be treated as mismatches against A, C, G, or T/U in the query. N will still be considered a match to any ambiguous nucleotides in the query.
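The matching rule described above can be illustrated with the standard IUPAC nucleotide codes: two symbols match when the sets of bases they stand for overlap. The sketch below is an illustration only, not BURST's implementation; the function name and the penalize_n parameter are hypothetical stand-ins for the behavior described for -n, modeled here only for an N in the reference.

```python
# Standard IUPAC nucleotide codes; U is treated as T.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
UNAMBIGUOUS = set("ACGTU")

def bases_match(ref_base, query_base, penalize_n=False):
    """Return True if the two symbols share at least one possible base.

    With penalize_n=True (hypothetical analogue of -n), a reference N is a
    mismatch against an unambiguous query base, but still matches an
    ambiguous query symbol.
    """
    ref_base, query_base = ref_base.upper(), query_base.upper()
    if penalize_n and ref_base == "N" and query_base in UNAMBIGUOUS:
        return False
    return bool(set(IUPAC[ref_base]) & set(IUPAC[query_base]))

print(bases_match("K", "Y"))  # K = G/T and Y = C/T share T
print(bases_match("K", "M"))  # K = G/T and M = A/C share nothing
print(bases_match("M", "Y"))  # M = A/C and Y = C/T share C
print(bases_match("N", "A"), bases_match("N", "A", penalize_n=True))
```

Because N stands for all four bases, it overlaps with every other symbol, which is why a long run of N's in a reference matches any query unless it is penalized.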
I get "segmentation fault" (or other crash): This is likely a bug with BURST! Please contact me with no less than the following and I'll try to fix it:
I get no alignments with my amplicon reads, even though I know they're legit: Try reverse complementing (-fr). If that doesn't work, try removing sequencing platform adaptors and cleaning up and trimming the reads with a QC pipeline.
Other program(s) give me more alignments; how can you say this is optimal?: First, more alignments doesn't mean correct alignments. Second, be careful when comparing technologies; BURST is a short-read aligner. It does not do local alignment like "BLAST" and hence does not do soft-trimming -- this is very much intentional and part of ensuring optimality of end-to-end alignments. An alignment of identity 97% spanning 97% of a query means that query is actually 97% x 97% = ~94% identical to its matched reference throughout.
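The arithmetic in the last sentence can be written out as a short Python sketch. The function name is made up, and the calculation assumes, as the example does, that the part of the query not covered by the local alignment contributes no matches.

```python
def end_to_end_identity(local_identity, query_coverage):
    """Identity over the whole query when only part of it is aligned.

    Assumes the unaligned portion of the query counts entirely as differences.
    """
    return local_identity * query_coverage

# 97% identity over 97% of the query is about 94% identity end to end.
print(f"{end_to_end_identity(0.97, 0.97):.4f}")  # 0.9409
```

This is why a local alignment reported at 97% identity would not pass a 97% end-to-end identity threshold.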
It won't compile! It is not recommended to compile this software yourself unless you have the Intel compiler and a lot of patience for profile-guided optimization using multi-pass compilation. Your binary probably won't be as fast as the one provided on the releases page. If you nonetheless insist, you must use ICC or GNU GCC (NOT Apple/LLVM CLANG) and provide the additional compiler flags -march=corei7 (or newer) and -fwhole-program -O3 (or -Ofast).
Al-Ghalith, Gabriel and Dan Knights. BURST enables optimal exhaustive DNA alignment for big data. 2017. DOI: doi.org/10.5281/zenodo.806850