knights-lab / BURST

An ultrafast optimal aligner for mapping large NGS data to large genome databases.
GNU Affero General Public License v3.0
57 stars 9 forks source link

BURST

BURST (formerly known as embalmer) is an optimal, high-speed pairwise sequence aligner specialized in aligning many NGS short reads against large reference databases.

Why

As next-generation DNA sequencing data emerges faster than computational power can keep up, approximate heuristic solutions to the fundamental DNA alignment/mapping problem are increasingly used. Paradoxically, it seems, the more data we have, the less accurate the alignment algorithms used to analyze it. Algorithms with perfect sensitivity and specificity acheivable under mismatch constraints have been neglected in favor of techniques promising speedier alignment at the cost of absolute alignment quality (under some metrics of precision/accuracy/sensitivity/recall). BURST returns to the roots of provably optimal alignment algorithms, reinvigorating them with speedups as high as millions-fold without sacrificing any alignment quality whatsoever in the default operating modes.

What

BURST is a truly, mathematically optimal high-throughput end-to-end short-read DNA aligner. It supports:

What not

BURST does not currently implement the following, although all these are planned in future releases:

What now

How

See Releases page for precompiled binaries for a variety of systems with no dependencies. Basically, just download one of the files on the releases page appropriate for your system (Windows, Linux, or Mac) and run it on the command line. If on Windows, pick an ".exe" version; if on macOS pick a ".mac" version; if on Linux pick a ".linux" version. If the default version (burst.exe, burst.mac, burst.linux) doesn't work, try the corresponding version with ".older" in the name, and if that still doesn't work, try the one with ".buzzard." Please let me know if you can't get the program to run on your system.

Easiest (Not for large reference databases or long reference sequences such as gigabase-length eukaryotic genomes):

burst -r myRefs.fasta -q myQueries.fasta -o myAlignments.b6

Fastest (short version):

  1. Create database

    burst -r MyDB.fasta -a MyDB.acx -o MyDB.edx -d DNA -s
  2. Search

The default search mode, CAPITALIST, reports the smallest set of references necessary to explain all tied hits:

burst -q myQueries.fasta -a MyDB.acx -r MyDB.edx -o output.txt

Note that burst can also report LCA taxonomy for each query sequence if taxonomy is provided with -b MyDB.tax (a tab-delimited taxonomy file where the first column contains the entire sequence header for each sequence in the original fasta file, and the second column contains semi-colon-separated taxonomy). In this case the command above becomes:

burst -q myQueries.fasta -a MyDB.acx -r MyDB.edx -b MyDB.tax -o output.txt

BEST mode (report first best hit):

burst -q myQueries.fasta -a MyDB.acx -r MyDB.edx -m BEST -b MyDB.tax -o output.txt

ALLPATHS mode (larger output file; report all ties for best hit for every query sequence):

burst -q myQueries.fasta -a MyDB.acx -r MyDB.edx -m ALLPATHS -b MyDB.tax -o output.txt

Fastest (longer, more detailed version):

  1. Decide on the maximum lengths your queries will be, and the minimum identity you require of qualifying alignments. For example, for max query length of 320 bases and minimum identity of 0.97 (97%), you'd pass "-d DNA 320" and "-i 0.97" like below. Note: databases assuming shorter maximum query length and higher minimum identities will run faster. If you only have HiSeq 125-bp data and you're only interested in alignments of 98% identity or better, you'd want to use something like "-d DNA 125" and "-i 0.98" instead.
  2. Run burst -r MyDB.fasta -d DNA 320 -o MyDB.edx -a MyDB.acx -s 1 -i 0.97 to generate a database and accelerator.
  3. (optional, advanced) To refine the database, you can specify -f when building to enable fingerprint clustering. Conversely, if you have insufficient memory to make a database using -d DNA, consider using -dp 2 or higher (partitions ease memory use) or use the non-compressive database mode -d QUICK.
  4. Use the database for all future alignments: burst -r MyDB.edx -a MyDB.acx -q MyQueries.fasta -o myAlignments.b6

Other alignment modes, taxonomy parsing, tie-reporting, etc:

Where

Output alignments are stored in the resulting .b6 file. This is a tab-delimited text file in BLAST-6 column format. Columns 11 and 12 instead refer to total edit distance (number of differences between query and reference in total) and whether the query is an exact duplicate of the query above it (1 if so), respectively. If taxonomy is assigned (-m CAPITALIST -b taxonomy.txt), that particular read's (interpolated if CAPITALIST) taxonomy is reported in column 13.

To find the latest version of BURST, see How above.

Who

Please contact Gabe Al-Ghalith or Dan Knights* (I'm sure you can find our contact info!)

Woah (troubleshooting)

  1. I downloaded the program for my system but it won't run! Says "Permission denied" or "command not found": If on Linux or Mac, you may have to run the command "chmod +x" on the program first, and then run the program inside of the directory that contains it using a dot and slash before the name (for example, on Linux: "./burst.linux" if the file "burst.linux" is within the current working directory of the terminal). Another solution is to add the directory containing the program to the system PATH. This technique may vary by operating system and terminal type.

  2. All queries align with 100% identity, many to the same strange reference sequence: Uh oh, looks like your database contains long series of "N"s (ambiguous bases). Because all ambiguities are resolved according to IUPAC standards, N actually matches perfectly to anything. For example, the nucleotide "K" matches "Y" but not "M," although "M" matches "Y". Although this opens up exciting new possibilities for leveraging ambiguity in aligning to SNP-aware databases, psuedo-clusters, and more, a stretch of 300 N's present in some poorly-curated databases will match any length-300 query perfectly. Disable this behavior by passing -n or --npenalize, which will force N's to be treated as mismatches against A, C, G, or T/U in the query. N will still be considered a match to any ambiguous nucleotides in the query.

  3. I get "segmentation fault" (or other crash): This is likely a bug with BURST! Please contact me with no less than the following and I'll try to fix it:

    • The exact command-line used to run the program
    • The version of BURST used (run with -h to see help)
    • The operating system and amount of RAM (memory) in the computer running it
    • A minimalistic example of input and output to reproduce the problem. If it occurs using a DB (.edx), include the fasta file used to produce the DB.
  4. I get no alignments with my amplicon reads, even though I know they're legit: Try reverse complementing (-fr). If that doesn't work, try removing sequencing platform adaptors and cleaning up and trimming the reads with a QC pipeline.

  5. Other program(s) give me more alignments; how can you say this is optimal?: First, more alignments doesn't mean correct alignments. Second, be careful when comparing technologies; BURST is a short-read aligner. It does not do local alignment like "BLAST" and hence does not do soft-trimming -- this is very much intentional and part of ensuring optimality of end-to-end alignments. An alignment of identity 97% spanning 97% of a query means that query is actually 97% x 97% = ~94% identical to its matched reference throughout.

  6. It won't compile! It is not recommended to compile this software yourself unless you have the Intel compiler and a lot of patience for profile-guided optimization using multi-pass compilation. Your binary probably won't be as fast as the one provided on the release page. If you nonetheless insist, you must use ICC or GNU GCC (NOT Apple/LLVM CLANG) and provide the additional compiler flag -march=corei7 (or newer) and -fwhole-program -O3 (or -Ofast)

Cite

Al-Ghalith, Gabriel and Dan Knights. BURST enables optimal exhaustive DNA alignment for big data. DOI 2017:doi.org/10.5281/zenodo.806850

DOI (please cite using DOI until manuscript is published)