pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

Bifrost

Parallel construction, indexing and querying of colored and compacted de Bruijn graphs

Other tools integrating or using Bifrost: Kallisto, Ratatosk, ggCaller, popIns2, PLAST and more.

Requirements

It is highly recommended to install Bifrost from source. However, a Conda installation is possible (see Section Installation). The requirements of Bifrost are pre-installed by default on most operating systems:

In case you are missing one or more of those:

Installation

Large k-mers

The default maximum k-mer size supported is 31. To work with larger k in the binary, you must install Bifrost from source and replace MAX_KMER_SIZE with a larger multiple of 32. This can be done in two ways:

The actual maximum k-mer size is MAX_KMER_SIZE-1, e.g., the maximum k is 63 for MAX_KMER_SIZE=64. Increasing MAX_KMER_SIZE increases the memory usage of Bifrost (k=31 uses 8 bytes of memory per k-mer while k=63 uses 16 bytes of memory per k-mer).

The maximum size of minimizers (g-mers) MAX_GMER_SIZE can be adjusted the same way as MAX_KMER_SIZE. This is especially useful if you want to use a large k-mer size but a small g-mer size. By default, MAX_GMER_SIZE is equal to MAX_KMER_SIZE.

To work with larger k when using the Bifrost API, the new value of MAX_KMER_SIZE must be given to the compiler and linker, as explained in Section API.

Binary usage:

Bifrost

displays the command line interface:

Bifrost x.y.z

Highly parallel construction, indexing and querying of colored and compacted de Bruijn graphs

Usage: Bifrost [COMMAND] [PARAMETERS]

[COMMAND]:

   build                   Build a compacted de Bruijn graph, with or without colors
   update                  Update a compacted (colored) de Bruijn graph with new sequences
   query                   Query a compacted (colored) de Bruijn graph

[PARAMETERS]: build

   > Mandatory with required argument:

   -s, --input-seq-file     Input sequence file in fasta/fastq(.gz) format
                            Multiple files can be provided as a list in a text file (one file per line)
                            K-mers with exactly 1 occurrence in the input sequence files will be discarded
   -r, --input-ref-file     Input reference file in fasta/fastq(.gz) or gfa(.gz) format
                            Multiple files can be provided as a list in a text file (one file per line)
                            All k-mers of the input reference files are used
   -o, --output-file        Prefix for output file(s)

   > Optional with required argument:

   -t, --threads            Number of threads (default: 1)
   -k, --kmer-length        Length of k-mers (default: 31)
   -m, --min-length         Length of minimizers (default: auto)
   -B, --bloom-bits         Number of Bloom filter bits per k-mer (default: 24)
   -T, --tmp-dir            Path for tmp directory (default: creates tmp directory in output directory)
   -l, --load-mbbf          Input Blocked Bloom Filter file, skips filtering step (default: no input)
   -w, --write-mbbf         Output Blocked Bloom Filter file (default: no output)

   > Optional with no argument:

   -c, --colors             Color the compacted de Bruijn graph
   -i, --clip-tips          Clip tips shorter than k k-mers in length
   -d, --del-isolated       Delete isolated contigs shorter than k k-mers in length
   -f, --fasta-out          Output file in fasta format (only sequences) instead of gfa (unless graph is colored)
   -b, --bfg-out            Output file in bfg/bfi format (Bifrost graph/index) instead of gfa (unless graph is colored)
   -n, --no-compress-out    Output files must be uncompressed
   -N, --no-index-out       Do not make index file
   -v, --verbose            Print information messages during execution

[PARAMETERS]: update

  > Mandatory with required argument:

   -g, --input-graph-file   Input graph file to update in gfa(.gz) or bfg format
   -s, --input-seq-file     Input sequence file in fasta/fastq(.gz) format
                            Multiple files can be provided as a list in a text file (one file per line)
                            K-mers with exactly 1 occurrence in the input sequence files will be discarded
   -r, --input-ref-file     Input reference file in fasta/fastq(.gz) or gfa(.gz) format
                            Multiple files can be provided as a list in a text file (one file per line)
                            All k-mers of the input reference files are used
   -o, --output-file        Prefix for output file(s)

   > Optional with required argument:

   -I, --input-index-file   Input index file associated with graph to update in bfi format
   -C, --input-color-file   Input color file associated with graph to update in color.bfg format
   -t, --threads            Number of threads (default: 1)
   -k, --kmer-length        Length of k-mers (default: read from input graph file if built with Bifrost or 31)
   -m, --min-length         Length of minimizers (default: read from input graph if built with Bifrost, auto otherwise)
   -T, --tmp-dir            Path for tmp directory (default: creates tmp directory in output directory)

   > Optional with no argument:

   -i, --clip-tips          Clip tips shorter than k k-mers in length
   -d, --del-isolated       Delete isolated contigs shorter than k k-mers in length
   -f, --fasta-out          Output file in fasta format (only sequences) instead of gfa (unless colors are output)
   -b, --bfg-out            Output file in bfg/bfi format (Bifrost graph/index) instead of gfa (unless graph is colored)
   -n, --no-compress-out    Output files must be uncompressed
   -N, --no-index-out       Do not make index file
   -v, --verbose            Print information messages during execution

[PARAMETERS]: query

  > Mandatory with required argument:

   -g, --input-graph-file   Input graph file to query in gfa(.gz) or bfg
   -q, --input-query-file   Input query file in fasta/fastq(.gz). Each record is a query.
                            Multiple files can be provided as a list in a text file (one file per line)
   -o, --output-file        Prefix for output file

   > Optional with required argument:

   -e, --min_ratio-kmers    Minimum ratio of k-mers from each query that must occur in the graph
   -E, --min-nb-colors      Minimum number of colors from each query that must occur in the graph
   -I, --input-index-file   Input index file associated with graph to query in bfi format
   -C, --input-color-file   Input color file associated with the graph to query in color.bfg format
   -t, --threads            Number of threads (default: 1)
   -k, --kmer-length        Length of k-mers (default: read from input graph if built with Bifrost or 31)
   -m, --min-length         Length of minimizers (default: read from input graph if built with Bifrost, auto otherwise)
   -T, --tmp-dir            Path for tmp directory (default: creates tmp directory in output directory)

   > Optional with no argument:

   -Q, --files-as-queries   All fasta/fastq records in each input query file constitute a single query.
   -p, --ratio-found-km     Output the ratio of found k-mers for each query (disable parameters -e and -E)
   -a, --approximate        Graph is searched using exact and inexact k-mers (1 substitution or indel allowed per k-mer)
   -v, --verbose            Print information messages during execution
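
The query options -e and -p above both rely on the ratio of k-mers of a query that occur in the graph. The following is a minimal sketch of that ratio, not Bifrost's implementation: the names ratio_found_kmers and passes_min_ratio are hypothetical, the graph is replaced by a plain set of k-mer strings, reverse complements are ignored, and every k-mer position of the query is assumed to count once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Fraction of the k-mers of `query` that are present in `graph_kmers`.
// `graph_kmers` is a hypothetical stand-in for the k-mer index of the graph.
double ratio_found_kmers(const std::string& query,
                         const std::unordered_set<std::string>& graph_kmers,
                         std::size_t k) {
    assert(k > 0);
    if (query.size() < k) return 0.0;  // the query holds no k-mer at all
    const std::size_t total = query.size() - k + 1;
    std::size_t found = 0;
    for (std::size_t i = 0; i < total; ++i) {
        if (graph_kmers.count(query.substr(i, k)) != 0) ++found;
    }
    return static_cast<double>(found) / static_cast<double>(total);
}

// Decide whether a query would be reported for a given -e threshold.
// The threshold is assumed to be inclusive in this sketch.
bool passes_min_ratio(const std::string& query,
                      const std::unordered_set<std::string>& graph_kmers,
                      std::size_t k, double min_ratio) {
    return ratio_found_kmers(query, graph_kmers, k) >= min_ratio;
}
```

For instance, with k=3, a set holding ACG and CGT, and the query ACGTT, two of the three query k-mers are found, so the query passes a threshold of 0.5 but not one of 0.8.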

Use cases

The following use cases describe some simple and common usage of the Bifrost CLI. However, many more options are provided by the CLI to perform more specific operations (graph cleaning, approximate querying, etc.).

API

Changes in the API are reported in the Changelog.

Tutorial

The API tutorial should help you get started with the C++ API.

Documentation

Documentation for the Bifrost library is available in the /doc/doxygen folder. The following command generates the documentation files:

cd <bifrost_directory>
doxygen Doxyfile

Then, open html/index.html. The documentation contains a description of all the functions and structures of the library.

Usage

The Bifrost C++ API can be used by adding

#include <bifrost/CompactedDBG.hpp>

for uncolored compacted de Bruijn graphs and

#include <bifrost/ColoredCDBG.hpp>

for colored compacted de Bruijn graphs in your C++ headers.

To compile, we recommend using the following compile flags:

-O3 -std=c++11

Furthermore, Bifrost compiles by default with flag -march=native so unless native compilation was disabled when installing Bifrost, use flag -march=native too.

Finally, use the following flags for linking:

-lbifrost -pthread -lz

You can also link to the Bifrost static library (libbifrost.a) for better performance:

<path_to_lib_folder>/libbifrost.a -pthread -lz

The default maximum k-mer size supported is 31. To work with larger k, the code using the Bifrost C++ API must be compiled and linked with the flag -DMAX_KMER_SIZE=x, where x is a larger multiple of 32, such as:

-DMAX_KMER_SIZE=64

The actual maximum k-mer size is MAX_KMER_SIZE-1, e.g., the maximum k is 63 for MAX_KMER_SIZE=64. Increasing MAX_KMER_SIZE increases the memory usage of Bifrost (k=31 uses 8 bytes of memory per k-mer while k=63 uses 16 bytes of memory per k-mer).
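
The relation between k, MAX_KMER_SIZE and memory can be sketched with a few helper functions. This is only an illustration under stated assumptions: the helper names are hypothetical and not part of the Bifrost API, the fallback value 32 is inferred from the default maximum k of 31, and the byte count assumes that the two data points given above (8 bytes for 32, 16 bytes for 64) extend linearly to larger multiples of 32.

```cpp
#include <cassert>
#include <cstddef>

// Fall back to the value implied by the documented default (maximum k of 31)
// when the code is compiled without -DMAX_KMER_SIZE=x.
#ifndef MAX_KMER_SIZE
#define MAX_KMER_SIZE 32
#endif

static_assert(MAX_KMER_SIZE % 32 == 0, "MAX_KMER_SIZE must be a multiple of 32");

// Largest usable k for a given MAX_KMER_SIZE (documented as MAX_KMER_SIZE-1).
constexpr std::size_t max_k(std::size_t max_kmer_size) {
    return max_kmer_size - 1;
}

// Assumed footprint: 8 bytes per block of 32 nucleotides (hypothetical helper).
constexpr std::size_t bytes_per_kmer(std::size_t max_kmer_size) {
    return (max_kmer_size / 32) * 8;
}

// Smallest multiple of 32 that is strictly greater than the requested k.
inline std::size_t required_max_kmer_size(std::size_t k) {
    assert(k > 0);
    return ((k + 32) / 32) * 32;
}
```

For example, required_max_kmer_size(63) gives 64, which matches the flag -DMAX_KMER_SIZE=64 shown above, while a request for k=32 already needs 64 because the maximum k is one less than MAX_KMER_SIZE.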

FAQ

Can I provide multiple files in input?

Yes, use parameter -r or -s for each file to input.

Can I provide a list of files in input?

Yes, a text file containing one input filename per line, with no empty lines, can be used as input.

What are the accepted input file formats?

FASTA, FASTQ, GFA and Bifrost binary file format. Input FASTA, FASTQ and GFA can be compressed with gzip (extension .gz). If you input a GFA file for the construction, use the -r parameter.

Can I use different file formats in input?

Yes.

If I input a GFA file for building the de Bruijn graph, does it need to contain an already compacted de Bruijn graph?

No, it can contain any type of sequence graph (like an uncompacted de Bruijn graph or a sequence graph).

Can I build a compacted (colored) de Bruijn graph from assembled genomes and reads?

Yes. Input your assembled genomes with parameter -r and your reads with parameter -s.
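
The difference between the two parameters is which k-mers are kept: all k-mers of the files given with -r are used, whereas k-mers with exactly one occurrence in the files given with -s are discarded. The sketch below illustrates that selection rule with exact counting in a hash map. It is a stand-in for illustration only and not how Bifrost does it: select_kmers is a hypothetical name, sequences are plain strings rather than files, and reverse complements are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Select the k-mers that would go into the graph under the documented rule:
// every k-mer of the references (-r), plus the k-mers seen at least twice
// across all reads (-s). Exact counting is used here for clarity.
std::unordered_set<std::string> select_kmers(const std::vector<std::string>& refs,
                                             const std::vector<std::string>& reads,
                                             std::size_t k) {
    assert(k > 0);
    std::unordered_set<std::string> kept;

    // References: keep every k-mer, whatever its number of occurrences.
    for (const std::string& ref : refs) {
        for (std::size_t i = 0; i + k <= ref.size(); ++i) {
            kept.insert(ref.substr(i, k));
        }
    }

    // Reads: count occurrences over all reads, then drop the singletons.
    std::unordered_map<std::string, std::size_t> counts;
    for (const std::string& read : reads) {
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            ++counts[read.substr(i, k)];
        }
    }
    for (const auto& entry : counts) {
        if (entry.second >= 2) kept.insert(entry.first);
    }
    return kept;
}
```

With k=3, the reference ACGTA and the reads TTTGA and TTTCC, the selection holds ACG, CGT and GTA from the reference plus TTT, the only read k-mer seen twice.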

Can I use a graph file without its color file?

Yes. Just do not input the color file and Bifrost will consider it is an uncolored compacted de Bruijn graph.

In which order are the colors inserted?

A color corresponds to an input file the graph was built/updated from. The order in which the colors are inserted is the same as the order of the files given by parameter -r and parameter -s. However, in case both parameters -r and -s are used, no assumption can be made on whether the files given by parameter -s will be inserted before or after the ones given by parameter -r.

Different runs of Bifrost on the same dataset with the same parameters produce graphs with different unitigs. Which graph is correct?

All of them. The difference between the graphs resides in circular unitigs (unitigs connecting to themselves) which are their own connected components ("isolated"). These unitigs can have a different sequence from one run to another because the starting position will be different, yet they represent exactly the same sequence. As an example, circular unitig ATAT composed of 3-mers can also be represented with sequence TATA. The number of unitigs will remain the same from one graph to another.
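
The ATAT/TATA example can be checked with a short sketch. The function names are hypothetical, and the test is a simplification: two unitig strings are treated as the same circular unitig when both close on themselves (the last k-1 characters equal the first k-1) and they contain the same set of k-mers. Reverse complements are ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Distinct k-mers read along a unitig string.
std::set<std::string> kmer_set(const std::string& unitig, std::size_t k) {
    assert(k > 0);
    std::set<std::string> kmers;
    for (std::size_t i = 0; i + k <= unitig.size(); ++i) {
        kmers.insert(unitig.substr(i, k));
    }
    return kmers;
}

// True when the last k-mer overlaps the first one by k-1 characters,
// i.e. the unitig connects back to itself.
bool closes_on_itself(const std::string& unitig, std::size_t k) {
    assert(k > 1);
    if (unitig.size() < k) return false;
    return unitig.compare(unitig.size() - (k - 1), k - 1, unitig, 0, k - 1) == 0;
}

// Simplified equivalence test for two spellings of a circular unitig.
bool same_circular_unitig(const std::string& a, const std::string& b, std::size_t k) {
    return closes_on_itself(a, k) && closes_on_itself(b, k) &&
           kmer_set(a, k) == kmer_set(b, k);
}
```

For k=3, both ATAT and TATA contain exactly the 3-mers ATA and TAT and both close on themselves, so same_circular_unitig returns true for this pair.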

Is it possible to get the colors per k-mer in a parsable (non-binary) file format?

Yes, please see this solution

Benchmarking

Here are a few guidelines to benchmark Bifrost:

Troubleshooting

Assuming the header files (.h) are located at the path /usr/local/include/, the following commands set the environment variables C_INCLUDE_PATH and CPLUS_INCLUDE_PATH correctly for the duration of the session:

export C_INCLUDE_PATH=$C_INCLUDE_PATH:/usr/local/include/
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/include/

Assuming that libbifrost.(so|dylib|a) is located at the path /usr/local/lib/, the following commands set the environment variables LD_LIBRARY_PATH, LIBRARY_PATH and PATH correctly for the duration of the session:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/
export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/lib/
export PATH=$PATH:/usr/local/lib/

Citation

@article{holley2020bifrost,
   title="{Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs}",
   author={Holley, Guillaume and Melsted, P{\'a}ll},
   journal={Genome Biology},
   volume={21},
   article={249},
   year={2020}
}

Contact

For any question, feedback or problem, please feel free to file an issue on this GitHub repository and we will get back to you as soon as possible.

License