.. image:: https://img.shields.io/badge/License-GPLv3-blue.svg :target: https://opensource.org/licenses/GPL-3.0 :alt: GPLv3 License
.. contents:: :local: :depth: 2
Introduction ^^^^^^^^^^^^
Bolotie is a new algorithm designed to conduct mutational analysis and to detect recombinant forms and other anomalies among a very large set of viral sequences. methods implemented in Bolotie are designed such that novel sequences can also be analyzed efficiently without the need to rerun the entire protocol.
Varabyou, A., Pockrandt, C., Salzberg, S. L., & Pertea, M. (2021). Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie. Genetics, 218(3)
, iyab074.
Installation ^^^^^^^^^^^^
Building from source """"""""""""""""""""
If you want to build it from source, we recommend cloning the git repository as shown below.
::
$ git clone https://github.com/salzberg-lab/bolotie.git --recursive
$ cd bolotie
$ cmake -DCMAKE_BUILD_TYPE=Release .
$ make -j4
If you are using a very old version of Git (< 1.6.5) the flag --recursive
does not exist.
In this case you need to clone the submodule separately (git submodule update --init --recursive
).
Requirements
Operating System GNU/Linux
Architecture
Intel/AMD platforms that support POPCNT
Compiler GCC ≥ 4.9, Clang ≥ 3.8
Build system CMake ≥ 3.0
Language support C++14
Python3 requirements :: $ pip install -r requirements.txt
R dependencies: ape plyr
Getting started
^^^^^^^^^^^^^^^
A wrapper script run.py
is provided to run all steps of the protocol.
This script, while convenient, is intended for replicability and testing and lacks some of the available features of Bolotie.
Due to additional external dependencies, tree-building and plotting is not included as part of the script but the tree plotting script
implemented in R is also provided and described below. Lastly, the clade assignment .tsv file requires clades IDs to be listed as unique integer IDs in incremental order.
The plotting function only supports upto 8 clades but Bolotie.
Arguments:
-h, --help show this help message and exit
--input INPUT File containing all assembled genomes in FASTA format
--reference REFERENCE
File containing the reference genome in FASTA format
--outdir OUTDIR Output directory in which all output and temporary
data will be stored
--keep_tmp If enabled - temporary data will not be removed (default=False)
--threads THREADS Number of threads to use for parallelized sections of
the analysis. (default = 1)
--freq-threshold FREQ_THRESHOLD
Variant frequency threshold. (default = 100)
--trim-len TRIM_LEN Number of bases at 3' and 5' ends of each sequence kept same as the reference for all sequences. (default = 200)
--clusters CLUSTERS File with sequence to cluster assignments in a tsv
format:
The script completes by creating several important files in the output directory:
A plotting function plot_utree.R
is used to produce a plot of an unrooted tree colored by clades. The script also labels any anomalous sequences detected by Bolotie.
To run the script use the following command:
::
$ plot_utree.R treefile.nwk clusterfile.tsv palette.tsv output.png paths.
Treefile.nwk needs to be computed separately, but all other inputs can be found in the output directory of the run.py
.
Data ^^^^
If you want to evaluate specific sequences for recombination events, you can download the probability table which we computed on 87,695 complete high-coverage genomes obtained from GISAID (September 2). This significantly reduces the running time from several hours to seconds.
ftp://ftp.ccb.jhu.edu/pub/data/bolotie_sars_cov_2/ <ftp://ftp.ccb.jhu.edu/pub/data/bolotie_sars_cov_2/>
_