This repo is deprecated.
If you need help finishing a project using Cenote-Taker 2/Unlimited Breadsticks, I will still field questions/troubleshoot (open an issue).
Otherwise:
Please use Cenote-Taker 3. It's great!!
######### ######### ######### ######### ######### ######### ######### ######### #########
Consumers are presented with a dilemma at certain Italian-American eateries that provide "unlimited breadsticks" with every meal. On the one hand, the value of their dollar undoubtedly increases as more and more delicious breadsticks are eaten. On the other hand, overconsumption of this appetizer can preclude thorough enjoyment and completion of the forthcoming entree. Furthermore, stomach aches may ensue, and the health of the consumer may ultimately be jeopardized. Similarly, virus hunters employing high-throughput virus discovery tools on dataset after dataset may feel the exciting buzz as fasta files full of putative virus sequences pile up. However, without inspection of new sequences, e.g. via visualization of genome maps (not to mention sequence dereplication), databases may become filled with garbage sequences. Therefore, it is perhaps ideal to do some manual curation after automated detection.
With that said, please enjoy Unlimited Breadsticks.
Unlimited Breadsticks uses probabilistic models (i.e. HMMs) of virus hallmark genes to identify virus sequences from any dataset of contigs (e.g. metagenomic assemblies) or genomes (e.g. bacterial genomes). Optionally, Unlimited Breadsticks will use gene content information to remove flanking cellular chromosomes from contigs representing putative prophages. Generally, the prophage-cellular chromosome boundary will be identified within 100 nt - 2000 nt of the actual location.
+ The code is currently functional. Feel free to consume Unlimited Breadsticks at will.
+ Minor update to handle very large contig files AND update to HMM databases on June 16th, 2021
Unlimited Breadsticks is derived from Cenote-Taker 2, but several time-consuming computations are skipped in order to analyze datasets as quickly as possible. Also, Unlimited Breadsticks only takes approximately 16 minutes to download and install (Cenote-Taker 2 takes about 2 hours due to large databases required for thorough sequence annotation). See installation instructions below.
To update from older versions:
conda activate unlimited_breadsticks_env
cd Cenote_Unlimited_Breadsticks
git pull
python update_ub_databases.py --hmm True
Compared to Cenote-Taker 2, there are a few limitations.
1) Unlimited Breadsticks does not do post-hallmark-gene-identification computations to flag plasmid and conjugative element sequences that occasionally slip through.
2) Unlimited Breadsticks does not make genome maps for manual inspection of putative viruses.
3) Contigs are not extensively annotated by Unlimited Breadsticks.
TOTAL INSTALLATION SIZE IS APPROXIMATELY 7 GB
conda -V
wget https://raw.githubusercontent.com/mtisza1/Cenote_Unlimited_Breadsticks/main/install_unlimited_breadsticks.sh
bash install_unlimited_breadsticks.sh 2>&1 | tee install_unlimited_breadsticks.log
(The "2>&1 | tee install_unlimited_breadsticks.log" part isn't necessary, but it will save the installation notes/errors to a log file)
That's it!
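As an optional sanity check around installation, the sketch below confirms that `conda` is on the PATH and that the target filesystem has room for the roughly 7 GB stated above. It is not part of Unlimited Breadsticks; the function names are hypothetical and only the Python standard library is used.

```python
import shutil

# Approximate total installation size stated in this README, in GB.
REQUIRED_GB = 7


def conda_available():
    """Return True if a `conda` executable can be found on PATH."""
    return shutil.which("conda") is not None


def has_free_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    print("conda found:", conda_available())
    print("enough disk space:", has_free_space())
```

Run it from the directory where you plan to install, since `has_free_space` checks the filesystem of the path it is given.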
conda activate unlimited_breadsticks_env
python /path/to/Cenote_Unlimited_Breadsticks/unlimited_breadsticks.py
python /path/to/Cenote_Unlimited_Breadsticks/unlimited_breadsticks.py -c MY_CONTIGS.fasta -r my_contigs1_ub -m 32 -t 32 -p true -db virion
Options:
usage: unlimited_breadsticks.py [-h]
-c ORIGINAL_CONTIGS
-r RUN_TITLE
-p PROPHAGE
-m MEM
-t CPU
[--minimum_length_circular CIRC_LENGTH_CUTOFF]
[--minimum_length_linear LINEAR_LENGTH_CUTOFF]
[-db VIRUS_DOMAIN_DB]
[--lin_minimum_hallmark_genes LIN_MINIMUM_DOMAINS]
[--circ_minimum_hallmark_genes CIRC_MINIMUM_DOMAINS]
[--filter_out_plasmids FILTER_PLASMIDS]
unlimited_breadsticks is a pipeline for virus discovery and cursory annotation
of viral contigs and genomes. Visit
https://github.com/mtisza1/Cenote_Unlimited_Breadsticks for further
description
optional arguments:
-h, --help show this help message and exit
REQUIRED ARGUMENTS for unlimited_breadsticks :
-c ORIGINAL_CONTIGS, --contigs ORIGINAL_CONTIGS
Contig file with .fasta extension in fasta format - OR
- assembly graph with .fastg extension. Each header
must be unique before the first space character
-r RUN_TITLE, --run_title RUN_TITLE
Name of this run. A directory of this name will be
created. Must be unique from older runs or older run
will be renamed. Must be less than 18 characters,
using ONLY letters, numbers and underscores (_)
-p PROPHAGE, --prune_prophage PROPHAGE
True or False. -- Attempt to identify and remove
flanking chromosomal regions from non-circular contigs
with viral hallmarks (True is highly recommended for
sequenced material not enriched for viruses. Virus
enriched samples probably should be False (you might
check actual enrichment with ViromeQC). Also, please
use False if --lin_minimum_hallmark_genes is set to 0)
-m MEM, --mem MEM example: 56 -- Gigabytes of memory available for
unlimited_breadsticks. Typically, 16 to 32 should be
used. Lower memory will work in theory, but could
extend the length of the run
-t CPU, --cpu CPU Example: 32 -- Number of CPUs available for
unlimited_breadsticks. Typically, 32 CPUs should be
used. For large datasets, increased performance can be
seen up to 120 CPUs. Fewer than 16 CPUs will work in
theory, but could extend the length of the run
OPTIONAL ARGUMENTS for unlimited_breadsticks. Most of which are important to consider!!! :
--minimum_length_circular CIRC_LENGTH_CUTOFF
Default: 1000 -- Minimum length of contigs to be
checked for circularity. Absolute minimum is 1000 nts
--minimum_length_linear LINEAR_LENGTH_CUTOFF
Default: 1000 -- Minimum length of non-circular
contigs to be checked for viral hallmark genes.
-db VIRUS_DOMAIN_DB, --virus_domain_db VIRUS_DOMAIN_DB
default: virion -- 'standard' database: all virus (DNA
and RNA) hallmark genes (i.e. genes with known
function as virion structural, packaging, replication,
or maturation proteins specifically encoded by virus
genomes) with very low false discovery rate. 'virion'
database: subset of 'standard', hallmark genes
encoding virion structural proteins, packaging
proteins, or capsid maturation proteins (DNA and RNA
genomes). 'rna_virus' database: For RNA virus
hallmarks only. Includes RdRp and capsid genes of RNA
viruses. Low false discovery rate due to structural
similarity between RdRp genes and e.g. transposon-
encoded RT genes
--lin_minimum_hallmark_genes LIN_MINIMUM_DOMAINS
Default: 1 -- Number of detected viral hallmark genes
on a non-circular contig to be considered viral.
--circ_minimum_hallmark_genes CIRC_MINIMUM_DOMAINS
Default:1 -- Number of detected viral hallmark genes
on a circular contig to be considered viral.
--filter_out_plasmids FILTER_PLASMIDS
Default: True -- True - OR - False. If True, hallmark
genes of plasmids will not count toward the minimum
hallmark gene parameters. If False, hallmark genes of
plasmids will count. Plasmid hallmark gene set is not
necessarily comprehensive at this time.
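Two of the requirements in the help text above are easy to check before launching a run: the run title must be fewer than 18 characters using only letters, numbers and underscores, and each fasta header must be unique before the first space. The following is a minimal sketch of such a pre-flight check, not code from Unlimited Breadsticks itself. The function names are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re

# Assumption: "letters, numbers and underscores" means ASCII [A-Za-z0-9_].
RUN_TITLE_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_run_title(title):
    """Return True if the title has fewer than 18 allowed characters and is not empty."""
    return len(title) < 18 and RUN_TITLE_PATTERN.fullmatch(title) is not None


def duplicate_header_ids(fasta_lines):
    """Return header IDs (the text before the first space) that appear more than once.

    `fasta_lines` is any iterable of lines, e.g. an open file handle.
    Each duplicated ID is reported once, in the order it was first repeated.
    """
    seen = set()
    duplicates = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        seq_id = line[1:].rstrip("\r\n").split(" ", 1)[0]
        if seq_id in seen and seq_id not in duplicates:
            duplicates.append(seq_id)
        seen.add(seq_id)
    return duplicates


# Example usage with a real file (the file name is a placeholder):
# with open("MY_CONTIGS.fasta") as handle:
#     print(duplicate_header_ids(handle))
```

An empty list from `duplicate_header_ids` means every header is unique before its first space.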
Virus-like particle (VLP) prep assembly:
-p False -db standard
You might apply a size cutoff for linear contigs as well, e.g. --minimum_length_linear 3000 OR --minimum_length_linear 5000. Changing length minima does not affect false positive rates, but short linear contigs may not be useful, depending on your goals.
Whole genome shotgun (WGS) metagenomic assembly:
-p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2
While you should definitely prune virus sequences from WGS datasets, CheckV also does a very good job (I'm still formally comparing these approaches), and you could use --prune_prophage False and feed the unpruned contigs from Unlimited Breadsticks into checkv end_to_end if you prefer.
Bacterial reference genome:
-p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2
Using --lin_minimum_hallmark_genes 1 --virus_domain_db virion with WGS or bacterial genome data will (in my experience) yield very few sequences that appear to be false positives. However, there are lots of "degraded" prophage sequences in these sequencing sets, i.e. some/most genes of the phage have been lost. That said, a sequence with just 1 hallmark gene is not guaranteed to be a degraded phage (especially in the case of ssDNA viruses), nor are 2+ hallmark genes a guarantee of a complete phage.
RNAseq assembly of any kind (if you only want RNA viruses):
-p False -db rna_virus
If you also want DNA virus transcripts, or if your data is mixed RNA/DNA sequencing, --virus_domain_db standard is the appropriate option.
For all runs, more CPUs will make the run faster. Memory (in GB) should be about 50% or more of the CPU count.
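The recommendations above can be collected into a small helper that assembles the command line for a given dataset type. This is only a sketch: the preset names (`vlp`, `wgs`, `bacterial_genome`, `rnaseq`) and function names are hypothetical, while the flags are the ones listed in this README. Memory defaults to half the CPU count, which is the lower end of the "50% or more" guideline.

```python
import math

# Flag sets copied from the recommendations above; the dictionary keys are made-up labels.
PRESETS = {
    "vlp": ["-p", "False", "-db", "standard"],
    "wgs": ["-p", "True", "-db", "virion",
            "--minimum_length_linear", "3000",
            "--lin_minimum_hallmark_genes", "2"],
    "bacterial_genome": ["-p", "True", "-db", "virion",
                         "--minimum_length_linear", "3000",
                         "--lin_minimum_hallmark_genes", "2"],
    "rnaseq": ["-p", "False", "-db", "rna_virus"],
}


def suggested_mem_gb(cpus):
    """Return half the CPU count, rounded up (the lower bound of the guideline)."""
    return max(1, math.ceil(cpus / 2))


def build_command(script_path, contigs, run_title, dataset_type, cpus, mem_gb=None):
    """Return the argument list for one run, suitable for subprocess.run()."""
    if mem_gb is None:
        mem_gb = suggested_mem_gb(cpus)
    base = ["python", script_path,
            "-c", contigs, "-r", run_title,
            "-m", str(mem_gb), "-t", str(cpus)]
    return base + PRESETS[dataset_type]


if __name__ == "__main__":
    cmd = build_command(
        "/path/to/Cenote_Unlimited_Breadsticks/unlimited_breadsticks.py",
        "MY_CONTIGS.fasta", "my_contigs1_ub", "wgs", cpus=32,
    )
    print(" ".join(cmd))
```

Pass the returned list to `subprocess.run(cmd, check=True)` inside the activated `unlimited_breadsticks_env` environment to launch the run.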
Michael J Tisza, Anna K Belford, Guillermo Domínguez-Huerta, Benjamin Bolduc, Christopher B Buck, Cenote-Taker 2 democratizes virus discovery and sequence annotation, Virus Evolution, Volume 7, Issue 1, January 2021, veaa100, https://doi.org/10.1093/ve/veaa100