Discover and annotate the virome.
Works on your laptop or HPC (compatible with macOS and Linux).
Cenote-Taker 3 is a virus bioinformatics tool that scales from individual genome sequences to massive metagenome assemblies to:
1) Identify sequences containing genes specific to viruses (virus hallmark genes)
2) Annotate virus sequences including:
---a) adaptive ORF calling
---b) a large catalog of HMMs from virus gene families for functional annotation
---c) Hierarchical taxonomy assignment based on hallmark genes
---d) mmseqs2-based CDD database search
---e) tabular (.tsv) and interactive genome map (.gbf) outputs
Cenote-Taker 3 is also fast: many times faster than Cenote-Taker 2 on large datasets, and faster than comparable annotation with pharokka while assigning functions to more virus genes (in my hands).
[Image: example genome map]
Use cases:
1) Discovering virus contigs in metagenomic data
2) Annotating virus sequences that lack a highly similar, well-annotated reference
3) Finding prophages (or proviruses) in microbial genomes
Limitations:
1) Not for read-level classification of known viruses (see Marker-MAGu or EsViritu for this task)
2) Not ideal for annotating virus genomes that are highly similar to known references (e.g. phage lambda with a few mutations)
Most recent versions
Cenote-Taker 3 scripts: v3.3.2
Cenote-Taker 3 Databases: v3.1.1
This should work on MacOS and Linux
Versions used in test installations
mamba 1.5.8
conda 24.7.1
mamba is better/faster than conda for almost all solving/installation tasks.
1) Use mamba to install the bioconda package.
macOS (specify the osx-64 platform regardless of which chip you have):
mamba create --platform osx-64 -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2
Linux:
mamba create -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2
2) Activate the conda environment.
conda activate ct3_env
You should now be able to type cenotetaker3 and get_ct3_dbs in the terminal to bring up their help menus.
3) Change to a directory where you'd like to install the databases and run the database script, specifying the DB directory with -o. The total DB size is 3.0 GB after decompression.
cd ..
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T
4) Set the database directory as a conda environment variable, then reactivate the environment so the change takes effect.
conda env config vars set CENOTE_DBS=/path/to/ct3_DBs
Alternatively, install from source:
1) Clone this GitHub repo.
2) Using mamba (a package manager within the conda ecosystem) and the provided YAML file, create the environment:
mamba env create -f Cenote-Taker3/environment/ct3_env.yaml
3) Activate the conda environment.
conda activate ct3_env
4) Change to the repo and pip install the command line tool.
cd Cenote-Taker3
pip install .
You should now be able to type cenotetaker3 and get_ct3_dbs in the terminal to bring up their help menus.
5) Change to a directory where you'd like to install the databases and run the database script, specifying the DB directory with -o. The total DB size is 3.0 GB after decompression.
cd ..
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T
6) Set the database directory as a conda environment variable, then reactivate the environment so the change takes effect.
conda env config vars set CENOTE_DBS=/path/to/ct3_DBs
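Variables set with `conda env config vars set` are only visible after `conda deactivate` followed by `conda activate ct3_env`, so a quick check before the first run can save a failed job. The sketch below is a hypothetical helper (`check_ct3_dbs` is not part of Cenote-Taker 3). It only tests that `CENOTE_DBS` is set and points to an existing directory, and assumes nothing about the files inside.

```shell
# Hypothetical helper, not shipped with Cenote-Taker 3.
# Returns 0 if CENOTE_DBS is set and points to an existing directory.
check_ct3_dbs() {
    if [ -z "${CENOTE_DBS:-}" ]; then
        echo "CENOTE_DBS is not set; try: conda deactivate, then conda activate ct3_env" >&2
        return 1
    fi
    if [ ! -d "$CENOTE_DBS" ]; then
        echo "CENOTE_DBS points to a missing directory: $CENOTE_DBS" >&2
        return 1
    fi
    echo "CENOTE_DBS OK: $CENOTE_DBS"
}
```

After reactivating the environment, running `check_ct3_dbs` should print the database path.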
Make sure conda environment is activated
cenotetaker3 -h
cenotetaker3 -c Cenote-Taker3/test_data/testcontigs_DNA_ct2.fasta -r test_ct3 -p T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T --lin_minimum_hallmark_genes 2
Use prodigal for ORF calling (prodigal-gv is default):
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3pr -p T --caller prodigal
cenotetaker3 -c my_virus_contigs.fna -r my_virs_ct3 -p F -am T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T -db virion rdrp dnarep
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T --reads my_reads/*fastq
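To apply the same settings to several assemblies, the basic discovery command can be wrapped in a loop. This is a minimal sketch, assuming contig files end in `.fna` and that each file's base name is an acceptable run title. `ct3_batch` and the `CT3_DRY_RUN` variable are hypothetical names, and only the `-c`, `-r`, and `-p` flags from the commands above are used.

```shell
# Hypothetical wrapper: run cenotetaker3 on every .fna file in a directory.
# With CT3_DRY_RUN=1 the commands are printed instead of executed.
ct3_batch() {
    indir=$1
    for contigs in "$indir"/*.fna; do
        [ -e "$contigs" ] || continue        # skip if the glob matched nothing
        run_title=$(basename "$contigs" .fna)_ct3
        if [ "${CT3_DRY_RUN:-0}" = "1" ]; then
            echo "cenotetaker3 -c $contigs -r $run_title -p T"
        else
            cenotetaker3 -c "$contigs" -r "$run_title" -p T || return 1
        fi
    done
}
```

Running `CT3_DRY_RUN=1 ct3_batch my_assemblies` first shows which commands would be launched. Unset the variable to run them for real.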
{run_title}/
│   {run_title}_virus_summary.tsv                    <- main summary file for each virus
│   {run_title}_virus_sequences.fna                  <- all virus genome seqs
│   {run_title}_virus_AA.faa                         <- all virus AA seqs
│   {run_title}_prune_summary.tsv                    <- summary of pruning of each sequence
│   final_genes_to_contigs_annotation_summary.tsv    <- annotation info, all genes
│   run_arguments.txt                                <- arguments used in this run
│   {run_title}_cenotetaker.log                      <- main log file
│
└───sequin_and_genome_maps/
│   │   {run_title}*gbf    <- genome maps
│   │   {run_title}*fsa    <- genome sequence
│   │   {run_title}*gtf    <- feature table gtf format
│   │   {run_title}*tbl    <- feature table sequin format
│   │   {run_title}*sqn    <- non-human-readable sequin file for GenBank sub
│   │   {run_title}*cmt    <- sequin comment file
│
└───ct_processing/
│   --- many intermediate files ---
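After a run finishes, it can help to confirm that the top-level files listed above are present and to count the virus sequences recovered. The following is a rough sketch using a hypothetical helper (`ct3_check_outputs` is not part of Cenote-Taker 3). It takes the file names directly from the layout above and does not assume that every file is written for every run, so a "missing" line is a prompt to check the log rather than proof of failure.

```shell
# Hypothetical helper: report which top-level outputs from the layout above exist.
# Usage: ct3_check_outputs RUN_TITLE [PARENT_DIR]
ct3_check_outputs() {
    run_title=$1
    outdir=${2:-.}/$run_title
    missing=0
    for f in \
        "${run_title}_virus_summary.tsv" \
        "${run_title}_virus_sequences.fna" \
        "${run_title}_virus_AA.faa" \
        "${run_title}_prune_summary.tsv" \
        "final_genes_to_contigs_annotation_summary.tsv" \
        "run_arguments.txt" \
        "${run_title}_cenotetaker.log"
    do
        if [ -e "$outdir/$f" ]; then
            echo "found    $f"
        else
            echo "missing  $f"
            missing=$((missing + 1))
        fi
    done
    # Each FASTA record starts with '>', so this counts virus sequences.
    if [ -e "$outdir/${run_title}_virus_sequences.fna" ]; then
        n=$(grep -c '^>' "$outdir/${run_title}_virus_sequences.fna" || true)
        echo "virus sequences: $n"
    fi
    # Exit status is 0 only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

For example, `ct3_check_outputs test_ct3` checks the test run in the current directory.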
CheckV for virus genome completeness estimation.
BACPHLIP for phage lifestyle prediction (only use complete/near-complete phage genomes).
VContact3 for genome clustering and taxonomy.
iPHoP for prokaryotic virus host prediction.
Cenote-Taker 3 is under active development, so please open an issue if anything seems unusual or any errors occur. I have likely not tested every parameter combination, and most bugs should be simple to fix.