rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements.
BSD 3-Clause "New" or "Revised" License
3 stars 1 forks source link

rotary

rotary-logo-full

GitHub release DOI

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements

Quick start

Install

git clone https://github.com/rotary-genomics/rotary.git

conda env create -n rotary --file=rotary/enviroment.yaml

conda activate rotary

cd rotary

pip install --editable .

Run One Sample

mkdir output_dir
mkdir rotary_db_dir

cd output_dir

rotary run_one -l s1_long.fastq.gz -r1 s1_R1.fastq.gz -r2 s1_R2.fastq.gz -d ../rotary_db_dir

Note: Multiple samples can also be run in batch using the rotary init and rotary run commands.

Description

rotary is a snakemake pipeline that can be used to assemble single microbial genomes using standalone Nanopore data1 or hybrid Nanopore + short read data. The pipeline performs short read qc, short read decontamination, long read QC, assembly, end repair, polishing, contig rotation, and genome annotation.

Some advantages of using rotary:

Requirements

Usage

1. Install

git clone https://github.com/jmtsuji/rotary.git
conda env create -n rotary --file=rotary/enviroment.yaml
conda activate rotary
cd rotary
pip install --editable .

This install takes around 5 minutes on a 8-thread laptop.

2a. Run One Sample

mkdir output_dir
mkdir rotary_db_dir

cd output_dir

rotary run_one -l s1_long.fastq.gz -r1 s1_R1.fastq.gz -r2 s1_R2.fastq.gz -d ../rotary_db_dir

Note: If you are using older nanopore flow cells you should stop the run, modify the config file (see Advanced Usage below) and restart the run using the rotary run command.

2b. Run Multiple Samples

Rotary can target a directory containing numerous FASTQ files derived from various samples. It automatically organizes these files into sets corresponding to each sample and constructs a project directory that includes the necessary configuration files for executing the rotary run command. The rotary run command can then be used to run the workflow on an entire batch of samples.

mkdir output_dir
mkdir rotary_db_dir

rotary init -d rotary_db_dir -i sample_fastq_dir -o output_dir

cd output_dir

rotary run

Advanced Usage

Modifying the config (YAML) file (config.yaml).

Both rotary run_one and rotary init generate a config.yaml file in the output directory. Running rotary run in the directory will utilize this file for run parameters. The config file can be modified to change rotary's behavior.

Very Important Parameters:

There are other advanced parameters that you can also edit if you'd like.

Lastly, make sure you set the threads and memory to values that make sense for your server.

Post-run tips

To double check that everything ran correctly, I recommend to briefly check all files in the log and stats folders once the run is finished.

In addition, please check the following in each sample folder:

Demo dataset

As a simple demo, a hybrid sequencing dataset for an isolate of E. coli (strain WG1) can be run through the main assembly
portion of rotary (excluding the annotation module, which requires DBs that take a long time to download) in less than 1 hour, on a 8-thread laptop with 16 GB RAM. (The demo will need about 10 GB of storage space.)

The dataset used for the test is publicly available in NCBI BioProject PRJNA848777 and is related to the following publication:

Browning DF, Hobman, JL, Busby SJW. Laboratory strains of Escherichia coli K-12: things are seldom what they seem. Microbial Genomics 9, mgen000922 (2023). https://doi.org/10.1099%2Fmgen.0.000922

This demo dataset is not related to rotary at all, but it was a convenient choice due to the relatively compact dataset
size and because it includes both long and short read data. (Thanks to the authors for making their data publicly available!) Because the basecaller model used for the Nanopore data was not specified, the defaults for rotary (SUP model) are used in the test run below.

Demo code

# Download the read data associated with BioProject PRJNA848777
mkdir sample_fastq_dir
wget -O sample_fastq_dir/ecoli_R1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR211/087/SRR21124987/SRR21124987_1.fastq.gz
wget -O sample_fastq_dir/ecoli_R2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR211/087/SRR21124987/SRR21124987_2.fastq.gz

# For the long read data, to get the qualtiy scores (i.e., not SRA Lite format), you will need to use the SRA Tookit
#   available here: https://github.com/ncbi/sra-tools/wiki (accessed 2023.12.20). You can then download the files using:
prefetch SRR21124986 && fasterq-dump -Z SRR21124986 | gzip > sample_fastq_dir/ecoli.fastq.gz
# Otherwise, if you are OK with quality scores stripped out (not what the test was run with), you can just using the following 
#  command to directly output the FastQ file (this command is commented out for clarity):
# wget -O sample_fastq_dir/ecoli.fastq.gz https://www.be-md.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR21124986

# Initialize the output dir
mkdir output_dir
mkdir rotary_db_dir
rotary init -d rotary_db_dir -i sample_fastq_dir -o output_dir
# Modify the config file so that the human genome is not used for decontamination - this takes a long time and high RAM
# The final config line should be: contamination_references_ncbi_accessions: [GCF_000819615.1]

# Run rotary until just before the annotation module
cd output_dir
rotary run -s'--until circularize'

You can run this test without the -s'--until circularize' flag if you also want to annotate the resulting genome, but the run (particularly the install / download) will take be much longer.

Demo outputs

After the demo run, a directory called ecoli will be generated within the output dir. This ecoli directory contains the analysis files for the test sample.

Some of the key output files that will be generated in the ecoli dir are:

Citation

rotary is currently described in the methods of a bioRxiv pre-print. Please cite this pre-print if you use rotary:

Tsuji JM, Shaw NA, Nagashima S, Venkiteswaran JJ, Schiff SL, Watanabe T, Fukui M, Hanada S, Tank M, Neufeld JD (2023). Anoxygenic phototrophic Chloroflexota member uses a Type I reaction center. bioRxiv, DOI:10.1101/2020.07.07.190934

Final comments

Enjoy! I hope to continue to update/improve this pipeline (and remove some of the below caveats) over time, but for now, please feel free to use this basic working version.

Appendix

rotary workflow summary

  1. Perform short read QC (e.g., score, length, and adapter trimming) using bbduk.sh from bbmap
  2. Remove short read contamination using bbduk.sh from bbmap
  3. Perform simple long read QC using reformat.sh from bbmap
  4. Assemble long reads using Flye
  5. Perform end repair of any circular contigs (using a heavily modified version of the circlator workflow)
  6. Polish contigs using long reads via medaka
  7. If short reads are provided, polishes using Polypolish and POLCA
  8. Filters resulting contigs by a user-provided coverage threshold
  9. Rotates any circular contigs to start at a marker gene of your choice (dnaA by default), with help from circlator
  10. Performs one more round of polishing on circular contigs, either using Polypolish (if short reads were provided) or medaka (if only long reads were provided)
  11. Gene prediction via DFAST
  12. Functional and taxonomic annotation via EggNOG-mapper and GTDB-Tk

Known issues

Future to-do's and ideas

Footnotes

1 Currently, Nanopore data generated with R9.4.1 flow cells (or earlier) requires short reads for error correction. A recent paper demonstrated that microbial genome assemblies from R10.4 flow cells no longer benefit from short read error correction compared to just using long reads (very cool!). Thus, long-read only assembly and polishing might be sufficient if you have R10.4 (or higher) data - probably should be called with a super-high accuracy Guppy model.