Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
git clone https://github.com/rotary-genomics/rotary.git
conda env create -n rotary --file=rotary/enviroment.yaml
conda activate rotary
cd rotary
pip install --editable .
mkdir output_dir
mkdir rotary_db_dir
cd output_dir
rotary run_one -l s1_long.fastq.gz -r1 s1_R1.fastq.gz -r2 s1_R2.fastq.gz -d ../rotary_db_dir
Note: Multiple samples can also be run in batch using the rotary init
and rotary run
commands.
rotary is a snakemake pipeline that can be used to assemble single microbial genomes using standalone Nanopore data1 or hybrid Nanopore + short read data. The pipeline performs short read qc, short read decontamination, long read QC, assembly, end repair, polishing, contig rotation, and genome annotation.
miniconda
git clone https://github.com/jmtsuji/rotary.git
conda env create -n rotary --file=rotary/enviroment.yaml
conda activate rotary
cd rotary
pip install --editable .
This install takes around 5 minutes on a 8-thread laptop.
mkdir output_dir
mkdir rotary_db_dir
cd output_dir
rotary run_one -l s1_long.fastq.gz -r1 s1_R1.fastq.gz -r2 s1_R2.fastq.gz -d ../rotary_db_dir
Note: If you are using older nanopore flow cells you should stop the run, modify the config file
(see Advanced Usage below) and restart the run using the rotary run
command.
Rotary can target a directory containing numerous FASTQ files derived from various samples.
It automatically organizes these files into sets corresponding to each sample and constructs a project
directory that includes the necessary configuration files for executing the rotary run
command. The
rotary run
command can then be used to run the workflow on an entire batch of samples.
mkdir output_dir
mkdir rotary_db_dir
rotary init -d rotary_db_dir -i sample_fastq_dir -o output_dir
cd output_dir
rotary run
config.yaml
).Both rotary run_one
and rotary init
generate a config.yaml
file in the output directory. Running rotary run
in
the directory will utilize this file for run parameters. The config file can be modified to change rotary's behavior.
Very Important Parameters:
flye_input_mode
: set to "nano-hq" if your reads have <= 5% error rate; otherwise use "nano-raw"medaka_model
: set this parameter to match the flow cell version / basecalling model you used to generate the long
reads. See details in the Medaka Github repo's "Models" sectionThere are other advanced parameters that you can also edit if you'd like.
Lastly, make sure you set the threads and memory to values that make sense for your server.
To double check that everything ran correctly, I recommend to briefly check all files in the log
and stats
folders
once the run is finished.
In addition, please check the following in each sample folder:
logs/qc/qc_long.log
) -- shows how many reads were retained vs. discarded during the QC filter step (this
is technically in the logs
folder mentioned above, but I wanted to add it here again to stress that it is worth
checking)assembly/flye/[SAMPLE_ID]_assembly_info.txt
) -- see how many contigs you got and circular vs linear statusassembly/[SAMPLE_ID]_circular_info.tsv
) -- you can see if any circular contigs could not be
repaired successfully at their endspolish/cov_filter/[SAMPLE_ID]_filtered_contigs.list
) -- did anything get removed that you
wanted?circularize/identify/[SAMPLE_ID]_hmmsearch_hits.txt
and [SAMPLE_ID]_start_genes.ffn
in the same folder. Did you get the hits you expected? Were they appropriately filtered into
the .ffn
file?circularize/circlator/rotated.log
) -- was the start gene reasonably far off from the contig end
(so that the second round of polishing will actually improve things?)stats/circularize/polypolish_changes.log
) - there should be few to zero changes if everything went
smoothly. If you see more than about 20 changes, it means your genome might have some odd difficult-to-correct regions.As a simple demo, a hybrid sequencing dataset for an isolate of E. coli (strain WG1) can be run through the main assembly
portion of rotary (excluding the annotation module, which requires DBs that take a long time to download) in less than 1
hour, on a 8-thread laptop with 16 GB RAM. (The demo will need about 10 GB of storage space.)
The dataset used for the test is publicly available in NCBI BioProject PRJNA848777 and is related to the following publication:
Browning DF, Hobman, JL, Busby SJW. Laboratory strains of Escherichia coli K-12: things are seldom what they seem. Microbial Genomics 9, mgen000922 (2023). https://doi.org/10.1099%2Fmgen.0.000922
This demo dataset is not related to rotary at all, but it was a convenient choice due to the relatively compact dataset
size and because it includes both long and short read data. (Thanks to the authors for making their data publicly available!)
Because the basecaller model used for the Nanopore data was not specified, the defaults for rotary (SUP model) are used in
the test run below.
# Download the read data associated with BioProject PRJNA848777
mkdir sample_fastq_dir
wget -O sample_fastq_dir/ecoli_R1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR211/087/SRR21124987/SRR21124987_1.fastq.gz
wget -O sample_fastq_dir/ecoli_R2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR211/087/SRR21124987/SRR21124987_2.fastq.gz
# For the long read data, to get the qualtiy scores (i.e., not SRA Lite format), you will need to use the SRA Tookit
# available here: https://github.com/ncbi/sra-tools/wiki (accessed 2023.12.20). You can then download the files using:
prefetch SRR21124986 && fasterq-dump -Z SRR21124986 | gzip > sample_fastq_dir/ecoli.fastq.gz
# Otherwise, if you are OK with quality scores stripped out (not what the test was run with), you can just using the following
# command to directly output the FastQ file (this command is commented out for clarity):
# wget -O sample_fastq_dir/ecoli.fastq.gz https://www.be-md.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR21124986
# Initialize the output dir
mkdir output_dir
mkdir rotary_db_dir
rotary init -d rotary_db_dir -i sample_fastq_dir -o output_dir
# Modify the config file so that the human genome is not used for decontamination - this takes a long time and high RAM
# The final config line should be: contamination_references_ncbi_accessions: [GCF_000819615.1]
# Run rotary until just before the annotation module
cd output_dir
rotary run -s'--until circularize'
You can run this test without the -s'--until circularize'
flag if you also want to annotate the resulting genome, but
the run (particularly the install / download) will take be much longer.
After the demo run, a directory called ecoli
will be generated within the output dir. This ecoli
directory contains
the analysis files for the test sample.
Some of the key output files that will be generated in the ecoli
dir are:
ecoli/circularize/ecoli_circularize.fasta
is the final polished/rotated assembly (should be 2 contigs)ecoli/assembly/flye/ecoli_assembly_info.txt
includes details about the original assembled contigs. There should be
2 contigs, both circular, with one about 4.67 Mb in length and the other about 67.4 kb in length.ecoli/polish/cov_filter/ecoli_short_read_coverage.tsv
shows the short read coverage of all assembled contigs;
whereas ecoli/polish/cov_filter/ecoli_filtered_contigs.list
shows which contigs were ultimately retained after
cleaning out poor coverage contigs. Both contigs should be retained and should have short read coverages of about
48 (contig_1) and 83 (contig_2)ecoli/stats/polish/polypolish_changes.log
and ecoli/stats/circularize/polypolish_changes.log
show the changes
made after the 1st and 2nd rounds of short read polishing. There should only be a handful of changes after the 2nd round.rotary is currently described in the methods of a bioRxiv pre-print. Please cite this pre-print if you use rotary:
Tsuji JM, Shaw NA, Nagashima S, Venkiteswaran JJ, Schiff SL, Watanabe T, Fukui M, Hanada S, Tank M, Neufeld JD (2023). Anoxygenic phototrophic Chloroflexota member uses a Type I reaction center. bioRxiv, DOI:10.1101/2020.07.07.190934
Enjoy! I hope to continue to update/improve this pipeline (and remove some of the below caveats) over time, but for now, please feel free to use this basic working version.
1 Currently, Nanopore data generated with R9.4.1 flow cells (or earlier) requires short reads for error correction. A recent paper demonstrated that microbial genome assemblies from R10.4 flow cells no longer benefit from short read error correction compared to just using long reads (very cool!). Thus, long-read only assembly and polishing might be sufficient if you have R10.4 (or higher) data - probably should be called with a super-high accuracy Guppy model.