rpetit3 / dragonflye

:dragon: :fly: Assemble bacterial isolate genomes from Nanopore reads
GNU General Public License v3.0

NOTE: This is under active development; any feedback will be very useful.

dragonflye

A Quick Note

If you've worked with bacterial sequences, in all likelihood you have used one of Torsten Seemann's tools. One such tool is Shovill, which takes the bacterial genome assembly process and makes it quick and painless. Shovill was developed for paired-end Illumina reads, and there is a fork, shovill-se, which supports single-end reads.

Given the widespread usage of Shovill, and Torsten basically laying much of the groundwork, I decided to use Shovill as a framework for Dragonflye. Dragonflye can be considered a fork of Shovill that supports assembling Oxford Nanopore sequences. By going this route users will not have to relearn parameters, and will already be familiar with the outputs.

At this point, you might be wondering: so Robert you just hacked Shovill to work with ONT reads, why not just call it 'shovill-ont'?

That's because when I asked if there was interest in a "Shovill" for ONT reads, Curtis Kapsak (@kapsakcj) responded:

Curtis Kapsak (@kapsakcj): if wrapping flye , perhaps call it dragonflye (a very fast flye)?.

And honestly, how could I not go with that?!? It's an amazing play on words that I'm willing to bet Torsten would be proud of!

So to sum it up, thank you Torsten for Shovill and providing a framework for Dragonflye.

Introduction

Dragonflye is a pipeline that aims to make assembling Oxford Nanopore reads quick and easy. Still working on the quick part, but I think the easy part is there. Dragonflye currently supports Flye, Miniasm and Raven assemblers, and Racon and Medaka polishers.

Main Steps

  1. Estimate genome size and read length from reads (unless --gsize provided) (kmc)
  2. Filter reads by length (default --minreadlen 1000) (Nanoq)
  3. Reduce FASTQ files to a sensible depth (default --depth 150) (rasusa)
  4. Remove adapters (requires --trim be given) (Porechop)
  5. Assemble with Flye, Miniasm, or Raven
  6. Polish assembly with Racon and/or Medaka
  7. Polish assembly with short reads via Polypolish and/or Pilon
  8. Remove contigs that are too short, too low coverage, or pure homopolymers
  9. Produce final FASTA with nicer names and parsable annotations
  10. Reorient contigs from final FASTA using dnaapler
  11. Output parsable assembly statistics (assembly-scan)

Quick Start

dragonflye --reads my-ont.fastq.gz --outdir dragonflye --gsize 5000000
... LOG TEXT ...
[dragonflye] Final assembly contigs: /home/robert_petit/repos/dragonflye/temp/dragonflye/contigs.fa
[dragonflye] It contains 3 (min=4864) contigs totalling 4939840 bp.
[dragonflye] Dragonfly fossils have been found with wingspans up to two feet (61cm)!
[dragonflye] Done.

ls dragonflye/
contigs.fa  contigs.gfa  dragonflye.log  flye-info.txt  flye.fasta

head -n4 dragonflye/contigs.fa
>contig00001 len=2753792 origname=Utg1024_LN:i:2753792_RC:i:486_XO:i:0 polish=none sw=dragonflye-raven/1.2.0 date=20231031
TTCTATTTATCAGTATCATTACTTTTATATTATCGATAATTAATCCGAACATATCATTAA
TCAAGTTATTATTCGAAGTGGTTTTGCTGCATTTGGAACAGTCGGGTTAAGTATGAACCT
TACCACAGAAGATAATAATGGTATTACTAAAATAATTATTATATTCGTTATGCTTTGCGG

head -n4 dragonflye/contigs.reoriented.fa
>contig00001 len=2753792 origname=Utg1024_LN:i:2753792_RC:i:486_XO:i:0 polish=none sw=dragonflye-raven/1.2.0 date=20231031 rotated=True
ATGTCGGAAAAAGAAATTTGGGAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTG
TAAGTTACTCAACTTTCCTAAAAGATGACGAGGCTTTACACGATTAAAGATGGTGAAGCT
ATCGTATTATCGAGTATTCCTTTTAATGCAAATTGGTTAAATCAACAATATGCTGAAATT
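
The contig headers shown above are space-separated key=value fields, so they can be pulled apart with standard tools. The following is a minimal sketch; `header_field` is a hypothetical helper written for this example and is not part of Dragonflye.

```shell
# Header copied from the Quick Start output above.
header='>contig00001 len=2753792 origname=Utg1024_LN:i:2753792_RC:i:486_XO:i:0 polish=none sw=dragonflye-raven/1.2.0 date=20231031'

# Hypothetical helper: print the value of one key=value field from a header line.
# $1 = header line, $2 = key to look up
header_field() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2) }'
}

header_field "$header" len      # prints 2753792
header_field "$header" polish   # prints none
```

To apply it to a real assembly, you could feed it each line matched by `grep '^>' dragonflye/contigs.fa`.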

Installation

Dragonflye is available from Bioconda. Dragonflye includes a lot of programs, so it can take conda a while to solve the environment. I personally use Mamba to install it, as it is much faster.

# With conda
conda create -n dragonflye -c conda-forge -c bioconda dragonflye

# With Mamba (much quicker)
mamba create -n dragonflye -c conda-forge -c bioconda dragonflye

Usage

Dragonflye - A very fast flye

SYNOPSIS
  De novo assembly pipeline for bacterial isolates with Nanopore reads
USAGE
  dragonflye [options] --outdir DIR --reads READS.fastq.gz
GENERAL
  --help          This help
  --version       Print version and exit
  --check         Check dependencies are installed
  --seed N        Random seed to use (default: 42)
INPUT
  --reads XXX     Input Nanopore FASTQ (default: '')
  --depth N       Sub-sample --reads to this depth. Disable with --depth 0 (default: 150)
  --minreadlen N  Minimum read length. Disable with --minreadlen 0 (default: 1000)
  --gsize XXX     Estimated genome size eg. 3.2M <blank=AUTODETECT> (default: '')
OUTPUT
  --outdir XXX    Output folder (default: '')
  --prefix XXX    Prefix to use for final assembly FASTA (default: 'contigs')
  --force         Force overwrite of existing output folder (default: OFF)
  --minlen N      Minimum contig length <0=AUTO> (default: 500)
  --mincov n.nn   Minimum contig coverage <0=AUTO> (default: 2)
  --namefmt XXX   Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
  --keepfiles     Keep intermediate files (default: OFF)
RESOURCES
  --tmpdir XXX    Fast temporary directory (default: '')
  --cpus N        Number of CPUs to use (0=ALL) (default: 8)
  --ram n.nn      Try to keep RAM usage below this many GB (default: 16)
ASSEMBLER
  --assembler XXX Assembler: raven miniasm flye (default: 'flye')
  --opts XXX      Extra assembler options in quotes eg. flye: '--iterations' (default: '')
  --nanohq        For Flye, use '--nano-hq' instead of --nano-raw (default: OFF)
POLISHER
  --racon N       Number of polishing rounds to conduct with Racon (default: 1)
  --medaka N      Number of polishing rounds to conduct with Medaka (requires --model) (default: 0)
  --model XXX     The model to be used by Medaka, (Assumes 1 polishing round, if --medaka not used) (default: '')
  --list_models   List the models available to Medaka (default: OFF)
SHORT-READ POLISHER
  --polypolish N  Number of polishing rounds to conduct with Polypolish (requires --R1 and --R2) (default: 1)
  --polypolish_careful Polypolish will ignore any reads with multiple alignments (default: OFF)
  --pilon N       Number of polishing rounds to conduct with Pilon (requires --R1 and --R2) (default: 0)
  --R1 XXX        Read 1 FASTQ to use for polishing (default: '')
  --R2 XXX        Read 2 FASTQ to use for polishing (default: '')
REORIENT
  --noreorient    Disable contig reorientation using dnaapler (default: OFF)
  --dnaapler_mode XXX The mode of reorientation to execute (default: 'all')
  --dnaapler_opts XXX Extra dnaapler options in quotes eg. '--evalue 1e-5' (default: '')
MODULES
  --trim          Enable adaptor trimming (default: OFF)
  --trimopts XXX  Extra porechop options in quotes eg. '--adapter_threshold 80' (default: '')
  --nofilter      Disable read length filtering (default: OFF)
  --nopolish      Disable assembly polishing (default: OFF)
HOMEPAGE
  https://github.com/rpetit3/dragonflye - Robert A Petit III

--depth

Giving an assembler too much data is a bad thing. There comes a point where you are no longer adding new information (as the genome is a fixed size), and only adding more noise (sequencing errors). Because of this Dragonflye will downsample your FASTQ files to a specific depth (defaults to 150x). It estimates depth by dividing read yield by genome size.
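
The depth arithmetic described above can be sketched in a few lines of shell. Both helpers are hypothetical and only illustrate the calculation; they are not how Dragonflye or rasusa choose which reads to keep.

```shell
# Hypothetical helper: estimated depth = read yield / genome size.
# $1 = total bases in the reads (read yield), $2 = genome size in bp
estimate_depth() {
    awk -v yield="$1" -v gsize="$2" 'BEGIN { printf "%.1f\n", yield / gsize }'
}

# Hypothetical helper: rough fraction of the data kept to reach a target depth.
# $1 = estimated depth, $2 = target depth (the --depth value)
keep_fraction() {
    awk -v depth="$1" -v target="$2" 'BEGIN {
        f = target / depth
        if (f > 1) f = 1          # already at or below the target: keep everything
        printf "%.2f\n", f
    }'
}

estimate_depth 1500000000 5000000   # 1.5 Gbp of reads on a 5 Mbp genome: prints 300.0
keep_fraction 300 150               # prints 0.50
```

In this made-up example, a 300x read set would be cut roughly in half to reach the default --depth 150.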

--gsize

The genome size is needed to estimate depth and for the assembly stage. If you don't provide --gsize, it will be estimated via k-mer frequencies using kmc. It doesn't need to be a perfect estimate, just in the right ballpark. If you know the genome size, providing it is usually better than the estimate and will save some time.
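
The usage text gives `3.2M` as an example --gsize value. If you need the same number in base pairs for your own scripts, something like the sketch below would do it. The `gsize_to_bp` helper and its K/M/G suffix handling are assumptions made for this example, not Dragonflye's own parser.

```shell
# Hypothetical helper: convert a genome size such as 3.2M into base pairs.
# Assumes an optional trailing K, M or G suffix (upper or lower case).
gsize_to_bp() {
    awk -v s="$1" 'BEGIN {
        n = s + 0                                   # leading numeric part, e.g. 3.2
        u = toupper(substr(s, length(s), 1))        # last character, e.g. M
        if (u == "K") n *= 1e3
        else if (u == "M") n *= 1e6
        else if (u == "G") n *= 1e9
        printf "%.0f\n", n
    }'
}

gsize_to_bp 3.2M      # prints 3200000
gsize_to_bp 5000000   # prints 5000000
```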

--keepfiles

This will keep all the intermediate files in --outdir so you can explore and debug.

--cpus

The built-in default is 8 CPUs. Use --cpus 0 to use all available CPU cores.

--ram

Dragonflye will do its best to keep memory usage below this value, but it is not guaranteed. If you are on an HPC cluster, you should request more memory than this from your job submission engine.

--assembler

By default it will use Flye.

--opts

If you want to provide some assembler-specific parameters you can use the --opts parameter. Make sure you quote the parameters so they get passed as a single string, e.g. for --assembler flye you might use --opts "--iterations 4 --plasmids".
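
To see why the quotes matter, here is a small shell sketch. `count_args` is a hypothetical stand-in that only reports how many arguments the shell hands to a program; it does not call Dragonflye.

```shell
# Hypothetical stand-in for any program: print the number of arguments received.
count_args() {
    echo "$#"
}

# Quoted: the assembler options arrive as ONE value following --opts.
count_args --opts "--iterations 4 --plasmids"   # prints 2

# Unquoted: the shell splits them, so they arrive as separate arguments
# and would no longer all belong to --opts.
count_args --opts --iterations 4 --plasmids     # prints 4
```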

--racon & --medaka

These two parameters adjust how many polishing rounds are conducted per-polisher. For example, --racon 2 would conduct 2 rounds of polishing with Racon. If --medaka is provided, a model must also be provided with --model.

--model

A valid basecaller model must be provided with --model. If a valid model is provided, but --medaka was not provided it will assume --medaka 1.
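
The interaction between --medaka and --model described above can be written out as a small decision function. This is only a sketch of the rule as stated in this README. `medaka_rounds` is a hypothetical helper, `some_model` is a placeholder rather than a real model name, and the error message is invented for the example.

```shell
# Hypothetical helper mirroring the rule described above; not Dragonflye's own code.
# $1 = value of --medaka (0 when not given), $2 = value of --model ('' when not given)
medaka_rounds() {
    rounds="$1"
    model="$2"
    if [ -z "$model" ]; then
        if [ "$rounds" -gt 0 ]; then
            echo "error: --medaka requires --model" >&2
            return 1
        fi
        echo 0                  # no model, no Medaka polishing
    elif [ "$rounds" -eq 0 ]; then
        echo 1                  # a model without --medaka implies one round
    else
        echo "$rounds"
    fi
}

medaka_rounds 0 some_model   # prints 1
medaka_rounds 2 some_model   # prints 2
medaka_rounds 0 ""           # prints 0
```

Use --list_models to find a model name that Medaka actually accepts.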

--list_models

This will list all basecaller models that are available in Medaka.

--polypolish & --pilon & --R1 & --R2

If Illumina short-reads are provided, polishing will be done with Polypolish and/or Pilon. The value of --polypolish (Default 1) is the number of polishing rounds that will be conducted. By default Pilon is turned off.

Choosing which stages to use

| Stage                  | Enable      | Disable     |
|------------------------|-------------|-------------|
| Genome size estimation | default     | --gsize INT |
| Read subsampling       | --depth INT | --depth 0   |
| Read length filtering  | default     | --nofilter  |
| Adapter Trimming       | --trim      | default     |

Environment variables recognised

These env-vars will be used as defaults instead of the built-in defaults. You can still override them with the usual command-line options.

| Variable              | Option      | Default |
|-----------------------|-------------|---------|
| $DRAGONFLYE_CPUS      | --cpus      | 8       |
| $DRAGONFLYE_RAM       | --ram       | 16      |
| $DRAGONFLYE_ASSEMBLER | --assembler | flye    |
| $TMPDIR               | --tmpdir    | /tmp    |
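
The precedence described above (command line first, then environment variable, then built-in default) can be sketched as follows. `resolve_option` is a hypothetical helper for illustration only and does not reflect how Dragonflye reads its options internally.

```shell
# Hypothetical helper: pick a value in the order
# command line > environment variable > built-in default.
# $1 = command-line value ('' if not given)
# $2 = environment variable value ('' if unset)
# $3 = built-in default
resolve_option() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    else
        echo "$3"
    fi
}

resolve_option "" "" 8     # nothing set: prints the built-in default, 8
resolve_option "" 16 8     # DRAGONFLYE_CPUS=16 in the environment: prints 16
resolve_option 4 16 8      # --cpus 4 on the command line wins: prints 4
```

In practice this means you could put something like `export DRAGONFLYE_CPUS=16` in your shell profile and still pass --cpus for a one-off run.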

Output Files

| Filename                     | Description                                                 |
|------------------------------|-------------------------------------------------------------|
| contigs.fa                   | The final assembly you should use                           |
| contigs.reoriented.fa        | If available, a reorientation of the final assembly         |
| contigs.dnaapler.summary.tsv | If available, a summary description of reoriented contigs   |
| contigs.gfa                  | Assembly graph                                              |
| dragonflye.log               | Full log file for bug reporting                             |
| flye.fasta                   | Raw assembly (flye)                                         |
| flye-info.txt                | Information about contigs output by Flye                    |
| miniasm.fasta                | Raw assembly (miniasm)                                      |
| raven.fasta                  | Raw assembly (raven)                                        |

FAQ

Feedback

Please file questions, bugs or ideas on the Issue Tracker.

Acknowledgements

I would like to personally extend my many thanks and gratitude to the authors of these software packages. Really, thank you very much!

Software Included (19)

Author

Funding

Support for this project came from the Wyoming Public Health Laboratory.
