kapsakcj / nanoporeWorkflow

:dna: Shell scripts for working with bacterial isolate Nanopore sequence data on CDC servers
MIT License

nanoporeWorkflow

Shell scripts and workflows for working with Nanopore data. Submits jobs to CDC's Aspen HPC using qsub.

:warning: Don't bother reading if you aren't working on CDC's servers :warning:

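There are 2 main workflows:

  * run_basecall-w-gpu.sh - Guppy GPU basecalling, demultiplexing, trimming, and NanoPlot
  * workflow-after-gpu-basecalling.sh - Assembly with Flye and polishing with Racon and Medaka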


Install

Download the repository from the latest release (v0.5.0 is latest as of March 2021) and uncompress.

$ wget https://github.com/kapsakcj/nanoporeWorkflow/archive/v0.5.0.tar.gz 
$ tar -xzf v0.5.0.tar.gz

Optional - add the workflows to your $PATH (edit the PATH below to wherever you downloaded the repo). Refresh your environment by source'ing your .bashrc file.

# Be careful with this command - make sure the PATH is properly edited!
$ echo 'export PATH=$PATH:/path/to/nanoporeWorkflow-0.5.0/workflows' >> ~/.bashrc
$ source ~/.bashrc
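
To confirm the `$PATH` edit took effect, a small check along these lines may help. It is only a sketch: the function name `check_workflows` is made up for illustration, and it assumes the two workflow scripts are executable.

```bash
#!/usr/bin/env bash
# Sketch: report whether each workflow script can be found through $PATH.
# check_workflows is a hypothetical helper, not part of nanoporeWorkflow.
check_workflows() {
  local missing=0 script
  for script in run_basecall-w-gpu.sh workflow-after-gpu-basecalling.sh; do
    if command -v "$script" >/dev/null 2>&1; then
      echo "found: $script"
    else
      echo "missing: $script"
      missing=1
    fi
  done
  # Non-zero exit status if any script was not found
  return "$missing"
}

check_workflows || echo "Check the PATH line in ~/.bashrc, then source it again."
```

If a script is reported as missing, the path written to `~/.bashrc` most likely does not point at the `workflows` directory of the uncompressed release.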

Workflows

Guppy GPU basecalling, demultiplexing, trimming, and NanoPlot

run_basecall-w-gpu.sh - Guppy GPU basecalling, demultiplexing, and adapter/barcode trimming. Followed by NanoPlot for generating seq run stats and graphs.

Requirements

This workflow does the following:

example: /path/to/nanoporeWorkflow-0.5.0/workflows/run_basecall-w-gpu.sh -i fast5s/ -o output/ -b y -f r941 -k rapid

EXAMPLE OUTPUT (reduced for brevity)

$OUTDIR
├── demux
│   ├── barcode01
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz (there will be many .fastq.gz files per barcode)
│   ├── barcode02
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
│   ├── barcode03
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
│   ├── guppy-logs
│   │   └── guppy_basecaller_log-2020-04-17_09-45-00.log (there will be many guppy-logs)
│   ├── nanoplot
│   │   └── NanoPlot-report.html # additionally all images and other files produced by NanoPlot
│   ├── nanoplot-barcoded
│   │   └── NanoPlot-report.html # additionally all images and other files produced by NanoPlot, but for each barcode
│   ├── sequencing_summary.txt
│   ├── sequencing_telemetry.js
│   └── unclassified
│       └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
└── log # qsub logs
    ├── guppy.log
    └── nanoplot.log
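
As a quick sanity check on the demultiplexed output, something like the following could count the `.fastq.gz` files in each barcode directory. This is a sketch, not part of the workflow: the function name `count_fastq_per_barcode` is hypothetical, and the directory layout is assumed to match the example output above.

```bash
#!/usr/bin/env bash
# Sketch: print "<directory name><TAB><number of .fastq.gz files>" for every
# barcodeXX directory and for unclassified under a demux directory.
# count_fastq_per_barcode is a hypothetical helper name.
count_fastq_per_barcode() {
  local demux_dir=$1 dir n
  for dir in "$demux_dir"/barcode*/ "$demux_dir"/unclassified/; do
    # Skip patterns that matched nothing
    [ -d "$dir" ] || continue
    n=$(( $(find "$dir" -maxdepth 1 -type f -name '*.fastq.gz' | wc -l) ))
    printf '%s\t%d\n' "$(basename "$dir")" "$n"
  done
}

# Usage: pass the demux directory as the first argument, e.g. $OUTDIR/demux
if [ -n "${1:-}" ]; then
  count_fastq_per_barcode "$1"
fi
```

A barcode with a count of 0 usually means no reads were assigned to it during demultiplexing.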


### Assembly with Flye and polishing with Racon and Medaka

`workflow-after-gpu-basecalling.sh` - Assembly with Flye and polishing with Racon and Medaka

#### Requirements
  * Must have previously run the above workflow `run_basecall-w-gpu.sh`
  * Must be logged into a server with the ability to `qsub` (Aspen, Monoliths 1-3).
  * `OUTDIR` argument must be the same directory as the `OUTDIR` from the `run_basecall-w-gpu.sh` workflow

#### This workflow does the following:
  * Takes in 1 argument:
    1. `$outdir` - The output directory from running `run_basecall-w-gpu.sh`, which contains the `demux/barcodeXX/` subdirectories
  * Prepares a barcoded sample - concatenates all fastq files into one, compresses, and counts read lengths
  * Runs `filtlong` to remove reads <500bp and downsample reads to 600 Mb (roughly 120X for a 5 Mb genome)
  * Assembles downsampled/filtered reads using `flye` (`--plasmids` and `-g 5M` options used)
  * Polishes the Flye draft assembly with Racon 4 times
  * Polishes the Racon-polished assembly once with Medaka (specific to the r9.4.1 flowcell, the high accuracy basecaller model, and Guppy version 3.6.x; `--m r941_min_high_g360` option used)
  * The final, polished assembly for each barcode is written to its barcode subdirectory: `demux/barcodeXX/final.asm.barcodeXX.fasta`
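
The sample preparation step and the "600 Mb is roughly 120X for a 5 Mb genome" arithmetic can be sketched as below. This is an illustration under stated assumptions, not the workflow's actual code: the function names `prep_sample` and `estimate_depth` are hypothetical, the file names `all.fastq.gz` and `readlengths.txt.gz` are taken from the example output, and reads are assumed to be standard 4-line FASTQ records.

```bash
#!/usr/bin/env bash
# Sketch of the "prepare a barcoded sample" step (hypothetical helper names).

# Concatenate all per-run fastq.gz files of one barcode and record read lengths.
prep_sample() {
  local barcode_dir=$1
  # gzip streams can be concatenated directly into one valid gzip file
  cat "$barcode_dir"/fastq_runid_*.fastq.gz > "$barcode_dir/all.fastq.gz"
  # In 4-line FASTQ, the sequence is line 2 of every record
  gzip -dc "$barcode_dir/all.fastq.gz" \
    | awk 'NR % 4 == 2 { print length($0) }' \
    | gzip -c > "$barcode_dir/readlengths.txt.gz"
}

# Estimated depth = total bases / genome size (defaults to 5 Mb, as in -g 5M).
estimate_depth() {
  local lengths_gz=$1 genome_size=${2:-5000000}
  gzip -dc "$lengths_gz" \
    | awk -v g="$genome_size" '{ total += $1 } END { printf "%.1f\n", total / g }'
}

# Usage: pass one barcode directory, e.g. $OUTDIR/demux/barcode01
if [ -n "${1:-}" ]; then
  prep_sample "$1"
  estimate_depth "$1/readlengths.txt.gz"
fi
```

With 600,000,000 total bases and the default 5,000,000 bp genome size, `estimate_depth` reports 120.0, which is the 120X figure quoted above. Samples with less than 600 Mb of reads will be assembled at a correspondingly lower depth.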

#### USAGE
Print the help/usage statement by running `workflow-after-gpu-basecalling.sh` with no arguments or with `-h`:
```bash
# note: ensure that the outdir supplied in this command is the exact same as the outdir you
# supplied when you ran the run_basecall-w-gpu.sh script
Usage: /path/to/nanoporeWorkflow-0.5.0/workflows/workflow-after-gpu-basecalling.sh outdir/

This workflow runs the following on barcodes 01-24:

filtlong     removes reads <500bp and downsamples to 600Mb (roughly 120X for 5Mb genome)
flye         assembles reads. --plasmids and -g 5M options used
racon        polishes 4X with Racon
medaka       polishes once with Medaka using r9.4.1 pore and HAC guppy basecaller profile

# EXAMPLE OUTPUT - only showing one barcode for brevity
$OUTDIR/
├── demux
│   ├── barcode01
│   │   ├── all.fastq.gz
│   │   ├── flye
│   │   ├── final.asm.barcodeXX.fasta
│   │   ├── log  # qsub logs for each barcode
│   │   │   ├── assemble-d64ffbc5-4012-44c5-8191-1a57d4a7d15c.log
│   │   │   ├── polish-medaka-00e52c16-0bd3-460d-b955-3a532be958b1.log
│   │   │   ├── polish-racon-d7ebc124-d100-43e0-b347-1e60bbc0bf18.log
│   │   │   └── prepSample-7ecc6f51-4937-40d1-a6bd-d83e66078984.log
│   │   ├── medaka
│   │   ├── racon
│   │   ├── readlengths.txt.gz
│   │   └── reads.minlen1000.600Mb.fastq.gz
│   ├── guppy-logs
│   ├── nanoplot
│   ├── nanoplot-barcoded
│   ├── sequencing_summary.txt
│   ├── sequencing_telemetry.js
│   └── unclassified
└── log # qsub logs
    ├── guppy.log
    └── nanoplot.log
```

Notes on assembly and polishing workflow

Contributing

If you are interested in contributing to nanoporeWorkflow, please take a look at the contribution guidelines. We welcome issues or pull requests!

Future plans

Resources