phac-nml / mpox-nf

A monkeypox virus (MPXV) assembly pipeline using Nextflow for host-removal, reference-based assembly, and generation of sequence quality metrics for Illumina paired-end data using a shotgun metagenomic sequencing approach.
Apache License 2.0
3 stars 4 forks source link

monkeypox-nf

A monkeypox virus (MPXV) assembly pipeline using Nextflow for host-removal, reference-based assembly, and generation of sequence quality metrics for Illumina paired-end data using a shotgun metagenomic sequencing approach.

Briefly, paired-end reads are mapped to a composite reference sequence composed of the Human hg38 and MPXV reference genome with BWA-MEM and assembled with the iVar consensus tool. The sequence quality metrics are generated with samtools and Nextclade v2.

A composite reference mapping approach was chosen as we aim to capture as many MPXV reads as possible to generate a consensus sequence while excluding ALL true human reads from the final de-hosted read dataset and consensus sequence. In this approach, human reads that weakly map to the MPXV reference or to the human reference genome are removed from the read dataset. This pipeline ensures that data can be shared and published with no risk to individual privacy. More information on how the dehosting approach was tested and the metrics found can be read in the Host Removal Report.

Note: Resource allocation is currently not optimized. We will be working to address that as soon as possible.


Index


Installation and Required Data

This pipeline is run using nextflow and has a variety of other dependencies to run the pipeline. Installation using conda is recommended.

All instructions are for a POSIX compatible system and has only been tested on Ubuntu 20.0.4+.

Installation Options:

Installation through Conda:

  1. Install Conda

    • Conda is an open source package management system and environment management system.
      
      # Get conda
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

    Install - follow instructions given

    bash Miniconda3-latest-Linux-x86_64.sh

  2. Create a nextflow conda environment with nextflow and mamba

    • Mamba is an open source reimplementation of the conda package manager in C++ with some improvements that should make environment creation quicker when creating the nextflow conda environment
      conda create -n nextflow -c conda-forge -c bioconda -c defaults nextflow mamba
  3. Optional - Setup a directory cache for conda environments

    • To avoid needing to make a new environment each time the pipeline is called, a directory can be set as a cache where nextflow will look for the environment before creating a new one
  4. Run the pipeline using the conda profile which will install the other pipeline dependencies

    conda activate nextflow
    nextflow run phac-nml/monkeypox-nf -profile conda --cache [Path/to/EnvironmentCacheDir]

Manual Installation if not using conda:

  1. Manually install nextflow

  2. Install ALL of the following dependencies:

  3. Optional installation of the following dependencies:

Required Data

Optional Data


Usage

Quick Start

Quick command that will run input data from start to finish with no additional options

General:

nextflow run phac-nml/monkeypox-nf --directory [/Path/to/PAIRED_FASTQ_DIR] --human_ref [/Path/to/HUMAN_REF_FASTA_File] 

Help Command and Exit:

nextflow run phac-nml/monkeypox-nf --help 

Inputs

Parameters

Parameter Description Default Optional
--directory Path to directory containing paired fastq reads None No
--human_ref Path to fasta formatted human reference genome None No
--outdir String name to call output results directory 'results' Yes
--mpx_ref Path to fasta formatted monkeypox reference genome 'data/NC_063383.fasta' Yes
--mpx_ref_id String name of reference monkeypox contig ID to keep during host removal 'NC_063383.1' Yes
--pcr_csv Path to CSV file containing qPCR diagnostic primers for nextclade input (must follow nextclade's format) 'data/nml_primers.csv' Yes
--cache Path to cache directory to store/reuse conda environments None Yes
--composite_bwa_index_dir Path to directory containing BWA indexed composite genome to skip indexing step None Yes
--metadata_csv Path to metadata CSV with atleast a column called 'sample' to add in additional metadata (like 'date') None Yes
--kraken_db Path to directory containing Kraken2 database - Runs Kraken2 on the generated dehosted fastq files None Yes
--kraken_viral_id String Kraken2 taxonomic identifier to use for the percent viral reads calculation. Default is the Integer ID for the Viruses domain 10239 Yes
--upload_config Path to IRIDA uploader config file (example below). Also requires a metadata CSV file input with columns 'sample', 'project_id', and 'sequencing_date' None Yes

Example Metadata CSV

  1. General CSV - Must have "sample" column

    sample Collection Location Collection Date ct value
    sample_1 swab 2022-06-10 23.12
    sample_2 lesion 2022-07-11 26.12
    pos_ctrl NA NA 18.3
    test_specimen NA NA 22.51
  2. Upload CSV - Must have "sample", "project_id", and "sequencing_date" columns

    sample Collection Location Collection Date ct value project_id sequencing_date
    sample_1 swab 2022-06-10 23.12 1041 2022-08-01
    sample_2 lesion 2022-07-11 26.12 1042 2022-08-01
    pos_ctrl NA NA 18.3 23 2022-08-01
    test_specimen NA NA 22.51 515 2022-07-29

Example IRIDA Upload Config

[Settings]
client_id = upload
client_secret = TsdafnJFDS3%4fds@fiLKjwK9932hiuHF
username = process_uploader
password = WeDfIejaC19*Y..1P013
base_url = http://an-irida-upload-url.ca/irida/api/
parser = directory

Outputs

Output FIles:

  1. samplesheet.csv

    • File containing the input names and file paths
    sample read1 read2 gzipped
    Sample1 /path/to/Sample1_S38_R1_001.fastq /path/to/Sample1_S38_R2_001.fastq False
    Sample2 /path/to/Sample1_S38_R1_001.fastq.gz /path/to/Sample1_S38_R2_001.fastq.gz True
  2. overall_sample_quality.csv

    • File containing combined sequencing metrics for each sample along with its nextclade output
    • Nextclade output information is available here. Some changes were made to colum headers and cell values including:
      • Column headers were renamed such that . were turned into _ and all are prefixed with nc_
      • , separators inside the cells were changed to ;
    sample num_reads_mapped mean_sequencing_depth median_sequencing_depth num_consensus_n genome_completeness nc_clade ...
    Sample1 7606 4.47005 6 194674 0.0128544 hMPXV-1 ...
    Sample2 320700 213.525 192 423 0.997855 hMPXV-1 ...
    • Non-nextclade columns are as follows:
      sample                  - [String] name of the sample
      num_reads_mapped        - [Int] number of total reads mapped, pulled from BAM file using samtools flagstats.
                              Reads are paired so the number of total sequence fragments can be obtained by dividing this value by 2
      mean_sequencing_depth   - [Float] mean genomic read depth, pulled from BAM file using samtools depth and awk
      median_sequencing_depth - [Int] median genomic read depth, pulled from BAM file using samtools depth and awk
      num_consensus_n         - [Int] number of positions that were not basecalled in the genome, from seqtk comp
      genome_completeness     - [Float] proportion of the genome where a base was called. Generated from seqtk comp and awk
      percent_viral_reads     - [Float] percentage of input reads that were identified as viral after host removal
                              kraken2 viral reads * 2
                             ------------------------- * 100
                                 total sample reads
  3. all_consensus directory

    • Directory containing just the output consensus sequences in fasta format
  4. Sample results directories

    • All samples run create subdirectories in the output results folder which contain the intermediate files used in the analysis

Example Output Directory Structure:

results
├── all_consensus
│   ├── Sample1.consensus.fa
│   ├── Sample2.consensus.fa
├── overall_sample_quality.csv
├── samplesheet.csv
├── Sample1
│   ├── Sample1.composite.mapped.sorted.bam
│   ├── Sample1.composite.mapped.sorted.bam.bai
│   ├── Sample1.consensus.fa
│   ├── Sample1_dehosted_R1.fastq
│   ├── Sample1_dehosted_R2.fastq
│   ├── Sample1.MPXVfiltered-qual0.bam
│   ├── Sample1.MPXVfiltered-qual0.flagstats.txt
│   ├── Sample1.MPXVfiltered-qual30.bam
│   ├── Sample1.MPXVfiltered-qual30.flagstats.txt
│   ├── Sample1_quality.csv
│   ├── Sample1_R1_fastqc.html
│   └── Sample1_R2_fastqc.html
└── Sample2
    ├── Sample2.composite.mapped.sorted.bam
    ├── Sample2.composite.mapped.sorted.bam.bai
    ├── Sample2.consensus.fa
    ├── Sample2_dehosted_R1.fastq
    ├── Sample2_dehosted_R2.fastq
    ├── Sample2.MPXVfiltered-qual0.bam
    ├── Sample2.MPXVfiltered-qual0.flagstats.txt
    ├── Sample2.MPXVfiltered-qual30.bam
    ├── Sample2.MPXVfiltered-qual30.flagstats.txt
    ├── Sample2_quality.csv
    ├── Sample2_R1_fastqc.html
    └── Sample2_R2_fastqc.html

Resources Needed

Resource usage and settings are important to run efficiently without wasting time. The below sections will give the minimal and default resources utilized along with an explanation on how to set custom resources.

The slowest step will almost always be the indexing of the composite genome which will take 1-2 hours and as it can be reused, there is an input, composite_bwa_index_dir, that can be given to skip this step once it has been done the first time.

If running with minimal resources or large input files, then the composite mapping step could/will take longer and should be the process to focus on giving proper resources to.

Minimum (but Not Recommended):

This pipeline can be run on the following minimum specs:

Note: Running minimal specs with paired fastq file sizes (R1+R2) greater than 5GB (about 5 million paired reads) will take a fairly long time.

See how to setup a resource config for this pipeline including the different labels that can be set below.

Minimum Recommended/Provided Resource Config:

The default setting for the pipeline requires:

To get an approximate idea on how long different sized input files (file size is R1+R2)should take to run the slowest step, bwa mem composite mapping, consult the following chart: approx_time_for_filesize

Setting Resource Config:

To create a custom resource file and utilize it in this pipeline, you can copy the resources.config file and modify the CPU and memory needs for default processes, medium processes (bwa-mem at the moment), and large processes (kraken2 at the moment) to whatever you wish along with potentially changing the executor or other values .

The labels available are: mediumProcess, and largeProcess

To utilize a custom config file, you can add -c /path/to/config to your nextflow run phac-nml/monkeypox-nf command.

For resources:

The number of cores significantly speeds up analysis on larger files.


Full Pipeline Process Details

1. Generate and Index Composite Reference

Generation of the composite reference is done through concatenation of the Human reference sequence with the MPXV reference sequence input.

The BWA index is then created with the command:

bwa index -a bwtsw COMPOSITE_REF

Indexing can be skipped if an index has been previously generated by using the -- /Path/to/CompositeIndexDir/ parameter. Make sure that the passed index is using the same human and MPXV reference sequence as you have input or the pipeline will not work.

2. Initial Fastq Quality Metrics

Fastq quality metrics are run on each input fastq file pair with FastQC with the command:

fastqc -t 2 READ1 READ2

This runs concurrently with the generation of the composite reference or mapping if configured to run more than one process at a time.

3. Map to Composite Reference

Mapping is done with bwa mem using the previously made composite genome fasta file and bwa index. The command is as follows:

bwa mem -t THREADS -T 30 COMPOSITE_REF READ1 READ2 | samtools sort --threads 10 -o SAMPLE_NAME.composite.mapped.sorted.bam

4. Filter BAM

There are two different filter commands used in the pipeline on the mapped reads differing in the mapping quality score used to filter reads (-q 0 or -q 30). Both are used as the > 0 mapping quality reads help sort out the ends of the genome and we only keep reads with > 30 mapping quality scores as this can help sort out certain problematic SNPs.

samtools view -b BAM_FILE 'MPXV_REF_ID' -q QUAL > FILTER_BAM

Note that only the > 0 mapping quality reads BAM file (MQ0) is used in subsequent steps but it is strongly recommended to visually review the pileup of any SNVs identified between the MQ0 and MQ30 files.

5. Generate Dehosted Fastq Files

Dehosted fastq files are generated from the filtered BAM file using samtools collate and fastq:

samtools collate -u -O FILTER_BAM | \
    samtools fastq -1 SAMPLE_NAME_dehosted_R1.fastq -2 SAMPLE_NAME_dehosted_R2.fastq -0 /dev/null -s /dev/null -n

6. Assemble MPXV Consensus Sequence

The final MPXV consensus sequence is made using samtools mpileup and ivar consensus:

samtools mpileup -Q 20 -a FILTER_BAM | ivar consensus -q 20 -t 0.7 -m 10 -p SAMPLE_NAME.consensus

7. Generate Sequence Quality Metrics

Sequence quality metrics are generated using the composite mapping file, filtered MQ0 BAM file and the consensus fasta with seqtk, samtools, and awk. Additional metrics are added from NextClade analysis of the consensus sequences using the Human Monkeypox (hMPXV) dataset. All metrics are joined by the sample name to generate the final CSV output.


Profiles Available

Currently three profiles available:

Profiles can be accessed with any analysis by adding in the argument -profile <NAME>


Authors and Acknowledgment

Thanks to all of the authors of the tools utilized in this pipeline. I've tried to include all relevant papers in the citations below. All tool links are available above in the Manual Installation section.

Citations

  1. Aksamentov, I., Roemer, C., Hodcroft, E. B., & Neher, R. A., (2021). Nextclade: clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software, 6(67), 3773, https://doi.org/10.21105/joss.03773
  2. Grubaugh, N.D., Gangavarapu, K., Quick, J. et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol 20, 8 (2019). https://doi.org/10.1186/s13059-018-1618-7
  3. Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]
  4. Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008
  5. Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0

Support

Should any issues arise when running this pipeline, please create an issue for the moment.

Legal

Copyright Government of Canada 2022

Written by: National Microbiology Laboratory, Public Health Agency of Canada

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.