A monkeypox virus (MPXV) assembly pipeline using Nextflow for host-removal, reference-based assembly, and generation of sequence quality metrics for Illumina paired-end data using a shotgun metagenomic sequencing approach.
Briefly, paired-end reads are mapped to a composite reference sequence composed of the Human hg38 and MPXV reference genome with BWA-MEM and assembled with the iVar consensus tool. The sequence quality metrics are generated with samtools and Nextclade v2.
A composite reference mapping approach was chosen as we aim to capture as many MPXV reads as possible to generate a consensus sequence while excluding ALL true human reads from the final de-hosted read dataset and consensus sequence. In this approach, human reads that weakly map to the MPXV reference or to the human reference genome are removed from the read dataset. This pipeline ensures that data can be shared and published with no risk to individual privacy. More information on how the dehosting approach was tested and the metrics found can be read in the Host Removal Report.
Note: Resource allocation is currently not optimized. We will be working to address that as soon as possible.
This pipeline is run using nextflow and has a variety of other dependencies to run the pipeline. Installation using conda is recommended.
All instructions are for a POSIX compatible system and has only been tested on Ubuntu 20.0.4+.
Install Conda
# Get conda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Create a nextflow conda environment with nextflow and mamba
conda create -n nextflow -c conda-forge -c bioconda -c defaults nextflow mamba
Optional - Setup a directory cache for conda environments
nextflow
will look for the environment before creating a new oneRun the pipeline using the conda profile which will install the other pipeline dependencies
conda activate nextflow
nextflow run phac-nml/monkeypox-nf -profile conda --cache [Path/to/EnvironmentCacheDir]
Manually install nextflow
Install ALL of the following dependencies:
Optional installation of the following dependencies:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
Standard
database to get a full report of what is remaining in your sampleVirus
database should also work but will not give any further insight into the other reads in your sampleQuick command that will run input data from start to finish with no additional options
nextflow run phac-nml/monkeypox-nf --directory [/Path/to/PAIRED_FASTQ_DIR] --human_ref [/Path/to/HUMAN_REF_FASTA_File]
nextflow run phac-nml/monkeypox-nf --help
Parameter | Description | Default | Optional |
---|---|---|---|
--directory | Path to directory containing paired fastq reads | None | No |
--human_ref | Path to fasta formatted human reference genome | None | No |
--outdir | String name to call output results directory | 'results' | Yes |
--mpx_ref | Path to fasta formatted monkeypox reference genome | 'data/NC_063383.fasta' | Yes |
--mpx_ref_id | String name of reference monkeypox contig ID to keep during host removal | 'NC_063383.1' | Yes |
--pcr_csv | Path to CSV file containing qPCR diagnostic primers for nextclade input (must follow nextclade's format) | 'data/nml_primers.csv' | Yes |
--cache | Path to cache directory to store/reuse conda environments | None | Yes |
--composite_bwa_index_dir | Path to directory containing BWA indexed composite genome to skip indexing step | None | Yes |
--metadata_csv | Path to metadata CSV with atleast a column called 'sample' to add in additional metadata (like 'date') | None | Yes |
--kraken_db | Path to directory containing Kraken2 database - Runs Kraken2 on the generated dehosted fastq files | None | Yes |
--kraken_viral_id | String Kraken2 taxonomic identifier to use for the percent viral reads calculation. Default is the Integer ID for the Viruses domain | 10239 | Yes |
--upload_config | Path to IRIDA uploader config file (example below). Also requires a metadata CSV file input with columns 'sample', 'project_id', and 'sequencing_date' | None | Yes |
General CSV - Must have "sample" column
sample | Collection Location | Collection Date | ct value |
---|---|---|---|
sample_1 | swab | 2022-06-10 | 23.12 |
sample_2 | lesion | 2022-07-11 | 26.12 |
pos_ctrl | NA | NA | 18.3 |
test_specimen | NA | NA | 22.51 |
Upload CSV - Must have "sample", "project_id", and "sequencing_date" columns
sample | Collection Location | Collection Date | ct value | project_id | sequencing_date |
---|---|---|---|---|---|
sample_1 | swab | 2022-06-10 | 23.12 | 1041 | 2022-08-01 |
sample_2 | lesion | 2022-07-11 | 26.12 | 1042 | 2022-08-01 |
pos_ctrl | NA | NA | 18.3 | 23 | 2022-08-01 |
test_specimen | NA | NA | 22.51 | 515 | 2022-07-29 |
[Settings]
client_id = upload
client_secret = TsdafnJFDS3%4fds@fiLKjwK9932hiuHF
username = process_uploader
password = WeDfIejaC19*Y..1P013
base_url = http://an-irida-upload-url.ca/irida/api/
parser = directory
samplesheet.csv
sample | read1 | read2 | gzipped |
---|---|---|---|
Sample1 | /path/to/Sample1_S38_R1_001.fastq | /path/to/Sample1_S38_R2_001.fastq | False |
Sample2 | /path/to/Sample1_S38_R1_001.fastq.gz | /path/to/Sample1_S38_R2_001.fastq.gz | True |
overall_sample_quality.csv
.
were turned into _
and all are prefixed with nc_
,
separators inside the cells were changed to ;
sample | num_reads_mapped | mean_sequencing_depth | median_sequencing_depth | num_consensus_n | genome_completeness | nc_clade | ... |
---|---|---|---|---|---|---|---|
Sample1 | 7606 | 4.47005 | 6 | 194674 | 0.0128544 | hMPXV-1 | ... |
Sample2 | 320700 | 213.525 | 192 | 423 | 0.997855 | hMPXV-1 | ... |
sample - [String] name of the sample
num_reads_mapped - [Int] number of total reads mapped, pulled from BAM file using samtools flagstats.
Reads are paired so the number of total sequence fragments can be obtained by dividing this value by 2
mean_sequencing_depth - [Float] mean genomic read depth, pulled from BAM file using samtools depth and awk
median_sequencing_depth - [Int] median genomic read depth, pulled from BAM file using samtools depth and awk
num_consensus_n - [Int] number of positions that were not basecalled in the genome, from seqtk comp
genome_completeness - [Float] proportion of the genome where a base was called. Generated from seqtk comp and awk
percent_viral_reads - [Float] percentage of input reads that were identified as viral after host removal
kraken2 viral reads * 2
------------------------- * 100
total sample reads
all_consensus directory
Sample results directories
results
├── all_consensus
│ ├── Sample1.consensus.fa
│ ├── Sample2.consensus.fa
├── overall_sample_quality.csv
├── samplesheet.csv
├── Sample1
│ ├── Sample1.composite.mapped.sorted.bam
│ ├── Sample1.composite.mapped.sorted.bam.bai
│ ├── Sample1.consensus.fa
│ ├── Sample1_dehosted_R1.fastq
│ ├── Sample1_dehosted_R2.fastq
│ ├── Sample1.MPXVfiltered-qual0.bam
│ ├── Sample1.MPXVfiltered-qual0.flagstats.txt
│ ├── Sample1.MPXVfiltered-qual30.bam
│ ├── Sample1.MPXVfiltered-qual30.flagstats.txt
│ ├── Sample1_quality.csv
│ ├── Sample1_R1_fastqc.html
│ └── Sample1_R2_fastqc.html
└── Sample2
├── Sample2.composite.mapped.sorted.bam
├── Sample2.composite.mapped.sorted.bam.bai
├── Sample2.consensus.fa
├── Sample2_dehosted_R1.fastq
├── Sample2_dehosted_R2.fastq
├── Sample2.MPXVfiltered-qual0.bam
├── Sample2.MPXVfiltered-qual0.flagstats.txt
├── Sample2.MPXVfiltered-qual30.bam
├── Sample2.MPXVfiltered-qual30.flagstats.txt
├── Sample2_quality.csv
├── Sample2_R1_fastqc.html
└── Sample2_R2_fastqc.html
Resource usage and settings are important to run efficiently without wasting time. The below sections will give the minimal and default resources utilized along with an explanation on how to set custom resources.
The slowest step will almost always be the indexing of the composite genome which will take 1-2 hours and as it can be reused, there is an input, composite_bwa_index_dir
, that can be given to skip this step once it has been done the first time.
If running with minimal resources or large input files, then the composite mapping step could/will take longer and should be the process to focus on giving proper resources to.
This pipeline can be run on the following minimum specs:
Note: Running minimal specs with paired fastq file sizes (R1+R2) greater than 5GB (about 5 million paired reads) will take a fairly long time.
See how to setup a resource config for this pipeline including the different labels that can be set below.
The default setting for the pipeline requires:
To get an approximate idea on how long different sized input files (file size is R1+R2)should take to run the slowest step, bwa mem composite mapping, consult the following chart:
To create a custom resource file and utilize it in this pipeline, you can copy the resources.config
file and modify the CPU and memory needs for default processes, medium processes (bwa-mem at the moment), and large processes (kraken2 at the moment) to whatever you wish along with potentially changing the executor or other values .
The labels available are: mediumProcess
, and largeProcess
To utilize a custom config file, you can add -c /path/to/config
to your nextflow run phac-nml/monkeypox-nf
command.
For resources:
The number of cores significantly speeds up analysis on larger files.
Generation of the composite reference is done through concatenation of the Human reference sequence with the MPXV reference sequence input.
The BWA index is then created with the command:
bwa index -a bwtsw COMPOSITE_REF
Indexing can be skipped if an index has been previously generated by using the -- /Path/to/CompositeIndexDir/
parameter. Make sure that the passed index is using the same human and MPXV reference sequence as you have input or the pipeline will not work.
Fastq quality metrics are run on each input fastq file pair with FastQC with the command:
fastqc -t 2 READ1 READ2
This runs concurrently with the generation of the composite reference or mapping if configured to run more than one process at a time.
Mapping is done with bwa mem
using the previously made composite genome fasta file and bwa index. The command is as follows:
bwa mem -t THREADS -T 30 COMPOSITE_REF READ1 READ2 | samtools sort --threads 10 -o SAMPLE_NAME.composite.mapped.sorted.bam
There are two different filter commands used in the pipeline on the mapped reads differing in the mapping quality score used to filter reads (-q 0 or -q 30). Both are used as the > 0 mapping quality reads help sort out the ends of the genome and we only keep reads with > 30 mapping quality scores as this can help sort out certain problematic SNPs.
samtools view -b BAM_FILE 'MPXV_REF_ID' -q QUAL > FILTER_BAM
Note that only the > 0 mapping quality reads BAM file (MQ0) is used in subsequent steps but it is strongly recommended to visually review the pileup of any SNVs identified between the MQ0 and MQ30 files.
Dehosted fastq files are generated from the filtered BAM file using samtools collate and fastq:
samtools collate -u -O FILTER_BAM | \
samtools fastq -1 SAMPLE_NAME_dehosted_R1.fastq -2 SAMPLE_NAME_dehosted_R2.fastq -0 /dev/null -s /dev/null -n
The final MPXV consensus sequence is made using samtools mpileup and ivar consensus:
samtools mpileup -Q 20 -a FILTER_BAM | ivar consensus -q 20 -t 0.7 -m 10 -p SAMPLE_NAME.consensus
Sequence quality metrics are generated using the composite mapping file, filtered MQ0 BAM file and the consensus fasta with seqtk, samtools, and awk. Additional metrics are added from NextClade analysis of the consensus sequences using the Human Monkeypox (hMPXV) dataset. All metrics are joined by the sample name to generate the final CSV output.
Currently three profiles available:
Profiles can be accessed with any analysis by adding in the argument -profile <NAME>
-
and not --
. Pipeline parameters are accessed with --
whereas nextflow parameters are accessed with a -
Thanks to all of the authors of the tools utilized in this pipeline. I've tried to include all relevant papers in the citations below. All tool links are available above in the Manual Installation section.
Should any issues arise when running this pipeline, please create an issue for the moment.
Copyright Government of Canada 2022
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.