This Nextflow pipeline is designed to process metagenomic sequencing data, characterize overall taxonomic composition, and identify and quantify reads mapping to human-infecting viruses. It was developed as part of the Nucleic Acid Observatory project.
The pipeline currently consists of two workflows: an index workflow and a run workflow. The former creates index and reference files that are used by the latter, and is intended to be run first; afterwards, many instantiations of the run workflow can use the same index output files.
The run workflow then consists of five phases:

- A preprocessing phase, in which reads are cleaned, deduplicated, and ribodepleted;
- A taxonomic profiling phase, in which the overall taxonomic composition of the dataset is characterized with Kraken2 and Bracken;
- A viral identification phase, in which reads from human-infecting viruses are sensitively and specifically identified;
- An evaluation phase, in which putative human-viral reads are validated against BLAST;
- A QC phase, in which quality metrics are collected and summarized.
A slight simplification of the overall process is given by this mildly baffling diagram:
The index workflow is run by setting mode = "index"
in the relevant config file, and is intended to be run once (per reasonable length of time) to generate static index and reference files which can then be used by many instantiations of the run workflow. Many of these index/reference files are derived from publicly available reference genomes or other resources, and should thus be updated and re-run periodically as new versions of these become available; however, to keep results comparable across datasets analyzed with the run workflow, this should be done relatively rarely.
Key inputs to the index workflow include:

- A Kraken2 taxonomic reference database (by default, the most recent `k2_standard` database).
- Contaminant reference genomes, including the human genome and the carp genome.[^carp]

[^carp]: The carp genome is included as a last-ditch attempt to capture any remaining Illumina adapter sequences before moving on to HV identification. I'm not especially confident this is helpful.
Given these inputs, the index workflow downloads and filters the relevant reference sequences (including human-infecting viral genomes from Genbank[^genbank]) and processes them into the index and reference outputs described below.
[^genbank]: Excluding transgenic, contaminated, or erroneous sequences, which are identified via a list of sequence ID patterns specified in the config file.
For more information, see `workflows/index.nf` and the associated subworkflows and modules.
The run workflow begins by concatenating all libraries assigned to the same sample ID together[^concat], then optionally subsampling the concatenated read files down to a specified number of reads[^subsample]. The reads then undergo cleaning by Cutadapt, Trimmomatic, and FASTP, in that order; the first two of these only screen for adapter contamination, while the last both screens for adapters and trims low-quality and low-complexity sequences. Running three different adapter-trimming tools in succession is unusual, but necessary to minimize spurious identification of "human-viral" reads based on adapter contamination of the reference genomes.
[^concat]: This is controlled by the library file specified in the config file; any libraries with the same entry in the `sample` column are concatenated. This is primarily useful in cases where the same library is sequenced multiple times, e.g. to reach some total target depth.
[^subsample]: This is mainly useful for an initial run of the pipeline to confirm everything functions well for a given dataset and sample sheet; it's not recommended for analyses that are actually expected to yield meaningful results.
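As a rough illustration of the subsampling step (a sketch only, not the pipeline's actual implementation, which uses dedicated tooling), downsampling paired reads to a fixed count might look like:

```python
import random

def subsample_pairs(pairs, target, seed=0):
    """Randomly downsample a list of (forward, reverse) read tuples
    to at most `target` pairs. Illustrative sketch; the seed makes
    the subsample reproducible across reruns."""
    if len(pairs) <= target:
        return pairs
    rng = random.Random(seed)
    return rng.sample(pairs, target)
```

Keeping forward and reverse reads together as pairs ensures the subsample never orphans one mate of a pair.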
Cleaned reads then undergo deduplication with Clumpify, a tool from the BBTools suite that identifies and collapses read pairs that are identical modulo some specified error rate. (Importantly, this deduplication phase does not remove reverse-complement duplicates.[^rc]) The deduplicated reads then undergo ribodepletion with BBDuk, which searches for ribosomal k-mers based on the SILVA sequence database and removes any reads that exhibit rRNA matches above some threshold. This is performed twice in series: once with relatively strict parameters for which sequences qualify as "ribosomal" (which thus removes relatively few reads), and again with more inclusive parameters (which thus remove more reads). The output of the initial, more conservative ribodepletion step is passed to the viral identification phase, while the output of the latter, more aggressive step is passed to the taxonomic profiling phase[^ribo].
[^rc]: Unfortunately, I'm not aware of any deduplication tool that (1) works well on paired-read data, (2) can detect and remove duplicates in the presence of sequencing errors, and (3) can handle reverse-complement duplicates. If you know of one, please do let me know!

[^ribo]: The splitting of ribodepletion into these two phases arises from differences in the needs of the two downstream processes they feed into. For viral identification, we want to make sure we aren't losing any HV reads through spurious identification as ribosomal, so we use more conservative parameters. For taxonomic profiling, we're much less concerned about the status of individual reads and more concerned about accurately measuring ribosomal content, so a more aggressive approach is appropriate.
In the taxonomic profiling phase, paired-end reads output from the preprocessing phase are merged together into single sequences through a combination of BBMerge (which aligns and merges read pairs with significant overlap) and end-to-end concatenation (with an intervening "N" base, for those read pairs BBMerge is unable to merge). The resulting single sequences are passed to Kraken2 for taxonomic assignment, using the reference database obtained in the index workflow. The output of Kraken2 is then passed to Bracken for correction and summarization, and the outputs of both Kraken and Bracken are merged across samples to produce single output files.
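The concatenation fallback for unmergeable pairs can be sketched as follows (a simplified illustration, not code from the pipeline; whether the reverse read is reverse-complemented before joining is an assumption of this sketch):

```python
def revcomp(seq):
    """Reverse-complement a DNA sequence (uppercase ACGTN only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def join_unmerged(fwd, rev):
    """Join a read pair end-to-end with an intervening 'N', as a
    fallback for pairs that alignment-based merging cannot handle."""
    return fwd + "N" + revcomp(rev)
```

The intervening "N" marks the join point so that no spurious k-mers spanning the two reads are created.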
This phase is actually run three times across the pipeline: once on the cleaned reads, once on the deduplicated reads, and once on the output of the second ribodepletion step. The last of these is the primary source of data for overall assessment of the composition of the dataset; the first two are primarily used to assess how deduplication alters the measured taxonomic composition.
The goal of this phase is to sensitively and specifically identify reads arising from human-infecting viruses. It uses a multi-step process designed to avoid the false-positive and false-negative errors that arose with simpler approaches we tried previously.
Following HV status assignment, information for each read pair is processed into a TSV file available in the output directory as `hv_hits_putative_collapsed.tsv.gz`. Finally, the number of read pairs mapping to each detected HV taxon is counted and output as `hv_clade_counts.tsv.gz` for downstream use.
[^filter]: We've found in past investigations that the two aligners detect different contaminant sequences, and aligning against both is more effective at avoiding false positives than either in isolation.
[^threshold]: Specifically, Kraken-unassigned read pairs are classed as human-viral if, for either read in the pair, $S/\ln(L) \geq T$, where $S$ is the best-match Bowtie2 alignment score for that read, $L$ is the length of the read, and $T$ is the value of `params.bt2_score_threshold` specified in the config file.
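In Python, this length-normalized score check might look like the following (a sketch of the formula above, not code from the pipeline):

```python
import math

def passes_threshold(score, read_length, threshold):
    """Return True if S / ln(L) >= T for a single read, where S is the
    best-match Bowtie2 alignment score and L is the read length."""
    return score / math.log(read_length) >= threshold

def pair_is_hv(scores_and_lengths, threshold):
    """A read pair is classed as human-viral if either read passes."""
    return any(passes_threshold(s, l, threshold) for s, l in scores_and_lengths)
```

Dividing by $\ln(L)$ normalizes the alignment score so that longer reads (which can accumulate higher raw scores) are not favored over shorter ones.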
To evaluate the performance of the process described in the viral identification phase, it's useful to get some ground-truth information about whether the human-viral assignments made in that subworkflow are correct. To do this, we use BLASTN to align the putative HV reads output by the previous phase against the nt database, then process the output to check whether each sequence had a high-scoring alignment to at least one HV sequence. For computational tractability, this can be performed on only a subset of surviving HV reads (specified by `params.blast_hv_fraction`)[^blast].

[^blast]: Setting `params.blast_hv_fraction` to 0 skips this step altogether.
Currently, this subworkflow performs the BLAST alignment, filters & summarizes the output to a manageable size, and saves the result to the output directory as `blast_hits_paired.tsv.gz`. Actual assessment of HV status and performance evaluation is currently performed external to this workflow.
Each step in the preprocessing phase generates paired FASTQ files that can be analyzed with FASTQC and MultiQC. Doing so allows us to track important metrics (e.g. read quality, read numbers, adapter content) across the pipeline. This phase takes the MultiQC outputs from each phase and extracts relevant metrics into easy-to-parse TSV files[^tsvs] for downstream processing.
In addition, for each iteration of the taxonomic classification workflow, this phase takes the Bracken output data and combines it with MultiQC metrics to compute the number and fraction of input reads assigned to each of several high-level categories (where applicable), including unassigned, bacterial, archaeal, viral, and human[^eukarya] reads.
This composition information is given in a file named `taxonomic_composition.tsv.gz` in the results directory.

[^tsvs]: Specifically, `qc_basic_stats.tsv.gz`, `qc_adapter_stats.tsv.gz`, `qc_quality_base_stats.tsv.gz`, and `qc_quality_sequence_stats.tsv.gz`.
[^eukarya]: As human is the only eukaryotic genome included in the Standard reference database for Kraken2, all sequences assigned to that domain can be assigned to Homo sapiens.
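As a sketch of the composition calculation (with hypothetical category names; the real output file contains more detail), the per-category fractions can be derived from assigned-read counts like so:

```python
def compute_composition(category_counts, n_input_reads):
    """Return {category: (count, fraction_of_input)}.

    `category_counts` maps category name -> number of assigned reads;
    `n_input_reads` is the total input read count from the MultiQC
    metrics. Hypothetical illustration, not the pipeline's own code.
    """
    return {
        cat: (n, n / n_input_reads)
        for cat, n in category_counts.items()
    }
```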
If the pipeline runs to completion, the following output files are expected.
- `output/input/index_params.json`: JSON file giving all the parameters passed to the pipeline (useful for trying to reproduce someone else's results).
- `output/results/nt`: Directory containing extracted BLAST database files for BLASTing against nt.
- `output/results/bt2-hv-index`: Directory containing Bowtie2 index for human-infecting viral genomes.
- `output/results/bt2-human-index`: Directory containing Bowtie2 index for the human genome.
- `output/results/bt2-other-index`: Directory containing Bowtie2 index for other contaminant sequences.
- `output/results/bbm-human-index`: Directory containing BBMap index for the human genome.
- `output/results/bbm-other-index`: Directory containing BBMap index for other contaminant sequences.
- `output/results/kraken_db`: Directory containing Kraken2 reference database (by default, the most recent version of Standard).
- `output/results/human-viral-genomes-filtered.fasta.gz`: FASTA file containing human-viral genomes downloaded from viral Genbank (filtered to remove transgenic, contaminated, or erroneous sequences).
- `output/results/genomeid-to-taxid.json`: JSON mapping between HV taxids and NCBI genome IDs for the sequences in the preceding FASTA file.
- `output/results/ribo-ref-concat.fasta.gz`: Reference database of ribosomal LSU and SSU sequences from SILVA.
- `output/results/taxonomy-nodes.dmp`: Taxonomy dump file from NCBI mapping between taxids and their parents in the NCBI taxonomy tree structure.
- `output/results/taxonomy-names.dmp`: Taxonomy dump file from NCBI mapping between taxids and taxon names.
- `output/results/human-virus-db.tsv.gz`: Database generated from the filtered HV genomes, the genome-to-taxid mapping, and the two taxonomy dump files, giving, for each human-viral taxon:
  - Its taxonomic ID (`taxid`)
  - The taxid of its parent taxon (`parent_taxid`)
  - Its name (`name`)
  - Its taxonomic rank (`rank`)
- `output/results/total-virus-db.tsv.gz`: As `human-virus-db.tsv.gz`, but for all viral taxa (including non-human-infecting ones).
- `output/input`: Directory containing saved input information (useful for trying to reproduce someone else's results):
  - `adapters.fasta`: FASTA file of adapter sequences used for adapter screening.
  - `params.json`: JSON file giving all the parameters passed to the pipeline.
  - A copy of the sample sheet specified in the config file (`params.sample_tab`).
- `output/intermediates`: Intermediate files produced by key stages in the run workflow, saved for nonstandard downstream analysis:
  - `reads/cleaned`: Directory containing paired FASTQ files for cleaned reads (i.e. the output of the preprocessing phase described above).
- `output/results`: Directory containing processed results files for standard downstream analysis:
  - `hv/hv_hits_putative_collapsed.tsv.gz`: TSV output by the viral identification phase, giving information about each read pair assigned to a human-infecting virus.
  - `hv/hv_clade_counts.tsv.gz`: Summary of the previous file giving the number of HV read pairs mapped to each viral taxon. Includes both read pairs mapped directly to that taxon (`n_reads_direct`) and to that taxon plus all descendent taxa (`n_reads_clade`).
  - `hv/blast_hits_paired.tsv.gz`: Summarized BLASTN output for putative HV read pairs, giving, for each read pair and subject taxid:
    - The number of reads in the pair with high-scoring alignments to that taxid (`n_reads`).
    - The best bitscores of those alignments (`bitscore_max` and `bitscore_min`)[^bitscore].
  - `qc/qc_basic_stats.tsv.gz`: Summary statistics for each sample at each stage of the preprocessing phase (`stage`), including:
    - GC content (`percent_gc`);
    - Mean read length (`mean_seq_len`);
    - Number of read pairs (`n_read_pairs`);
    - Approximate number of bases (`n_bases_approx`);
    - Percent duplicates as measured by FASTQC (`percent_duplicates`).
  - `qc/qc_adapter_stats.tsv.gz`: Adapter statistics calculated by FASTQC for each sample and preprocessing stage, given as a percentage of reads containing adapter content (`pc_adapters`) at each position along the read (`position`) for each adapter detected (`adapter`) for each read in the read pair (`read_pair`).
  - `qc/qc_quality_base_stats.tsv.gz`: Per-base read-quality statistics calculated by FASTQC for each sample and preprocessing stage, given as the mean Phred score (`mean_phred_score`) at each position along the read (`position`) for each read in the read pair (`read_pair`).
  - `qc/qc_quality_sequence_stats.tsv.gz`: Per-sequence read-quality statistics calculated by FASTQC for each sample and preprocessing stage, given as the number of reads (`n_sequences`) with a given mean Phred score (`mean_phred_score`) for each read in the read pair (`read_pair`).
  - `taxonomy_final/taxonomic_composition.tsv.gz`: High-level classification of input reads for each sample by preprocessing stage and taxonomic domain, based on taxonomic analysis of the complete workflow.
  - `taxonomy_final/kraken_reports.tsv.gz`: Kraken output reports in TSV format, labeled by sample, based on taxonomic analysis of the complete workflow.
  - `taxonomy_pre_dedup/taxonomic_composition.tsv.gz`: As above, but based on taxonomic analysis of the cleaned reads (i.e. without deduplication or ribodepletion).
  - `taxonomy_pre_dedup/kraken_reports.tsv.gz`: Kraken output reports for the cleaned reads (i.e. without deduplication or ribodepletion).
  - `taxonomy_post_dedup/taxonomic_composition.tsv.gz`: As above, but based on taxonomic analysis of the deduplicated reads (i.e. without ribodepletion).
  - `taxonomy_post_dedup/kraken_reports.tsv.gz`: Kraken output reports for the deduplicated reads (i.e. without ribodepletion).

[^bitscore]: If only one read aligns to the target, these two fields will be identical. If not, they will give the higher and lower of the best bitscores for the two reads in the pair.
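As an example of consuming these outputs downstream, a minimal parser for a clade-counts-style TSV might look like this (the exact column set of the real file may differ; only the columns described above are assumed):

```python
import csv
import io

def parse_clade_counts(tsv_text):
    """Parse a TSV with taxid / n_reads_direct / n_reads_clade columns
    into a dict keyed by integer taxid. Any extra columns are ignored.
    Illustrative sketch; in practice the file is gzip-compressed and
    would be opened with gzip.open(..., "rt") instead."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        int(row["taxid"]): {
            "direct": int(row["n_reads_direct"]),
            "clade": int(row["n_reads_clade"]),
        }
        for row in reader
    }
```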
The pipeline can be run in two modes, "index" and "run", each of which calls the corresponding workflow from the `workflows` directory.
For both modes, the pipeline can be run in multiple ways by modifying various configuration variables specified in `configs/profiles.config`. Currently, three profiles are implemented, all of which assume the workflow is being launched from an AWS EC2 instance:

- The `batch` profile is the default and attempts to run the pipeline with AWS Batch. This is the quickest and most efficient way to run the pipeline, but requires significant additional setup not described in this repo. To set up AWS Batch for this pipeline, follow the instructions here (steps 1-3), then modify your config file to point `process.queue` to the name of your Batch job queue.
- The `ec2_local` profile attempts to run the whole workflow locally on your EC2 instance, storing intermediate and output files on instance-linked block storage. This is simple and can be relatively fast, but is bottlenecked by your instance's CPU and memory allocations; in particular, if you don't use an instance with very high memory, the pipeline is likely to fail when loading the Kraken2 reference DB.
- The `ec2_s3` profile runs the pipeline on your EC2 instance, but attempts to read and write files to a specified S3 directory. This avoids problems caused by insufficient storage on your EC2 instance, but (1) is significantly slower and (2) is still constrained by your instance's memory allocation.

To run the pipeline with a specified profile, run `nextflow run PATH_TO_REPO_DIR -profile PROFILE_NAME -resume`. Calling the pipeline without specifying a profile will run the `batch` profile by default. Future example commands in this README will assume you are using Batch; if you want to use a different profile instead, you'll need to modify the commands accordingly.
> [!TIP]
> It's highly recommended that you always run `nextflow run` with the `-resume` option enabled. It doesn't do any harm if you haven't run a workflow before, and getting into the habit will help you avoid much sadness when you want to resume a workflow without rerunning all your jobs.
To run this workflow with full functionality, you need access to the following dependencies: Nextflow, Docker, the AWS CLI, and Git.
To run the workflow using AWS S3 for the working and output directories, you first need to configure AWS access. To do this, you need to create a file at `~/.aws/config` or `~/.aws/credentials` specifying your access key ID and secret access key, e.g.

`~/.aws/config`:

```
[default]
region = us-east-1
output = table
tcp_keepalive = true
aws_access_key_id = <ACCESS_KEY_ID>
aws_secret_access_key = <SECRET_ACCESS_KEY>
```

`~/.aws/credentials`:

```
[default]
aws_access_key_id = <ACCESS_KEY_ID>
aws_secret_access_key = <SECRET_ACCESS_KEY>
```
If you encounter AccessDenied errors after doing this, you may also need to export these keys as environment variables before running Nextflow:
```
eval "$(aws configure export-credentials --format env)"
```
Next, you need to make sure your user is configured to use Docker. To do this, create the `docker` user group and add your current user to it:

```
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```

Then confirm Docker is working by running a test container:

```
docker run hello-world
```
Clone this repo into a new directory as normal.
> [!TIP]
> If someone else in your organization already uses this pipeline, it's likely they've already run the index workflow and generated an output directory. If this is the case, you can reduce costs and increase reproducibility by using theirs instead of generating your own. If you want to do this, skip this step, and edit `configs/run.config` such that `params.ref_dir` points to `INDEX_DIR/output/results`.
Create a new directory outside the repo directory and copy over the index workflow config file as `nextflow.config` in that directory:

```
mkdir index
cd index
cp REPO_DIR/configs/index.config nextflow.config
```
Next, edit `nextflow.config` such that `params.base_dir` points to the directory (likely on S3) where you want to store your index files. (You shouldn't need to modify anything else about the config file at this stage.)
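For example, the edited parameter might look like this (the bucket path is a placeholder for your own S3 location):

```
params {
    // Placeholder path; replace with your own S3 bucket/prefix
    base_dir = "s3://your-bucket/mgs-index"
}
```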
Next, call `nextflow run` pointing at the repo directory (NB: not `main.nf` or any other workflow file; pointing to the directory will cause Nextflow to automatically run `main.nf` from that directory):

```
nextflow run PATH_TO_REPO_DIR -resume
```
Wait for the workflow to run to completion; this is likely to take several hours at least.
To confirm that the pipeline works in your hands, we provide a small test dataset (`test/raw`) to run through the run workflow. This can be used to test any of the pipeline profiles described above.
If your EC2 instance has the resources to handle it, the simplest way to start using the pipeline is to run the test data through it locally on that instance (i.e. without using S3). To do this:

1. Navigate to the `test` directory.
2. Edit `nextflow.config` to set `params.ref_dir` to the index directory you chose or created above (specifically `PATH_TO_REF_DIR/output/results`).
3. Still within the `test` directory, run `nextflow run -profile ec2_local .. -resume`.
4. Inspect the `output` directory to view the processed output files.

If this is successful, the next level of complexity is to run the workflow with a working directory on S3. To do this:
1. Within the `test` directory, edit `nextflow.config` to set `params.base_dir` to the S3 directory of your choice.
2. Run `nextflow run -profile ec2_s3 .. -resume`.

Finally, you can run the test dataset through the pipeline on AWS Batch. To do this, configure Batch as described here (steps 1-3), then:
1. Within the `test` directory, edit `nextflow.config` to set `params.base_dir` to a different S3 directory of your choice and `process.queue` to the name of your Batch job queue.
2. Run `nextflow run -profile batch .. -resume` (or simply `nextflow run .. -resume`).

To run the workflow on another dataset, you need:
- A directory containing the raw data files (`params.raw_dir`).
- A base directory in which to store working and output files (`params.base_dir`).
- A path to the outputs of the index workflow (`params.ref_dir`).
- A library file (`params.library_tab`) and sample sheet (`params.sample_tab`).

> [!NOTE]
> Currently, the pipeline requires the following of raw data files:
> - They must be contained in a single directory.
> - Each pair of read files must have names ending in `_1.fastq.gz` and `_2.fastq.gz`, respectively.
> - Each pair of read files must be uniquely identifiable by a filename substring (specified in the `library` column of `params.library_tab`).

> [!NOTE]
> The library file should specify the mapping between library read files and sample IDs. It must be a CSV file with `library` and `sample` columns, as well as any other metadata columns you feel are appropriate. See `test/libraries.csv` for a minimal example.
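For illustration, a minimal library file might look like the following (hypothetical library and sample names; `test/libraries.csv` is the canonical example). Here the first two libraries share a `sample` entry, so their reads would be concatenated during preprocessing:

```
library,sample
lib1_run1,sample1
lib1_run2,sample1
lib2_run1,sample2
```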
If running on Batch, a good process for starting the pipeline on a new dataset is as follows:

1. Create a new launch directory and copy `configs/run.config` to a file named `nextflow.config` in that directory.
2. Edit `nextflow.config` to specify each item in `params` as appropriate, as well as setting `process.queue` to the appropriate Batch queue.
3. Run `nextflow run PATH_TO_REPO_DIR -resume`.
4. Navigate to `{params.base_dir}/output` to view and download output files.

When attempting to run a released version of the pipeline, the most common sources of errors are AWS permission issues. Before debugging a persistent error in depth, make sure that you have all the permissions specified in Step 0 of our Batch workflow guide. Next, make sure Nextflow has access to your AWS credentials, such as by running `eval "$(aws configure export-credentials --format env)"`.
Another common issue is for processes to fail with some variation of the following Docker-related error:
```
docker: failed to register layer: write /usr/lib/jvm/java-11-openjdk-amd64/lib/modules: no space left on device.
```
This is a fairly well-known problem, which can arise even when there is substantial free storage space accessible to your EC2 instance. Following the steps recommended here or here typically resolves the issue, either by deleting Docker assets to free up space (e.g. via `docker system prune --all --force`) or by giving Docker more space.
We will add more common failure modes here as they get reported.