ViReflow is a tool for constructing elastically-scaling parallelized automated AWS pipelines for viral consensus sequence generation. Given sequence data from a viral sample as well as information about the reference genome and primers, ViReflow generates a Reflow file that contains all steps of the workflow, including AWS instance specifications. Because ViReflow is intended to be used with Reflow, the workflows that are developed by ViReflow automatically distribute independent tasks to be run in parallel as well as elastically scale AWS instances based on each individual step of the workflow. ViReflow makes use of compact minimal Docker images for each step of the viral analysis workflow, details about which can be found in the Niema-Docker GitHub organization.
| Read Trimmers | Read Mappers | Variant Callers | Optional Analyses |
|---|---|---|---|
| fastp | Bowtie2 | FreeBayes | coronaSPAdes |
| iVar Trim | BWA-MEM | iVar Variants | MEGAHIT |
| PRINSEQ | Minimap2 | LoFreq | metaviralSPAdes |
| pTrimmer | | | minia |
| | | | Pangolin (COVID-19) |
| | | | rnaviralSPAdes |
| | | | VirStrain |
| | | | π Diversity Metric |
ViReflow is written in Python 3. You can simply download ViReflow.py to your machine and make it executable:
wget "https://raw.githubusercontent.com/niemasd/ViReflow/master/ViReflow.py"
chmod a+x ViReflow.py
sudo mv ViReflow.py /usr/local/bin/ViReflow.py # optional step to install globally
While ViReflow itself only depends on Python 3, the pipelines it produces are Reflow files that run on AWS. Thus, in order to run the pipelines ViReflow produces, one must first install Reflow.
ViReflow can be used as follows:
usage: ViReflow.py [-o OUTPUT_RF] -d DESTINATION -rf REFERENCE_FASTA -rg REFERENCE_GFF -p PRIMER_BED [OPTIONAL ARGS] FASTQ1 [FASTQ2 ...]
For extensive details about each command line argument, see the Command Line Argument Descriptions section of the ViReflow wiki.
Using the provided demo files, ViReflow can be executed as follows:
ViReflow.py -o demo.rf `# output Reflow run file` \
-d s3://my-s3-folder/vireflow-demo `# output S3 folder` \
-rf https://github.com/niemasd/ViReflow/raw/main/demo/NC_045512.2.fas `# reference genome (FASTA)` \
-rg https://github.com/niemasd/ViReflow/raw/main/demo/NC_045512.2.gff3 `# reference genome annotation (GFF3)` \
-p https://github.com/niemasd/ViReflow/raw/main/demo/sarscov2_v2_primers_swift.bed `# primer coordinates file (BED)` \
https://github.com/niemasd/ViReflow/raw/main/demo/test_R1.fastq `# FASTQ 1` \
https://github.com/niemasd/ViReflow/raw/main/demo/test_R2.fastq `# FASTQ 2`
This will result in the creation of a file called demo.rf, which is the Reflow workflow file. Assuming Reflow is properly installed and configured, the workflow can now be run as follows:
reflow run demo.rf
In a given sequencing experiment, if you have multiple samples you want to run (e.g. sample1, sample2, ..., sampleN), you can use ViReflow to process all of them in parallel (assuming your AWS account has access to spin up sufficient EC2 instances). First, you need to use ViReflow to produce a Reflow run file (.rf) for each sample:
for s in sample1 sample2 [REST_OF_SAMPLES] sampleN ; do ViReflow.py -id $s -o $s.rf [REST_OF_VIREFLOW_ARGS] ; done
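If the sample list comes from another program, the same loop can be assembled in Python. The following is a minimal sketch, not part of ViReflow: `build_vireflow_command` is a hypothetical helper, and the destination, reference, primer, and FASTQ paths are placeholders. It only builds the argument lists from flags shown in this README (`-id`, `-o`, `-d`, `-rf`, `-rg`, `-p`) and prints them without running anything.

```python
import shlex


def build_vireflow_command(sample_id, shared_args, fastqs):
    """Assemble one ViReflow.py invocation for a single sample (hypothetical helper)."""
    # FASTQ files are positional and go last, as in the usage line
    return ["ViReflow.py", "-id", sample_id, "-o", f"{sample_id}.rf", *shared_args, *fastqs]


# Placeholder values: replace with your own destination, reference, and primer files
shared_args = [
    "-d", "s3://my-s3-folder/vireflow-demo",
    "-rf", "reference.fas",
    "-rg", "reference.gff3",
    "-p", "primers.bed",
]

# Placeholder mapping of sample ID to its FASTQ files
samples = {
    "sample1": ["sample1_R1.fastq", "sample1_R2.fastq"],
    "sample2": ["sample2_R1.fastq", "sample2_R2.fastq"],
}

commands = [build_vireflow_command(s, shared_args, fq) for s, fq in samples.items()]
for cmd in commands:
    # shlex.join renders the argument list as a copy-pasteable shell command
    print(shlex.join(cmd))
```

To actually execute each command rather than print it, one option is to pass each list to `subprocess.run(cmd, check=True)`.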
Alternatively, you can create a CSV file in which the first column contains the run ID and all remaining columns denote the FASTQ files, in the following format:
sample1 | sample1_R1.fastq | s3://my_samples/sample1_R2.fastq |
sample2 | sample2_R1.fastq | s3://my_samples/sample2_R2.fastq |
... | ... | ... |
You can then run ViReflow as follows to generate the Reflow files for all runs:
ViReflow.py [VIREFLOW_ARGS] my_samples.csv
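A sample sheet in this layout can be generated programmatically. The sketch below assumes a plain comma-delimited file with no header row (the README only shows the column order), and `write_sample_csv` is a hypothetical helper rather than something shipped with ViReflow.

```python
import csv
import io


def write_sample_csv(samples, handle):
    """Write one row per run: the run ID first, then that run's FASTQ files."""
    writer = csv.writer(handle, lineterminator="\n")
    for run_id, fastqs in samples.items():
        writer.writerow([run_id, *fastqs])


# Placeholder runs mirroring the example rows above
samples = {
    "sample1": ["sample1_R1.fastq", "s3://my_samples/sample1_R2.fastq"],
    "sample2": ["sample2_R1.fastq", "s3://my_samples/sample2_R2.fastq"],
}

# Write to an in-memory buffer here; pass an open file handle to write to disk
buffer = io.StringIO()
write_sample_csv(samples, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

To produce my_samples.csv on disk, open the file with `open("my_samples.csv", "w", newline="")` and pass the handle to the same function.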
Then, you can use the rf_batch.py script to create a batch Reflow run file that will execute all of the individual sample Reflow run files:
rf_batch.py -o batch_samples.rf sample1.rf sample2.rf [REST_OF_SAMPLES].rf sampleN.rf
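With many samples, typing every .rf file by hand is error-prone. As a rough sketch, the argument list for rf_batch.py can be built from a collection of file names; `build_batch_command` is a hypothetical helper, and it uses only the `-o` flag and positional inputs shown above. It skips the batch output file itself so that a rerun does not feed the batch file back in as an input.

```python
import shlex
from pathlib import Path


def build_batch_command(rf_files, batch_output="batch_samples.rf"):
    """Assemble an rf_batch.py invocation from per-sample .rf files (hypothetical helper)."""
    batch_name = Path(batch_output).name
    # Exclude the batch file itself and sort for a reproducible order
    inputs = sorted(str(p) for p in rf_files if Path(p).name != batch_name)
    return ["rf_batch.py", "-o", batch_output, *inputs]


# Placeholder file names; a stale batch file is included to show it gets skipped
cmd = build_batch_command(["sample2.rf", "sample1.rf", "batch_samples.rf"])
print(shlex.join(cmd))
```

In a real working directory, the input list could come from `Path(".").glob("*.rf")` instead of a hard-coded list.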
Now, you can simply run Reflow on the newly-created batch_samples.rf, and it will automatically execute all of the individual sample Reflow run files:
reflow run batch_samples.rf
If you use ViReflow in your work, please cite:
Moshiri N, Fisch KM, Birmingham A, DeHoff P, Yeo GW, Jepsen K, Laurent LC, Knight R (2022). "The ViReflow pipeline enables user friendly large scale viral consensus genome reconstruction." Scientific Reports. 12:5077. doi:10.1038/s41598-022-09035-w
Please also cite the mapper, trimmer, variant caller, and optional analysis tool(s) you used in your ViReflow run(s).