SmaltAlign

A consensus calling pipeline provided in two languages:

python package
bash script

Initially, the pipeline was used to make quick alignments of fastq reads against a reference using smalt. It is now mainly used for HIV and HCV consensus generation for diagnostics.

It does the following:

1. Subsample reads with seqtk (optional)
2. Make a de novo assembly of the sampled reads with velvet
3. Align the sampled reads (and, in the first iteration only, the de novo contigs in triplicate) against the reference with smalt
4. Create a consensus with freebayes
5. Create a vcf with lofreq
6. Calculate depth with samtools
7. If the maximum number of iterations has been reached, call the final consensus sequence using the final vcf file and the given ambiguity threshold; otherwise repeat from step 3
8. cov_plot.R can be used to plot the coverage
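
The loop formed by the alignment, consensus, variant calling and depth steps can be sketched as below. This is an illustration of the control flow only, not the pipeline's code: run_pipeline, align_and_call and call_final_consensus are hypothetical names, the two callables stand in for the external tools, and the sketch assumes that the consensus of one iteration becomes the reference of the next, which the step list implies but does not state.

```python
from typing import Callable, List, Tuple


def run_pipeline(
    reference: str,
    reads: List[str],
    contigs: List[str],
    align_and_call: Callable[[str, List[str]], Tuple[str, str]],
    call_final_consensus: Callable[[str, float], str],
    ambiguity_threshold: float,
    iterations: int = 4,
) -> str:
    """Hypothetical driver: iterate alignment and calling, then call the final consensus.

    align_and_call(reference, inputs) must return (consensus, vcf).
    call_final_consensus(vcf, ambiguity_threshold) must return the final sequence.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    vcf = ""
    for iteration in range(1, iterations + 1):
        if iteration == 1:
            # De novo contigs are added in triplicate, in the first iteration only.
            inputs = reads + contigs * 3
        else:
            inputs = reads
        # Assumption: the new consensus replaces the reference for the next round.
        reference, vcf = align_and_call(reference, inputs)
    # After the last iteration, only the final vcf and the threshold are used.
    return call_final_consensus(vcf, ambiguity_threshold)
```

The default of 4 iterations mirrors the default of the -i option of the bash scripts; the ambiguity threshold has no default here because the text does not give one.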

All the necessary references are in the References directory.

Use conda environment from file

To ensure you have all the dependencies needed for SmaltAlign installed, you can use the environment.yml file.
First you need to have Conda installed.
The command conda env create -f <path>/environment.yml creates a copy of the smaltalign environment.
You enter the environment with conda activate smaltalign (and leave it with conda deactivate).
For more information, see the "Managing environments" section of the Conda documentation.

python package

To install, the classic python setup.py install or pip install . will work.

Usage

smaltalign -d <fastq_file_directory> -r <reference_file> [options]

bash script

If you would like to run the bash script but are not sure which reference sequence is closest, run smaltalign_select_ref.sh with a set of probable reference sequences, in fasta format, in <reference_file>. It chooses the closest reference sequence from the given set and constructs the consensus sequence using it.

Usage

smaltalign_select_ref.sh -r <reference_file> [options] <fastq_file/directory>

OPTIONS
-r       reference_file 
-n INT   number of reads (default 200'000)
-i INT   iterations (default 4)

If the <reference_file> contains only one reference sequence, you can run smaltalign.sh.

Usage

smaltalign.sh -r <reference_file> [options] <fastq_file/directory>

OPTIONS
-r       reference_file (only one reference sequence)
-n INT   number of reads (default 200'000)
-i INT   iterations (default 4)

batch.sh

Used to run multiple samples in the current working directory with different references in one batch.
To analyse the results of a diagnostic sequencing run, follow these steps:

create a new folder in /data/Diagnostics/experiments/ with the date of the sequencing run (start-date, yymmdd)
in that new folder create links to the .fastq files you want to analyse (ln -sv) and copy the SampleSheet.csv of that run
copy the batch.sh file into that new folder
add the filenames (you can use sampleID_to_filename.xltx) to the empty virus arrays in batch.sh, one per line (copying directly from the Excel file works)
activate the SmaltAlign environment (source activate smaltalign)
execute ./batch.sh
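
The folder setup in the first three steps could also be scripted. The following is a minimal sketch in Python, assuming the layout described above; prepare_run_folder and its parameters are hypothetical and not part of SmaltAlign. Filling the virus arrays, activating the environment and running batch.sh remain manual.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def prepare_run_folder(
    experiments_dir: PathLike,
    run_start: date,
    fastq_files: Iterable[PathLike],
    sample_sheet: PathLike,
    batch_script: PathLike,
) -> Path:
    """Create the run folder, link the fastq files and copy the helper files."""
    # Folder name is the start date of the sequencing run in yymmdd format.
    run_dir = Path(experiments_dir) / run_start.strftime("%y%m%d")
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to reuse an existing run folder

    for fastq in fastq_files:
        target = Path(fastq).resolve()
        # Equivalent of ln -s: a symbolic link with the same file name.
        (run_dir / target.name).symlink_to(target)

    # shutil.copy keeps the permission bits, so batch.sh stays executable.
    shutil.copy(sample_sheet, run_dir / "SampleSheet.csv")
    shutil.copy(batch_script, run_dir / "batch.sh")
    return run_dir
```

A call might look like prepare_run_folder("/data/Diagnostics/experiments", date(2020, 3, 5), fastq_paths, "SampleSheet.csv", "batch.sh"), where the date and fastq_paths are placeholders.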

batch_influenza.sh

This shell script was written to process Influenza sequences with SmaltAlign:

iterate over all .fastq.gz files in the current directory
create a folder for each sample containing segment1-8 subfolders
run select_ref.py (written by @ozagordi), which selects the best reference sequence for each segment from an influenza reference database (selected sequences from the NCBI Influenza Virus Database)
run smaltalign.sh for each segment using the best reference sequence
run the R scripts cov_plot.R and wts.R
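
The folder layout created in the first two steps can be illustrated with a short Python sketch. It is not taken from batch_influenza.sh: make_segment_folders is a hypothetical name, and the sketch assumes that each sample folder is named after its file without the .fastq.gz suffix and that the subfolders are called segment1 to segment8.

```python
from pathlib import Path
from typing import List, Union

SUFFIX = ".fastq.gz"


def make_segment_folders(workdir: Union[str, Path]) -> List[Path]:
    """Create <sample>/segment1 ... <sample>/segment8 for every .fastq.gz file."""
    workdir = Path(workdir)
    sample_dirs = []
    for fastq in sorted(workdir.glob("*" + SUFFIX)):
        # Assumed naming: sample folder = file name without the .fastq.gz suffix.
        sample_dir = workdir / fastq.name[: -len(SUFFIX)]
        for segment in range(1, 9):  # influenza A and B genomes have 8 segments
            (sample_dir / f"segment{segment}").mkdir(parents=True, exist_ok=True)
        sample_dirs.append(sample_dir)
    return sample_dirs
```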

Usage is the same as in batch.sh except that you don't need to enter the filenames.

wts.R

wts.R is an R script that combines consensus sequence, variants and coverage for the last iteration of all lofreq.vcf files in a directory. It saves a _x_WTS.fasta file containing the consensus sequence with wobbles (at a certain threshold x) and a .csv file containing coverage and variant frequencies for every position. The variant threshold and the minimal coverage have to be adapted manually in the first lines of the script.
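
To illustrate what a wobble at threshold x means, here is a minimal Python sketch of calling one position from its base frequencies. It is not a port of wts.R: call_base is a hypothetical name, and the sketch assumes that every base with a frequency of at least the threshold contributes to the IUPAC ambiguity code and that positions below the minimal coverage are reported as N.

```python
from typing import Dict

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_base(freqs: Dict[str, float], coverage: int,
              threshold: float, min_coverage: int) -> str:
    """Return the consensus character for one position.

    freqs maps each of A, C, G, T (those observed) to its frequency at the position.
    """
    # Assumption: too little coverage (or no data at all) gives an N.
    if coverage < min_coverage or not freqs:
        return "N"
    # Keep every base that reaches the threshold (inclusive, by assumption).
    kept = frozenset(base for base, freq in freqs.items() if freq >= threshold)
    if not kept:
        # No base reaches the threshold: fall back to the most frequent one.
        kept = frozenset([max(freqs, key=freqs.get)])
    return IUPAC[kept]
```

With a threshold of 0.15 and a minimal coverage of 10 (values chosen only for this example), a position with 80% A and 20% G at coverage 100 is called R, while 90% A and 10% G is called A.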

cov_plot.R

cov_plot.R is an R script to plot and save the coverage of all iterations of all .depth files in the working directory.
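
For a quick look at the same data without R, a .depth file can be summarised with a few lines of Python. This sketch assumes the .depth files are plain samtools depth output for a single alignment, i.e. three tab-separated columns (reference name, 1-based position, depth); read_depth and summarise are hypothetical helpers and do not replace cov_plot.R.

```python
import csv
from typing import Dict, List, Tuple


def read_depth(path: str) -> Dict[str, List[Tuple[int, int]]]:
    """Read a depth file into {reference name: [(position, depth), ...]}."""
    depth: Dict[str, List[Tuple[int, int]]] = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 3:
                continue  # skip empty or malformed lines
            name, position, value = row[0], int(row[1]), int(row[2])
            depth.setdefault(name, []).append((position, value))
    return depth


def summarise(values: List[Tuple[int, int]], min_coverage: int) -> Dict[str, float]:
    """Summarise the reported positions of one reference."""
    if not values:
        raise ValueError("no positions to summarise")
    depths = [d for _, d in values]
    return {
        "positions": len(depths),
        "mean_depth": sum(depths) / len(depths),
        "min_depth": min(depths),
        # Number of reported positions below the chosen minimal coverage.
        "below_min_coverage": sum(1 for d in depths if d < min_coverage),
    }
```

Note that samtools depth omits positions with zero coverage unless it is run with -a, so the summary only covers the positions present in the file.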

Contributions

Maryam Zaheri*
Stefan Schmutz
Osvaldo Zagordi
Michael Huber**

*maintainer ; **group leader
