medvir / SmaltAlign

Quick iterative alignment of reads against a given reference using smalt.
MIT License



A consensus calling pipeline provided in two languages: as a Python package and as a bash script.

Initially, the pipeline was used to make quick alignments of fastq reads against a reference using smalt. Now it is mainly used for HIV and HCV consensus generation for diagnostics.

It does the following:

  1. Subsample reads with seqtk (optional)
  2. Make de novo assembly of sampled reads with velvet
  3. Align sampled reads (and, in the first iteration only, the de novo contigs in triplicate) against the reference with smalt
  4. Create consensus with freebayes
  5. Create vcf with lofreq
  6. Calculate depth with samtools
  7. If the maximum number of iterations has been reached, call the final consensus sequence using the final vcf file and the given ambiguity threshold; otherwise repeat from step 3
  8. cov_plot.R can be used to plot the coverage
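
The control flow of the steps above can be sketched in Python as follows. This is only an illustration of the order in which the steps run, not the code of the pipeline itself: the function name plan_pipeline and the step labels are hypothetical, and no external tool is called.

```python
from typing import List


def plan_pipeline(iterations: int = 4, subsample: bool = True) -> List[str]:
    """Return the ordered step labels for one run (illustrative only).

    The default of 4 iterations mirrors the documented default of -i.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")

    steps: List[str] = []
    if subsample:
        # Step 1 is optional
        steps.append("subsample reads (seqtk)")
    # Step 2 runs once, before the iterations start
    steps.append("build de novo contigs (velvet)")

    for i in range(1, iterations + 1):
        if i == 1:
            # Only the first iteration also aligns the de novo contigs
            steps.append(f"iteration {i}: align reads and de novo contigs (smalt)")
        else:
            steps.append(f"iteration {i}: align reads (smalt)")
        steps.append(f"iteration {i}: create consensus (freebayes)")
        steps.append(f"iteration {i}: create vcf (lofreq)")
        steps.append(f"iteration {i}: calculate depth (samtools)")

    # After the last iteration the final consensus is called
    steps.append("call final consensus from final vcf with ambiguity threshold")
    return steps
```

Calling plan_pipeline(2) lists the subsampling and de novo steps once, then four steps per iteration, then the final consensus call.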

All the necessary references are in the References directory.

Use conda environment from file

To ensure you have all dependencies needed for SmaltAlign installed you can use the environment.yml file.
First, you need to have Conda installed.
With the command conda env create -f <path>/environment.yml you will create a copy of the smaltalign environment.
You enter the environment with the command conda activate smaltalign (and leave it with conda deactivate).
For more information, see the Conda documentation on managing environments.

python package

To install the package, the classic Python install or pip install . (run from the repository root) will work.

smaltalign -d <fastq_file_directory> -r <reference_file> [options] 
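
If the command is to be started from a Python script, for example once per run directory, the call can be assembled as below. This is a minimal sketch: the helper name build_smaltalign_command is hypothetical, only the -d and -r flags shown above are used, and it assumes the smaltalign executable is on the PATH of the active environment.

```python
import subprocess
from typing import List, Sequence


def build_smaltalign_command(fastq_dir: str, reference: str,
                             extra_options: Sequence[str] = ()) -> List[str]:
    """Assemble the argument list for one smaltalign call.

    extra_options is passed through unchanged; which options exist
    is not checked here.
    """
    return ["smaltalign", "-d", fastq_dir, "-r", reference, *extra_options]


def run_smaltalign(fastq_dir: str, reference: str) -> None:
    """Run smaltalign and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_smaltalign_command(fastq_dir, reference), check=True)


# Example (paths are placeholders):
# run_smaltalign("path/to/fastq_dir", "path/to/reference.fasta")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.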

bash script

If you would like to run the bash script and you are not sure about the closest reference sequence, run it with a set of probable reference sequences in fasta format in the <reference_file>. The script chooses the closest reference sequence from the given set and constructs the consensus sequence using the chosen reference.

Usage -r <reference_file> [options] <fastq_file/directory>

-r       reference_file 
-n INT   number of reads (default 200'000)
-i INT   iterations (default 4)

If you would like to give only one reference sequence in the <reference_file>, run:

Usage -r <reference_file> [options] <fastq_file/directory>
-r       reference_file (only one reference sequence)
-n INT   number of reads (default 200'000)
-i INT   iterations (default 4)

Used to run multiple samples in the current working directory with different references in one batch.
To analyse the results of a diagnostic sequencing run, the following steps need to be done:

This shell script was written to process Influenza sequences with SmaltAlign:

Usage is the same as in except that you don't need to enter the filenames.


wts.R is an R script to combine consensus sequence, variants and coverage for the last iteration of all lofreq.vcf files in a directory. It saves a _x_WTS.fasta file containing the consensus sequence with wobbles (at a certain threshold x) and a .csv file containing coverage and variant frequencies for every position. The variant threshold and the minimal coverage have to be adapted manually in the first lines of the script.
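
To illustrate what a wobble at a threshold means, the following Python sketch calls one position from its base frequencies using the standard IUPAC ambiguity codes. It is not a port of wts.R: the function name call_base is hypothetical, the rule "include every base at or above the threshold" is an assumption, and so is reporting N when the depth is below the minimal coverage.

```python
from typing import Dict

# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
IUPAC = {frozenset(bases): code for code, bases in _CODES.items()}


def call_base(freqs: Dict[str, float], depth: int,
              threshold: float, min_coverage: int) -> str:
    """Return the consensus symbol for one position.

    freqs maps A/C/G/T to their frequency (0..1) at this position.
    Assumption: positions with depth < min_coverage are reported as N.
    Assumption: every base with frequency >= threshold is included.
    """
    if depth < min_coverage:
        return "N"
    present = frozenset(b for b, f in freqs.items() if f >= threshold)
    if not present:
        return "N"
    return IUPAC[present]
```

For example, a position with 80% A and 20% G is called R at a threshold of 0.15, but A at a threshold of 0.25.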


cov_plot.R is an R script to plot and save the coverage of all iterations of all .depth files in the working directory.
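
For a quick look at the coverage without R, a .depth file can also be read in Python. The sketch below assumes the .depth files are plain samtools depth output, that is, tab-separated lines of reference name, position and depth; the function names read_depth and summarize are hypothetical and nothing is plotted.

```python
from typing import Dict, Iterable, List, Tuple


def read_depth(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Parse samtools depth lines into {reference: [(position, depth), ...]}."""
    coverage: Dict[str, List[Tuple[int, int]]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ref, pos, depth = line.split("\t")[:3]
        coverage.setdefault(ref, []).append((int(pos), int(depth)))
    return coverage


def summarize(positions: List[Tuple[int, int]]) -> Tuple[int, float, int]:
    """Return (minimum, mean, maximum) depth over the listed positions."""
    depths = [d for _, d in positions]
    return min(depths), sum(depths) / len(depths), max(depths)


# Example (file name is a placeholder):
# with open("sample.depth") as handle:
#     for ref, positions in read_depth(handle).items():
#         print(ref, summarize(positions))
```

Note that samtools depth omits positions with zero coverage unless told otherwise, so the mean here is taken over the reported positions only.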

