A consensus calling pipeline provided in two languages:
Initially, the pipeline was used to make quick alignments of fastq reads against a reference using smalt. Now it is mainly used for HIV and HCV consensus generation for diagnostics.
It does the following:
All the necessary references are in the References directory.
To ensure that all dependencies needed for SmaltAlign are installed, you can use the environment.yml file.
First, you need to have Conda installed.
The command conda env create -f <path>/environment.yml creates a copy of the smaltalign environment.
You enter the environment with conda activate smaltalign (and leave it with conda deactivate).
For more information, see "Managing environments" in the Conda documentation.
To install SmaltAlign, the classic python setup.py install or pip install . will work.
smaltalign -d <fastq_file_directory> -r <reference_file> [options]
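When the pipeline is driven from another Python script, the usage line above can be wrapped as in the following minimal sketch. Only the -d and -r flags come from the usage line; the function names are hypothetical, and the remaining [options] are left out because they are not listed here.

```python
import subprocess
from pathlib import Path


def build_smaltalign_cmd(fastq_dir, reference):
    """Return the argument list for the smaltalign entry point (nothing is executed)."""
    # -d and -r are the two flags shown in the usage line; other options are omitted.
    return ["smaltalign", "-d", str(Path(fastq_dir)), "-r", str(Path(reference))]


def run_smaltalign(fastq_dir, reference):
    """Run smaltalign; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_smaltalign_cmd(fastq_dir, reference), check=True)
```

Building the argument list separately from running it makes it easy to log or inspect the command before anything is executed.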
If you would like to run the bash script and are not sure about the closest reference sequence, run smaltalign_select_ref.sh with a set of probable reference sequences in <reference_file>, in fasta format. It chooses the closest reference sequence from the given set and constructs the consensus sequence using the chosen reference.
smaltalign_select_ref.sh -r <reference_file> [options] <fastq_file/directory>
OPTIONS
-r reference_file
-n INT number of reads (default 200'000)
-i INT iterations (default 4)
If you would like to give a single reference sequence in <reference_file>, run smaltalign.sh.
smaltalign.sh -r <reference_file> [options] <fastq_file/directory>
OPTIONS
-r reference_file (only one reference sequence)
-n INT number of reads (default 200'000)
-i INT iterations (default 4)
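The choice between the two shell scripts depends on how many sequences the reference file contains. The sketch below illustrates one way to automate that choice, assuming that counting FASTA header lines is a good enough test. The helper names are hypothetical; the flags and defaults (-r, -n 200'000, -i 4) are the ones listed in the options above.

```python
def count_fasta_records(fasta_path):
    """Count sequences in a FASTA file by counting header lines starting with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def build_script_cmd(reference, fastq, n_reads=200_000, iterations=4):
    """Pick the wrapper script from the number of references and build its arguments."""
    n_refs = count_fasta_records(reference)
    if n_refs == 0:
        raise ValueError(f"no FASTA records found in {reference}")
    # One reference: smaltalign.sh. Several candidates: let the script select one.
    script = "smaltalign.sh" if n_refs == 1 else "smaltalign_select_ref.sh"
    return [script, "-r", str(reference),
            "-n", str(n_reads), "-i", str(iterations), str(fastq)]
```

The returned list can be passed to subprocess.run once the scripts are on the PATH.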
batch.sh is used to run multiple samples in the current working directory with different references in one batch.
To analyse the results of a diagnostic sequencing run, the following steps need to be done:
1. Create a new folder in /data/Diagnostics/experiments/ with the date of the sequencing run (start date, yymmdd).
2. Create the symbolic links (ln -sv) and copy the SampleSheet.csv of that run.
3. Copy the batch.sh file into that new folder.
4. Add the filenames (see sampleID_to_filename.xltx) to the empty virus arrays in batch.sh, separated by a new line (this works if you copy from the Excel file).
5. Activate the environment (source activate smaltalign).
6. Run ./batch.sh
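The folder preparation in the first steps could be scripted roughly as follows. This is a sketch under assumptions: that the linked files are the run's *.fastq.gz files, and that the function name and all argument paths are hypothetical. Only the yymmdd folder name, SampleSheet.csv and batch.sh come from the steps above.

```python
import shutil
from pathlib import Path


def prepare_run_folder(experiments_dir, run_start, fastq_dir, sample_sheet, batch_script):
    """Create the yymmdd run folder, link the reads, copy SampleSheet.csv and batch.sh."""
    run_dir = Path(experiments_dir) / run_start.strftime("%y%m%d")
    run_dir.mkdir(parents=True, exist_ok=True)

    # Symlink each compressed fastq file instead of copying it (similar to `ln -sv`).
    # Assumption: the reads of the run are the *.fastq.gz files in fastq_dir.
    for fastq in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        link = run_dir / fastq.name
        if not link.is_symlink() and not link.exists():
            link.symlink_to(fastq.resolve())

    shutil.copy(sample_sheet, run_dir / "SampleSheet.csv")
    shutil.copy(batch_script, run_dir / "batch.sh")  # copy keeps the executable bit
    return run_dir
```

Filling the virus arrays in batch.sh remains a manual step in this sketch.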
This shell script was written to process Influenza sequences with SmaltAlign. It:
- takes the .fastq.gz files in the current directory
- runs select_ref.py (written by @ozagordi), which selects the best reference sequence for each segment from an Influenza reference database (selected sequences from the NCBI Influenza Virus Database)
- runs smaltalign.sh for each segment
- runs cov_plot.R and wts.R
Usage is the same as in batch.sh except that you don't need to enter the filenames.
wts.R is an R script that combines the consensus sequence, variants and coverage for the last iteration of all lofreq.vcf files in a directory.
It saves a _x_WTS.fasta file containing the consensus sequence with wobbles (at a certain threshold x) and a .csv file containing coverage and variant frequencies for every position.
The variant threshold and the minimal coverage have to be adapted manually in the first lines of the script.
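To illustrate what a consensus "with wobbles" at a threshold x means, here is a minimal Python sketch. It is not a port of wts.R: the function names are hypothetical, and both the input layout (per-position base frequencies plus coverage) and the choice to emit N below the minimal coverage are assumptions. Bases at or above the threshold are merged into one IUPAC ambiguity code.

```python
# IUPAC ambiguity codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_base(frequencies, coverage, *, threshold, min_coverage):
    """Return one consensus character; frequencies maps each base to its read fraction."""
    if coverage < min_coverage:
        return "N"  # assumption: positions with too few reads are masked
    present = frozenset(base for base, freq in frequencies.items() if freq >= threshold)
    return IUPAC.get(present, "N")  # an empty or unknown set also falls back to N


def wobble_consensus(positions, *, threshold, min_coverage):
    """Join the calls for a list of (frequencies, coverage) tuples into one sequence."""
    return "".join(
        call_base(freqs, cov, threshold=threshold, min_coverage=min_coverage)
        for freqs, cov in positions
    )
```

For example, a position with 60% A and 40% G at sufficient coverage is written as R when the threshold is 0.15, while a 10% minority base would be ignored.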
cov_plot.R is an R script that plots and saves the coverage of all iterations of all .depth files in the working directory.
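For a quick look at coverage without plotting, the .depth files can also be summarised in Python. The sketch below assumes the files use the three-column, tab-separated layout written by samtools depth (reference name, position, depth), which is not stated here; the function names are hypothetical.

```python
from pathlib import Path
from statistics import mean


def read_depth(path):
    """Read a three-column, tab-separated depth file into (position, depth) pairs."""
    pairs = []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            _ref, pos, depth = line.rstrip("\n").split("\t")[:3]
            pairs.append((int(pos), int(depth)))
    return pairs


def summarise_depths(directory="."):
    """Return {file name: (positions, mean depth, minimum depth)} for all .depth files."""
    summary = {}
    for path in sorted(Path(directory).glob("*.depth")):
        depths = [depth for _pos, depth in read_depth(path)]
        if depths:
            summary[path.name] = (len(depths), mean(depths), min(depths))
    return summary
```

Each iteration's file gets its own entry, so a drop in minimum depth between iterations is easy to spot.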
*maintainer ; **group leader