amplisim

Plain simple amplicon sequence simulator for in-silico genomic sequencing assays

Requirements
Installation
Operation
Input and output
Help

Requirements

TL;DR: no external requirements are needed. Both the recursive GitHub clone and the bioconda package should work out of the box.

🛠️ Details to build from source

The amplisim software is intended for 64-bit POSIX-compliant operating systems and was tested successfully under Ubuntu 22.04 LTS and macOS v12.5.1 (Monterey). Building amplisim from source requires the lzma, libbz2 and libcurl libraries on your system in order to compile htslib. Both Linux and macOS typically provide them via their respective package managers. See instructions below.

Installation

Build via conda

The easiest way to install amplisim is via the conda package manager from the bioconda channel. Please note that the conda installation is currently only available for Linux operating systems.

# create a new conda environment
conda create --name amplisim
# activate the new environment so the package is installed into it
conda activate amplisim
# install the latest amplisim version from the bioconda channel
conda install -c bioconda amplisim

Build from source

git clone --recursive https://github.com/Krannich479/amplisim.git
cd amplisim
mkdir build
make -C lib/htslib
make

🍎 macOS system dependencies

If you are working on an Apple workstation with macOS and want to build amplisim from source, you might be missing the system libraries for openssl and argp. These can be installed with the brew package manager via `brew install glib-openssl argp-standalone`.

Test your build (optional)

A quick and simple way to test your software binary is to download some public SARS-CoV-2 data and run amplisim on it.

mkdir testdata && cd testdata
wget https://raw.githubusercontent.com/artic-network/primer-schemes/master/nCoV-2019/V5.3.2/SARS-CoV-2.primer.bed
wget https://www.ebi.ac.uk/ena/browser/api/fasta/MN908947.3
sed 's/>ENA|MN908947|MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome./>MN908947.3/g' MN908947.3 > MN908947.3.fasta
cd ..
amplisim testdata/MN908947.3.fasta testdata/SARS-CoV-2.primer.bed
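
To sanity-check the test run, one option is to count the FASTA records that amplisim emits. The helper below is a minimal sketch and not part of amplisim; the function name count_fasta_records is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of amplisim): count FASTA records,
# i.e. lines starting with '>', in a file or on standard input.
count_fasta_records() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so '|| true' keeps the helper usable under 'set -e'.
    if [ "$#" -gt 0 ]; then
        grep -c '^>' "$1" || true
    else
        grep -c '^>' || true
    fi
}
```

For example, `amplisim testdata/MN908947.3.fasta testdata/SARS-CoV-2.primer.bed | count_fasta_records` should print a number greater than zero if the build works.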

Operation

Help page

The most concise way to get familiar with amplisim is to inspect the help page via amplisim --help. This displays:

Usage: amplisim [OPTION...] REFERENCE PRIMERS
amplisim -- a program to simulate amplicon sequences from a reference genome

  -m, --mean=INT             Set the mean number of replicates per amplicon
  -n, --sd=INT               Set the standard deviation for the mean number of
                             replicates per amplicon
  -o, --output=FILE          Output to FILE instead of standard output
  -s, --seed=INT             Set a random seed
  -x, --dropout=INT          Set the likelihood for an amplicon dropout [0,1]
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Report bugs to https://github.com/rki-mf1/amplisim/issues.

Minimal working examples

The minimal command to run amplisim takes a reference genome in FASTA format and a set of primers in BED format (see the section Input and output for more details). By default, amplisim prints the amplicon sequences to the standard output so that the user can either redirect the sequences to a file or pipe them to the next program.

amplisim <my_reference.fasta> <my_primers.bed> > <my_amplicons.fasta>

If you want amplisim to store the resulting amplicon sequences directly in a FASTA file you can use the -o option.

amplisim -o <my_amplicons.fasta> <my_reference.fasta> <my_primers.bed>
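
The options from the help page can be combined with either form. The sketch below wraps a seeded run in a hypothetical helper, simulate_amplicons. The variable AMPLISIM_BIN and the parameter values are illustrative assumptions, and it is also an assumption (the help text does not say so) that a fixed seed makes the output reproducible across runs.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around a seeded amplisim run (not part of amplisim).
# AMPLISIM_BIN is an assumed convenience variable that lets you point the
# wrapper at a non-default binary, e.g. a fresh local build.
AMPLISIM_BIN="${AMPLISIM_BIN:-amplisim}"

# Usage: simulate_amplicons REFERENCE PRIMERS OUTPUT [SEED]
simulate_amplicons() {
    # --seed, --mean, --sd and --output are taken from 'amplisim --help'.
    # The values 50 and 5 are arbitrary illustrative choices; the seed
    # defaults to 1 when no fourth argument is given.
    "$AMPLISIM_BIN" --seed="${4:-1}" --mean=50 --sd=5 \
        --output="$3" "$1" "$2"
}
```

Running the wrapper twice with the same seed and comparing the two output files with cmp is a quick way to check whether the reproducibility assumption holds for your build.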

Input and output

The primer file (input)

The PRIMERS input file is a plain tab-separated text file with pre-defined columns. The PRIMERS file required by amplisim has to comply with the following properties:

The BED format specification. That is, the first column is a chromosome identifier, and the second and third columns are the start and end index of a primer on that chromosome. The start index must always be strictly smaller than the end index.
The two primers of a pair (forward and reverse primer) are expected to be on consecutive lines in the file.
The chromosome identifiers have to be arranged in blocks. That is, irrespective of the order of the chromosomes, all primers of a particular chromosome have to occur consecutively in the file.

These format properties generally comply with the definitions in samtools but are slightly more stringent, as amplisim currently does not allow alternative primers in a pair. Suitable examples can be found in the artic-network repository for virus primer schemes, e.g. the primers for SARS-CoV-2.
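
The three properties listed above can be checked before running amplisim. The function below is a minimal sketch, not part of amplisim, and the name check_primer_bed is hypothetical. It assumes the file has no header, comment or empty lines. It also cannot tell forward from reverse primers, so it only verifies that each chromosome block holds an even number of lines.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of amplisim) for a primer BED file.
# Prints the first violation it finds and returns 1, otherwise returns 0.
check_primer_bed() {
    awk -F '\t' '
        NF < 3           { printf "line %d: fewer than 3 columns\n", NR; bad = 1; exit }
        $2 + 0 >= $3 + 0 { printf "line %d: start is not smaller than end\n", NR; bad = 1; exit }
        $1 != current {
            # A chromosome seen earlier must not reappear after another block.
            if ($1 in seen) { printf "line %d: chromosome %s is not in one block\n", NR, $1; bad = 1; exit }
            # The block that just ended must contain complete primer pairs.
            if (count % 2)  { printf "line %d: odd number of primers for %s\n", NR - 1, current; bad = 1; exit }
            seen[$1] = 1; current = $1; count = 0
        }
        { count++ }
        END {
            # Check the last block, unless an earlier rule already failed.
            if (!bad && count % 2) { printf "odd number of primers for %s\n", current; bad = 1 }
            exit bad
        }
    ' "$1"
}
```

For example, `check_primer_bed testdata/SARS-CoV-2.primer.bed && echo "primer file looks fine"` prints the message only if none of the checks fail.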

The reference file (input)

The REFERENCE input file is a standard text file in FASTA format which contains one or multiple records (chromosomes).

The amplicons (output)

The output of amplisim is a stream or plain text file in FASTA format. The header line of each amplicon sequence provides the following information:

>amplicon_<amplicon_index>_<replicate_index>

where <amplicon_index> is the i-th index (i=0...n-1) of the amplicons defined by n primer pairs and <replicate_index> is a unique index across all replicates of all amplicons. See schematic below.

Primer and amplicons scheme
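
Because the header encodes both indexes, the number of replicates per amplicon can be tallied directly from the output. The function below is a minimal sketch, not part of amplisim, and the name replicates_per_amplicon is hypothetical. It relies only on the header layout described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of amplisim): tally replicates per amplicon
# from headers of the form >amplicon_<amplicon_index>_<replicate_index>.
# Reads the given FASTA files, or standard input if none are given, and
# prints one line per amplicon: <amplicon_index> TAB <replicate count>.
replicates_per_amplicon() {
    awk -F '_' '
        # With "_" as field separator, field 2 is the amplicon index.
        /^>amplicon_/ { n[$2]++ }
        END { for (a in n) printf "%s\t%d\n", a, n[a] }
    ' "$@" | sort -n
}
```

For example, `amplisim <my_reference.fasta> <my_primers.bed> | replicates_per_amplicon` lists the replicate count for each amplicon index. Amplicons without any replicate in the output, for instance after a dropout, do not appear in the list.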

Help

For questions about amplisim, feature requests and bug reports please refer to the issues section of this repository.
