goldman-gp-ebi / SWAMPy

Other
12 stars 5 forks source link

SWAMPy

This project is intended to simulate SARS-CoV-2 metagenomes taken from wastewater samples. Synthetic mixtures of amplicons are produced, based on proportions of viral genomes that are supplied by the user and a supported primer set of choice. See our method to learn about how it works.

image

Installation

  1. Clone this repository.
git clone https://github.com/goldman-gp-ebi/SWAMPy.git

cd SWAMPy
  1. Install dependencies pandas and biopython. You need to ensure bowtie2 and art_illumina are available from your command line (i.e. both of the binaries of these tools are available from your $PATH environment variable). The program also requires Python version 3.9.7.

A way to do this on Ubuntu 20.04, with Python 3.9.7 and pip already installed, is below:

pip install pandas, biopython
sudo apt-get install art-nextgen-simulation-tools, bowtie2

Or you can also use conda as follows to create a new environment with the dependencies and correct versions (RECOMMENDED):

conda env create -f SWAMPy-env.yaml
conda activate SWAMPy

MAC USERS also need:

brew install coreutils

If you have a Mac with an M1 apple chip, the procedure is more annoying, sorry! In this circumstance, bioconda doesn't seem to work. Instead, the following process seems to work instead (though not everything ended up inside a conda environment...):

conda env create -f SWAMPy-env.yaml
# This should successfully install numpy, biopython, pandas, and gsl (2.7).
# If not then download them using pip. 
brew install coreutils
brew install bowtie2
# Install these two things system-wide(!).

Next, manually install art from source. Download ART-src-MountRainier-2016.06.05-MacOS.tgz (this is the current latest version as of December 2022, and has been since 2016), unzip the archive and run

./configure && make

and then move the art_illumina binary into wherever the conda environment lives (conda env list prints the path to the folder, inside which is a bin directory).

On Mac Sonoma I recently had to do this in a slightly hackier way - using the following (which is in the README for ART):

export \
    CFLAGS="$CFLAGS -I/opt/homebrew/Caskroom/miniforge/base/envs/swampy/include" \
    CPPFLAGS="$CPPFLAGS -I/opt/homebrew/Caskroom/miniforge/base/envs/swampy/include" \
    LDFLAGS="$LDFLAGS -L/opt/homebrew/Caskroom/miniforge/base/envs/swampy/lib"

This is where the lib and include paths of gsl were with my installation - use conda list while inside the active SWAMPy-env environment to find this path.

I then had to run ./configure && make && install_name_tool -add_rpath /opt/homebrew/Caskroom/miniforge/base/envs/swampy/lib ./art_illumina, the install name tool fixed an issue with LC_RPATH.

Optionally, run alias swampy='python /path/to/SWAMPy/src/simulate_metagenome.py' to use swampy as a command rather than python simulate_metagenome.py.

Docker Installation

The latest SWAMPy Docker image can be found on Docker Hub at https://hub.docker.com/r/wboulton12/swampy, and pulled through the command

docker pull wboulton12/swampy:latest

To run SWAMPy through Docker, mount the volumes from the input files (genomes and relative abundances) into the /swampy_in directory in the virtual filesystem, and mount the directory where output files are expected to /swampy_out. The command below is an example, where the output files are sent to the present working directory:

docker run -u $(id -u) -v $(pwd):/swampy_out -v /home/will/Desktop/SWAMPy/docker_example:/swampy_in wboulton12/swampy:latest --genomes_file /swampy_in/my_genomes.fasta --genome_abundances /swampy_in/my_abundances.tsv

# docker run (not as root, as your user), 
# (mount current directory to swampy_out, mount the folder containing genomes and abundance files to swampy_in),
# name of container,
# command line arguments passed to swampy. 
# Note that since the input and output folders are already mapped, you should not pass a --output_folder flag to swampy. 

Singularity Installation

Build the singularity image file after pulling SWAMPy from the Docker Hub:

singularity pull docker://wboulton12/swampy:latest

Then execute the sif file (use exec rather than run):

singularity exec swampy_latest.sif python3 /home/src/simulate_metagenome.py \
    --genomes_file /path/to/genomes.fasta \
    --genome_abundances /path/to/abundances.tsv \
    --temp_folder /some/temporary/folder/you/can/write/to \
    --output_folder $(pwd) 

Quickstart

Example input files i.e. genomes.fasta and abudances.tsv are already included in the example directory.

Run the example simulation

This creates a synthetic metagenome from the fasta files in the example/genomes folder, using relative genome proportions from the example/abundances.tsv. The primers used are from the ARTIC protocol v1; compared to v3, the primers marked as 'alt' are removed. For details, please see the potential bugs section below.

Run the simulator with default parameters

cd src
python simulate_metagenome.py --primer_set a1 --genomes_file ../example/genomes.fasta --genome_abundances ../example/abundances.tsv --output_folder ../simulation_output --temp_folder ../example/temp/ --autoremove

See the help page

python simulate_metagenome.py --help

Run the same example simulation again with all the parameters explicitly defined, and with the random seed set.

python simulate_metagenome.py \
    --genomes_file ../example/genomes.fasta \
    --temp_folder ../example/temp/
    --genome_abundances ../example/abundances.tsv \
    --primer_set a1 \
    --output_folder ../simulation_output \
    --output_filename_prefix example \
    --n_reads  100000 \
    --seqSys MSv3 \
    --read_length 250 \
    --seed 10 \
    --amplicon_distribution  dirichlet_1 \
    --amplicon_pseudocounts 200 \
    --unique_insertion_rate 0.00002 \
    --unique_deletion_rate 0.000115 \
    --unique_substitution_rate 0.002485 \
    --recurrent_insertion_rate 0.00002 \
    --recurrent_deletion_rate 0 \
    --recurrent_substitution_rate 0.003357 \
    --max_insertion_length 14 \
    --subs_VAF_alpha 0.36,0.27 \
    --del_VAF_alpha 0.59,0.42 \
    --ins_VAF_alpha 0.33,0.45 \
    --r_subs_VAF_alpha 0.36,0.27 \
    --r_del_VAF_alpha 0.59,0.42 \
    --r_ins_VAF_alpha 0.33,0.45 

example1.fastq and example2.fastq should appear in simulation_output folder, as well as example_amplicon_abundances_summary.tsv, example.log and example_PCR_erros.vcf. While running, some tmp files might appear in your working directory, but they will get cleaned up when the program terminates (even if it exits with an error).

Output files:

What these CLI arguments mean:

See the CLI arguments page.

Extra options and potential bugs

Things to watch out for:

Citation

If you used SWAMPy, please cite the publication SWAMPy: Simulating SARS-CoV-2 Wastewater Amplicon Metagenomes with Python.

Change log

Acknowledgements

We are using bowtie2, ART and biopython, as well as other python packages (pandas and numpy) and are using a primer set coming from ARTIC v1. The genomes in the example are courtesy of Twist Bioscience; these genomes are in turn copies of surveillance sequences from COG UK.


associated repos: Will's notes