Because simulating metagenomic (MG) data should be easy
MGEasySim is a pile of Python code that streamlines generating a simulated metagenome from a desired list of GTDB genomes.
Suppose you're developing a tool to analyze metagenomic sequencing of bacteria in hydrothermal vents. You've received one sample and inferred its composition using sylph, but you don't have the rest of the samples yet. Using the list of GTDB genomes that sylph found in your sample, you can use MGEasySim to generate as many simulated communities as you need to develop your metagenomics workflow.
We're on version 0.2.0: not much functionality has been added.
mamba create -f environment.yml
wget https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tsv.gz
# from repo base
pip install -e .
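Once the metadata file has been downloaded, it can be worth checking that it is a readable gzipped TSV before pointing MGEasySim at it. The following is a minimal sketch using only the Python standard library. The helper names `read_metadata_header` and `count_rows` are hypothetical and are not part of MGEasySim, and the sketch makes no assumption about which columns the file contains.

```python
import csv
import gzip


def read_metadata_header(path):
    """Return the column names from the first line of a gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return next(reader)


def count_rows(path):
    """Count data rows (excluding the header line) in a gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header line
        return sum(1 for _ in reader)
```

For example, `read_metadata_header("bac120_metadata.tsv.gz")` would list the available columns, assuming the file sits in the current working directory.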
config module
The package needs to know where your copy of the GTDB database is and where you want to write output files. To run the configuration module:
usage: mgeasysim config [-h] --gtdb GTDB --output OUTPUT [--threads THREADS] [--verbose VERBOSE]
options:
-h, --help show this help message and exit
--gtdb GTDB, -g GTDB Location of GTDB database
--output OUTPUT, -o OUTPUT
Location of outfile
--threads THREADS, -@ THREADS
Number of threads
--verbose VERBOSE, -v VERBOSE
Verbosity
To set the output path and path to the GTDB database, run the command:
mgeasysim config -g [path to GTDB folder] -o [output path] -@ [threads]
This writes a config.yaml file to the mgeasysim/ folder, which is readable by all modules, enabling global access to important variables.
Each time the config module is imported, it loads the current configuration file from the package directory (mgeasysim/config.yaml) and prints the contents:
Using config:
{'database': {'gtdb_loc': '/Users/michaelhoffert/Documents/mgeasysim/GTDB_r220'},
'info': {'software': 'mgeasysim'},
'locations': {'matches_path': '/Users/michaelhoffert/Documents/mgeasysim/output_test/matches.tsv.gz',
'output': '/Users/michaelhoffert/Documents/mgeasysim/output_test'},
'parameters': {'n_sims': 3,
'n_species': 15,
'n_strains': 3,
'power_a': 0.5,
'verbose': True},
'runtime': {'threads': 2}}
Unfortunately, this means right now you can only have one active configuration at a time.
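To illustrate how a module might read values out of a configuration shaped like the one printed above, here is a small sketch. It is not the package's actual code: the dictionary is a hand-written stand-in for the parsed config.yaml, the paths are placeholders, and `get_setting` is a hypothetical helper.

```python
# Hypothetical stand-in for the parsed contents of mgeasysim/config.yaml.
# The paths are placeholders, not real locations.
config = {
    "database": {"gtdb_loc": "/path/to/GTDB_r220"},
    "info": {"software": "mgeasysim"},
    "locations": {
        "matches_path": "/path/to/output/matches.tsv.gz",
        "output": "/path/to/output",
    },
    "parameters": {
        "n_sims": 3,
        "n_species": 15,
        "n_strains": 3,
        "power_a": 0.5,
        "verbose": True,
    },
    "runtime": {"threads": 2},
}


def get_setting(cfg, section, key, default=None):
    """Look up cfg[section][key], returning default if either level is missing."""
    return cfg.get(section, {}).get(key, default)


threads = get_setting(config, "runtime", "threads", default=1)
n_species = get_setting(config, "parameters", "n_species")
print(threads, n_species)  # prints "2 15"
```

Returning a default instead of raising keeps the lookup tolerant of a config file written before a given parameter existed, which is one possible design choice rather than a description of what MGEasySim does.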
community module
usage: mgeasysim community [-h] --taxlist TAXLIST [--n_sims N_SIMS] [--n_species N_SPECIES] [--power_a POWER_A] [--n_strains N_STRAINS]
options:
-h, --help show this help message and exit
--taxlist TAXLIST, -t TAXLIST
File to define community
--n_sims N_SIMS, -n N_SIMS
Number of simulations
--n_species N_SPECIES, -s N_SPECIES
Number of species in each simulation
--power_a POWER_A, -a POWER_A
Power distribution A parameter
--n_strains N_STRAINS, -x N_STRAINS
Number of same-species strains to include in simulation
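The help text above only says that --power_a is a "power distribution A parameter". As a rough illustration of what such a parameter can control, the sketch below draws relative abundances from a power-function distribution (density a * x ** (a - 1) on [0, 1]) and normalizes them to sum to 1. Whether MGEasySim samples abundances this way is an assumption, and `power_abundances` is a hypothetical function name.

```python
import random


def power_abundances(n_species, power_a, seed=None):
    """Draw n_species relative abundances from a power-function distribution.

    Uses inverse-transform sampling: if U ~ Uniform(0, 1), then U ** (1 / a)
    has density a * x ** (a - 1) on [0, 1]. Draws are normalized to sum to 1.
    """
    if n_species < 1:
        raise ValueError("n_species must be at least 1")
    if power_a <= 0:
        raise ValueError("power_a must be positive")
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which avoids drawing an exact zero.
    draws = [(1.0 - rng.random()) ** (1.0 / power_a) for _ in range(n_species)]
    total = sum(draws)
    return [d / total for d in draws]


# Values mirror the n_species and power_a entries in the example config.
abundances = power_abundances(n_species=15, power_a=0.5, seed=0)
```

Under this assumption, values of a below 1 push most draws toward zero, giving a community dominated by a few abundant species, while values above 1 give more even communities.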
Arguments:
simulate module
usage: mgeasysim simulate [-h] [--n_reads N_READS] [--alt_dbs ALT_DBS]
options:
-h, --help show this help message and exit
--n_reads N_READS, -n N_READS
Number of reads per simulation
--alt_dbs ALT_DBS, -a ALT_DBS
whether to simulate alternate genome databases
The n_reads option controls the number of reads per simulation (all other properties are set in the community step). The alt_dbs option creates several alternate databases; it exists because this package began its life as a method of testing the sylph metagenomic profiler. The alt dbs include: