Open ndreey opened 1 year ago
NOTES FROM THE PAPER
CAMISIM allows customization of many properties. These settings are determined by the configuration file `config.ini`.
CAMISIM works in three stages:
1. Designing community

- `replicates` mode: Generates metagenome data sets with multiple samples that have similar genome abundance.
- `differential` mode: Generates differential abundance metagenome samples based on each sample's `abundance*.tsv`.

2. Metagenome simulation

- Based on `abundance.tsv`.
- The number of reads for a taxon, `nt`, is determined by multiplying the abundance of that taxon, `abt`, by the total number of reads in the sample, `n`, and dividing by the genome size of that taxon, `st`.

3. Gold standard creation and postprocessing

- `mpileup` from `SAMtools` is used to identify all genomic regions with a coverage of at least one.
- `mpileup` takes a sorted BAM file as input and generates a pileup file that shows the read coverage at each position in the reference genome.

Example of pileup file from `mpileup`:

#CHROM POS REF COV READS QUALS
chr1 1 A 2 AG FE
chr1 2 C 3 CCA FFA
chr1 3 G 1 T E
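
To make the coverage step more concrete, here is a minimal Python sketch that reads pileup-style rows like the example above and collects runs of consecutive positions with a coverage of at least one. The function name `covered_regions` and the in-memory example are assumptions for illustration only; this is not taken from CAMISIM, and real `samtools mpileup` output does not carry a header line, which is why lines starting with `#` are skipped.

```python
from typing import Iterable, List, Tuple


def covered_regions(lines: Iterable[str], min_cov: int = 1) -> List[Tuple[str, int, int]]:
    """Return (contig, start, end) runs of consecutive positions with coverage >= min_cov."""
    regions: List[Tuple[str, int, int]] = []
    current = None  # [contig, start, end] of the run currently being extended
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the illustrative header
        fields = line.split()
        contig, pos, cov = fields[0], int(fields[1]), int(fields[3])
        if cov < min_cov:
            continue
        if current and current[0] == contig and pos == current[2] + 1:
            current[2] = pos  # extend the current run by one position
        else:
            if current:
                regions.append(tuple(current))
            current = [contig, pos, pos]
    if current:
        regions.append(tuple(current))
    return regions


example = """#CHROM POS REF COV READS QUALS
chr1 1 A 2 AG FE
chr1 2 C 3 CCA FFA
chr1 3 G 1 T E
"""
print(covered_regions(example.splitlines()))
```

For the three example rows this reports a single covered region on `chr1` spanning positions 1 to 3.
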
The headings in this pileup file are:

- `#CHROM`: the chromosome, contig, or genome name.
- `POS`: the position in the chromosome or contig.
- `REF`: the reference base at that position.
- `COV`: the read coverage at that position.
- `READS`: the read bases that map to that position.
- `QUALS`: the base qualities for the reads that map to that position.

_Notes from the GITHUB MANUAL and the files from the SETUP ARCHIVE_

- `genome_to_ids.tsv`: TSV file without header that holds `genome_ID` and `path` to reference genome.
- `metadata.tsv`: TSV file with header that holds `genome_ID`, operational taxonomic unit assignment `OTU`, taxid `NCBI_ID` and `novelty_category`.

genome_ID OTU NCBI_ID novelty_category
Otu522.1 41294 374 new_species
Otu522.0 41294 85413 new_species
Aspergillus_fumigatus_MPI-SW4-AT-0569 746128 746128 known_strain
Phialocephala_fortinii_MPI-SW4-AT-0651 62722 62722 known_strain
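
Since both files are keyed by `genome_ID`, a quick consistency check can catch genomes that appear in the metadata but have no reference path. The sketch below is only an illustration: the helper names (`read_metadata`, `read_genome_to_ids`, `missing_genomes`) and the FASTA path are hypothetical, and it assumes both files are tab-separated as described above.

```python
import csv
import io


def read_metadata(handle):
    """Parse metadata.tsv (with header) into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def read_genome_to_ids(handle):
    """Parse genome_to_ids.tsv (no header) into {genome_ID: path}."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}


def missing_genomes(metadata_rows, genome_paths):
    """Return genome_IDs listed in the metadata that have no reference path."""
    return [row["genome_ID"] for row in metadata_rows if row["genome_ID"] not in genome_paths]


metadata_text = (
    "genome_ID\tOTU\tNCBI_ID\tnovelty_category\n"
    "Otu522.1\t41294\t374\tnew_species\n"
    "Otu522.0\t41294\t85413\tnew_species\n"
)
# Hypothetical path, only for illustration.
genome_text = "Otu522.1\t/path/genomes/Otu522.1.fasta\n"

rows = read_metadata(io.StringIO(metadata_text))
paths = read_genome_to_ids(io.StringIO(genome_text))
print(missing_genomes(rows, paths))
```

Here `Otu522.0` is reported because the illustrative `genome_to_ids.tsv` content only lists `Otu522.1`.
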
- `abundance*.tsv`: File that specifies the relative abundance, which seems to be kind of tricky to define.

Example from `sample_1/abundance2.tsv`:

Cladosporium_rectoides_MPI-GEGE-AT-0032 1.145203067508997
Gibellulopsis_nigrescens_MPI-SP2-AT-0410 11.305394360852437
Metacordyceps_chlamydosporia_MPI-IT2-AT-0323 1.4393342015492996
Ochroconis_tshawytschae_MPI-SP2-AT-0416 0.10262961014917946
Phomopsis_columnaris_MPI-SP2-AT-0504 0.0972154941002939
Umbelopsis_autotrophica_MPI-SW4-AT-0611 0.02325594896697013
Embellisia_chlamydospora_MPI-FR1-AT-0336 0.9337332247335333
Fusarium_oxysporum_MPI-PUGE-AT-0057 0.375901531
Otu22.0 44.0
Otu96.0 23.0
Otu1087 0.0
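
The values in the example do not sum to one, so they look like relative weights. As a way to inspect what share each genome would get, the sketch below parses such a file and rescales the values to fractions. This rests on the assumption (based on the example, not confirmed by the manual) that only the ratios between values matter; the helper names `read_abundance` and `to_fractions` are made up for this note.

```python
import io


def read_abundance(handle):
    """Parse an abundance file (genome_ID, whitespace, value; no header) into a dict."""
    abundance = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        genome_id, value = line.split()
        abundance[genome_id] = float(value)
    return abundance


def to_fractions(abundance):
    """Scale the values so they sum to 1."""
    total = sum(abundance.values())
    if total <= 0:
        raise ValueError("abundances must sum to a positive number")
    return {genome_id: value / total for genome_id, value in abundance.items()}


# Last three rows of the example above.
example = "Otu22.0\t44.0\nOtu96.0\t23.0\nOtu1087\t0.0\n"
fractions = to_fractions(read_abundance(io.StringIO(example)))
for genome_id, share in fractions.items():
    print(f"{genome_id}\t{share:.4f}")
```
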
HOW I UNDERSTAND THE ABUNDANCE CALCULATION

The abundance is calculated based on the total sum of genome sizes.

- Take the genomes `G1, G2, G3, ..., G10, Orchid`, where I want `Orchid` to have an abundance of 50%.
- `G1:G5` is 1000bp each, `G6:G10` is 1500bp each and `Orchid` is 12000bp.

Genome Size
G1 1000
G2 1000
...
G6 1500
G7 1500
...
Orchid 12000
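- Give every genome except `Orchid` the same abundance: `abu = 1 / (number of genomes - 1)`, which is `1 / 10 = 0.1` for the 11 genomes here.
- Set `Orchid` to 0.5.
- `abu_tot = abu_orchid + sum(abu_G1:abu_G10)`, which is `0.5 + 1.0 = 1.5`.
- `nrm_abu_orchid = abu_orchid / abu_tot = 0.5 / 1.5 = 0.3333`
- Each of `G1:G10` is normalized the same way: `0.1 / 1.5 = 0.0667`.
- The values in the `abundance.tsv` file then sum to one: `0.0667 x 10 + 0.3333 ~ 1`

Genome Abundance
G1 0.0667
G2 0.0667
...
G10 0.0667
Orchid 0.3333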
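
The arithmetic of this worked example can be checked with a few lines of Python. This is only a sketch of the normalization as I describe it in these notes, not CAMISIM code; the function name `orchid_example` and its parameters are hypothetical.

```python
def orchid_example(n_background=10, orchid_raw=0.5):
    """Equal raw abundance for the background genomes, a fixed raw value
    for Orchid, then normalization so all values sum to 1."""
    abu = 1 / n_background  # 1 / (number of genomes - 1)
    raw = {f"G{i}": abu for i in range(1, n_background + 1)}
    raw["Orchid"] = orchid_raw
    abu_tot = sum(raw.values())
    return {name: value / abu_tot for name, value in raw.items()}


normalized = orchid_example()
for name in ("G1", "G10", "Orchid"):
    print(f"{name}\t{normalized[name]:.4f}")
```

With a raw value of 0.5, `Orchid` ends up at one third after normalization. Calling `orchid_example(orchid_raw=1.0)` gives `Orchid` a share of 0.5, because its raw value then equals the sum of the other ten genomes.
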
Good info on these issues
MAIN

- `seed`: Sets the RNG seed so that runs are reproducible.
- `phase`: Full run (0), Only community design (1), Read Simulation (2)
- `max_processors`: Number of cores to run on.
- `dataset_id`: ID for created sample data.
- `output_directory`: Set path to output directory. Will create if it does not exist.
- `temp_directory`: Set path to store tmp files.
- `gsa`: True/Yes or False/No to create gold standard assembly for each sample.
- `pooled_gsa`: True/Yes or False/No to create GSA for all samples combined.
- `anonymous`: Set to No/False as it is not relevant for this project.
- `compress`: Set value between 0-9, 0 is No compression, 9 is max. Recommended to set at least 1.

ReadSimulator

- `readsim`: ART comes with CAMISIM, set path to `path/tools/art_illumina-2.3.6/art_illumina`
- `error_profiles`: Path to the error profiles, set `path/tools/art_illumina-2.3.6/profiles`
- `samtools`: SAMtools v1.3 comes with CAMISIM, set path to `path/tools/samtools-1.3/samtools`
- `profile`: Select error profile, recommended `mbarc`.
- `size`: Size of each sample in Gigabasepairs (Gbp). One Gbp equaled one GB in size for the rhizosphere mock data.
- `type`: Set to `art`.
- `fragments_size_mean`: Mean size of the fragments. Set 270.
- `fragments_size_standard_deviation`: SD of fragments. Set 27.

CommunityDesign

- `distribution_file_paths`: Path to each `abundanceX.tsv` file as a list. `['path/abundance0.tsv', 'path/abundance1.tsv', 'path/abundance2.tsv']`
- `ncbi_taxdump`: Database containing the hierarchical classification of all known organisms. Set to `path/tools/ncbi-taxonomy_20170222.tar.gz`
- `strain_simulation_template`: Path to a template.tree for the sgEvolver from the mauve suite. `scripts/StrainSimulationWrapper/sgEvolver/simulation_dir`
- `number_of_samples`: Must match the number of abundance profiles.

community0

- `metadata`: Path to the `metadata.tsv`
- `id_to_genome_file`: Path to the `genome_to_ids.tsv`
- `id_to_gff_file`: Optional, set it to nothing.
- `genomes_total`: Total number of simulated genomes.
- `genomes_real`: Number of genomes used as input.
- `max_strains_per_otu`: Max number of strains drawn from genomes belonging to single OTU.
- `ratio`: Ratio between different communities. Set to 1.
- `mode`: Either `replicates` or `differential` should be used for the project.
- `log_mu=1`
- `log_sigma=2`
- `gauss_mu=1`: Used with timeseries, so set to 1.
- `gauss_sigma=1`: Used with timeseries, so set to 1.
- `view`: Set to `False` as we don't need to see the distribution of genomes.

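
To see how these settings could fit together in one `config.ini`, here is a sketch that builds the four sections with Python's `configparser` and checks that `number_of_samples` matches the number of abundance profiles. Several things are assumptions and not taken from the manual: the exact section names (written as in the headings above, with `Main` for MAIN), joining the abundance paths with commas, and every value marked hypothetical in the comments. Treat it as a starting point to compare against the config shipped in the setup archive.

```python
import configparser
import io

# Hypothetical abundance profiles; replace with real paths.
abundance_files = ["path/abundance0.tsv", "path/abundance1.tsv", "path/abundance2.tsv"]

config = configparser.ConfigParser()
config["Main"] = {  # section name assumed
    "seed": "42",  # hypothetical value
    "phase": "0",  # full run
    "max_processors": "4",  # hypothetical value
    "dataset_id": "mock",  # hypothetical value
    "output_directory": "path/out",  # hypothetical path
    "temp_directory": "path/tmp",  # hypothetical path
    "gsa": "True",
    "pooled_gsa": "True",
    "anonymous": "False",
    "compress": "1",
}
config["ReadSimulator"] = {
    "readsim": "path/tools/art_illumina-2.3.6/art_illumina",
    "error_profiles": "path/tools/art_illumina-2.3.6/profiles",
    "samtools": "path/tools/samtools-1.3/samtools",
    "profile": "mbarc",
    "size": "1",  # hypothetical sample size in Gbp
    "type": "art",
    "fragments_size_mean": "270",
    "fragments_size_standard_deviation": "27",
}
config["CommunityDesign"] = {
    # Comma-separated paths are an assumption about the expected format.
    "distribution_file_paths": ",".join(abundance_files),
    "ncbi_taxdump": "path/tools/ncbi-taxonomy_20170222.tar.gz",
    "strain_simulation_template": "scripts/StrainSimulationWrapper/sgEvolver/simulation_dir",
    "number_of_samples": str(len(abundance_files)),
}
config["community0"] = {
    "metadata": "path/metadata.tsv",
    "id_to_genome_file": "path/genome_to_ids.tsv",
    "id_to_gff_file": "",
    "genomes_total": "11",  # hypothetical value
    "genomes_real": "11",  # hypothetical value
    "max_strains_per_otu": "1",  # hypothetical value
    "ratio": "1",
    "mode": "differential",
    "log_mu": "1",
    "log_sigma": "2",
    "gauss_mu": "1",
    "gauss_sigma": "1",
    "view": "False",
}


def check_sample_count(cfg):
    """True if number_of_samples equals the number of listed abundance files."""
    design = cfg["CommunityDesign"]
    paths = [p for p in design["distribution_file_paths"].split(",") if p]
    return len(paths) == design.getint("number_of_samples")


buffer = io.StringIO()
config.write(buffer, space_around_delimiters=False)  # key=value, as in the notes
print(buffer.getvalue())
print("sample count consistent:", check_sample_count(config))
```
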
The setup files `rhimgCAMI2_setup.tar.gz` and the genomes `rhimgCAMI2_genomes.tar.gz` used for the A. thaliana rhizosphere mock data have been downloaded.

wget -P /path/ https://frl.publisso.de/data/frl:6425521/plant_associated/short_read/rhimgCAMI2_setup.tar.gz
wget -P /path/ https://frl.publisso.de/data/frl:6425521/plant_associated/rhimgCAMI2_genomes.tar.gz
Read through and understand how these files can be edited to meet my needs.