Grasping CAMISIM - GitHub issues

ndreey commented 1 year ago

The setup files rhimgCAMI2_setup.tar.gz and the genomes rhimgCAMI2_genomes.tar.gz used for the A. thaliana rhizosphere mock data have been downloaded.

wget -P /path/ https://frl.publisso.de/data/frl:6425521/plant_associated/short_read/rhimgCAMI2_setup.tar.gz
wget -P /path/ https://frl.publisso.de/data/frl:6425521/plant_associated/rhimgCAMI2_genomes.tar.gz

Read through and understand how these files can be edited to meet my needs.
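
Before editing anything, it helps to see what is inside the archives. A minimal sketch using Python's standard `tarfile` module; the function name `list_archive` and the `limit` parameter are my own, not part of CAMISIM.

```python
import tarfile

def list_archive(path, limit=20):
    """Return up to `limit` member names from a .tar.gz archive without extracting it."""
    names = []
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names

# Usage (path is a placeholder, as in the wget commands above):
# for name in list_archive("/path/rhimgCAMI2_setup.tar.gz"):
#     print(name)
```

Listing first avoids unpacking the much larger genomes archive just to find the config and abundance files.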

CAMISIM GitHub
Fritz, Hofmann, et al. (2019). CAMISIM: Simulating metagenomes and microbial communities. Microbiome, 2019, 7:17. doi:10.1186/s40168-019-0633-6

ndreey commented 1 year ago

NOTES FROM THE PAPER

CAMISIM allows customization of many properties:

The overall number of genomes (community complexity)
Strain diversity
The community genome abundance distributions
Sample sizes
Number of replicates
NGS or 3GS technologies

These settings are determined by the configuration file config.ini.

CAMISIM works in three stages:

Design of the community, which includes the selection of the community members and their genomes, and assigning them relative abundances.
Metagenome sequencing data simulation.
Postprocessing, where the binning and assembly gold standards are produced.

1. Designing community

replicates mode: generates metagenome data sets with multiple samples that have similar genome abundances.
differential mode: generates differential abundance metagenome samples based on each sample's abundance*.tsv

2. Metagenome simulation

Metagenome data sets are generated from the genome abundance profiles, abundance.tsv
The number of reads generated for a particular taxon, n_t, is determined by its abundance ab_t, its genome size s_t, and the total number of reads in the sample, n: n_t = n x (ab_t x s_t) / sum over all taxa i of (ab_i x s_i). Reads are therefore allocated in proportion to abundance times genome size.
ART is used to create Illumina 2 x 150 bp paired-end reads with a HiSeq 2500 error profile.
FASTQ and BAM files are generated for each data set. The BAM file specifies the alignment of the simulated reads to the reference genomes.
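
A minimal sketch of the read allocation, assuming reads are split in proportion to abundance times genome size, i.e. n_t = n * ab_t * s_t / sum(ab_i * s_i). This is how I read the formula in the CAMISIM paper, so treat it as an assumption; the function name and the toy numbers are made up.

```python
def reads_per_genome(abundance, genome_size, total_reads):
    """Split total_reads across genomes in proportion to abundance * genome size."""
    weights = {g: abundance[g] * genome_size[g] for g in abundance}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all weights are zero")
    return {g: round(total_reads * w / total) for g, w in weights.items()}

# Toy example: G1 is rarer but three times larger, so both get the same share.
abundance = {"G1": 0.25, "G2": 0.75}
genome_size = {"G1": 3000, "G2": 1000}
print(reads_per_genome(abundance, genome_size, 1000))  # {'G1': 500, 'G2': 500}
```

The point of the toy numbers is that a large genome at low abundance can still take as many reads as a small genome at high abundance.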

3. Gold standard creation and postprocessing

CAMISIM generates the assembly and binning gold standards using the FASTQ and BAM files.
Assembly gold standards are created by identifying the "perfect/error free" contigs.
- mpileup from SAMtools is used to identify all genomic regions with a coverage of at least one.
- mpileup takes a sorted BAM file as input and generates a pileup file that shows the read coverage at each position in the reference genome.
- CAMISIM then extracts these regions as error-free contigs. The error-free contigs from each sample are then concatenated together to form the gold standard metagenome.
One can now benchmark different assemblers by running them on the same simulated reads and comparing their assemblies against the gold standard metagenome.

CAMISIM_Fig1

ndreey commented 1 year ago

Example of a pileup file from mpileup (the header row is added here for readability; mpileup output itself has none)

#CHROM  POS     REF     COV     READS   QUALS
chr1    1       A       2       AG      FE
chr1    2       C       3       CCA     FFA
chr1    3       G       1       T       E

The headings in this pileup file are:

#CHROM: the chromosome or contig name or genome.
POS: the position in the chromosome or contig
REF: the reference base at that position
COV: the read coverage at that position
READS: the read bases that map to that position
QUALS: the base qualities for the reads that map to that position
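
To see how "regions with a coverage of at least one" could be pulled out of such a file, here is a minimal sketch. It is not CAMISIM's own code: the function name `covered_regions` is hypothetical, and it assumes whitespace-separated columns in the order shown above with positions sorted per contig.

```python
def covered_regions(pileup_lines, min_cov=1):
    """Merge consecutive positions with coverage >= min_cov into (contig, start, end)."""
    regions = []
    current = None  # [contig, start, end] of the region being extended
    for line in pileup_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the illustrative header row
        fields = line.split()
        contig, pos, cov = fields[0], int(fields[1]), int(fields[3])
        if cov < min_cov:
            continue
        if current and current[0] == contig and pos == current[2] + 1:
            current[2] = pos  # extend the open region by one position
        else:
            if current:
                regions.append(tuple(current))
            current = [contig, pos, pos]
    if current:
        regions.append(tuple(current))
    return regions

example = """#CHROM  POS     REF     COV     READS   QUALS
chr1    1       A       2       AG      FE
chr1    2       C       3       CCA     FFA
chr1    3       G       1       T       E
""".splitlines()
print(covered_regions(example))  # [('chr1', 1, 3)]
```

Each returned interval corresponds to one candidate error-free contig: the reference sequence between start and end, inclusive.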

ndreey commented 1 year ago

_Notes from the GITHUB MANUAL and the files from the SETUP ARCHIVE_

genome_to_ids.tsv: TSV file without header that holds genome_ID and path to reference genome.

metadata.tsv: TSV file with header that holds genome_ID, operational taxonomic unit assignment OTU, taxid NCBI_ID and novelty_category.

genome_ID   OTU NCBI_ID novelty_category
Otu522.1    41294   374 new_species
Otu522.0    41294   85413   new_species
Aspergillus_fumigatus_MPI-SW4-AT-0569   746128  746128  known_strain
Phialocephala_fortinii_MPI-SW4-AT-0651  62722   62722   known_strain
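
When editing these two files by hand it is easy to leave a genome in metadata.tsv that has no path in genome_to_ids.tsv. A minimal sketch of a consistency check, assuming both files are tab-separated as described above; the function names and the example path are hypothetical.

```python
import csv
import io

def read_metadata(handle):
    """Parse metadata.tsv (tab-separated, with header) into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

def read_genome_paths(handle):
    """Parse genome_to_ids.tsv (tab-separated, no header) into {genome_ID: path}."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}

def missing_genomes(metadata_rows, genome_paths):
    """Return genome_IDs listed in metadata but absent from genome_to_ids."""
    return [r["genome_ID"] for r in metadata_rows if r["genome_ID"] not in genome_paths]

metadata = io.StringIO(
    "genome_ID\tOTU\tNCBI_ID\tnovelty_category\n"
    "Otu522.1\t41294\t374\tnew_species\n"
    "Otu522.0\t41294\t85413\tnew_species\n"
)
paths = io.StringIO("Otu522.1\t/path/genomes/Otu522.1.fasta\n")  # hypothetical path
print(missing_genomes(read_metadata(metadata), read_genome_paths(paths)))  # ['Otu522.0']
```

With real files, pass `open("metadata.tsv", newline="")` and `open("genome_to_ids.tsv", newline="")` instead of the in-memory strings.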

abundance*.tsv: File that specifies the relative abundance of each genome, which seems to be kind of tricky to define. Example from sample_1/abundance2.tsv

Cladosporium_rectoides_MPI-GEGE-AT-0032 1.145203067508997
Gibellulopsis_nigrescens_MPI-SP2-AT-0410    11.305394360852437
Metacordyceps_chlamydosporia_MPI-IT2-AT-0323    1.4393342015492996
Ochroconis_tshawytschae_MPI-SP2-AT-0416 0.10262961014917946
Phomopsis_columnaris_MPI-SP2-AT-0504    0.0972154941002939
Umbelopsis_autotrophica_MPI-SW4-AT-0611 0.02325594896697013
Embellisia_chlamydospora_MPI-FR1-AT-0336    0.9337332247335333
Fusarium_oxysporum_MPI-PUGE-AT-0057 0.375901531
Otu22.0 44.0
Otu96.0 23.0
Otu1087 0.0

HOW I UNDERSTAND THE ABUNDANCE CALCULATION

The abundance is calculated based on the total sum of genome sizes.

Say I have the genomes G1, G2, G3, ..., G10 and Orchid, and I want Orchid to have an abundance of 50%.

G1 to G5 are 1000 bp each, G6 to G10 are 1500 bp each, and Orchid is 12000 bp.

Genome    Size
G1        1000
G2        1000
...   
G6        1500
G7        1500
...    
Orchid    12000

Calculate the total genome size.
- tot = 1000 x 5 + 1500 x 5 + 12000 = 24500 bp
Calculate the starting abundance value for each genome other than Orchid.
- abu = 1 / (number of genomes - 1)
- There are 11 genomes in total, so 11 - 1 = 10 genomes besides Orchid.
- abu for G1 to G10 is 1/10 = 0.1
Set the abundance value of Orchid to 0.5.
Calculate the total abundance value for all genomes.
- abu_tot = abu_orchid + sum(abu_G1:abu_G10)
- abu_tot = 0.5 + 10 x 0.1 = 1.5
Normalize the abundance values so they sum to 1.
- nrm_abu_orchid = abu_orchid / abu_tot = 0.5 / 1.5 = 0.3333
- nrm_abu_Gn = 0.1 / 1.5 = 0.0667
BOOOM there is your relative abundance. NOTE that after normalization Orchid sits at 0.3333, not the 50% I aimed for; to land on 50%, G1 to G10 would have to share the remaining 0.5 (0.05 each). Also NOTE there should not be a heading row in the abundance.tsv file.
However, it seems that the abundances don't have to sum to 1, as can be seen in the example above. Doing it this way, they do sum to 1: 0.0667 x 10 + 0.3333 ~ 1
```
Genome    Abundance
G1        0.0667
G2        0.0667
... 
G10       0.0667
Orchid    0.3333
```
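
If the goal is for one genome to sit at an exact fraction (say Orchid at 50%), one option is to fix its share first and split the remainder evenly over the other genomes. A minimal sketch of that alternative; the function names are my own, and the even split across the other genomes is an assumption, not something CAMISIM requires.

```python
def abundances_with_target(genomes, target, fraction):
    """Give `target` the requested fraction and split the rest evenly over the others."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    others = [g for g in genomes if g != target]
    if not others:
        raise ValueError("need at least one genome besides the target")
    share = (1 - fraction) / len(others)
    table = {g: share for g in others}
    table[target] = fraction
    return table

def to_tsv(table):
    """Render genome<TAB>abundance lines with no header row."""
    return "".join(f"{g}\t{a:.4f}\n" for g, a in table.items())

genomes = [f"G{i}" for i in range(1, 11)] + ["Orchid"]
table = abundances_with_target(genomes, "Orchid", 0.5)
print(to_tsv(table), end="")  # G1..G10 at 0.0500, Orchid at 0.5000
```

The output has no heading row, so it can be written straight to an abundance.tsv file.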

Good info on these issues

ndreey commented 1 year ago

Configuration File

MAIN

seed: Seed for the random number generator, so that runs are reproducible.
phase: Full run (0), Only community design (1), Read Simulation (2)
max_processors: Number of cores to run on
dataset_id: ID for created sample data
output_directory: Set path to output directory. Will create if it does not exist.
temp_directory: Set path to store tmp files.
gsa: True/Yes or False/No to create gold standard assembly for each sample.
pooled_gsa: True/Yes or False/No to create a GSA for all samples combined.
anonymous: Set to No/False as it is not relevant for this project.
compress: Set value between 0-9, 0 is No compression, 9 is max. Recommended to set at least 1.

ReadSimulator

readsim: ART comes with CAMISIM, set path to path/tools/art_illumina-2.3.6/art_illumina
error_profiles: Path to the error profiles, set path/tools/art_illumina-2.3.6/profiles
samtools: SAMtools v1.3 comes with CAMISIM, set path to path/tools/samtools-1.3/samtools
profile: Select error profile, recommended mbarc.
size: Size of each sample in Gigabasepairs (Gbp). One Gbp equaled one GB in size for the rhizosphere mock data
type: set to art
fragments_size_mean: Mean size of the fragments. Set to 270.
fragments_size_standard_deviation: SD of the fragment size. Set to 27.

CommunityDesign

distribution_file_paths: Path to each abundanceX.tsv file as a list. ['path/abundance0.tsv', 'path/abundance1.tsv', 'path/abundance2.tsv']
ncbi_taxdump: database containing the hierarchical classification of all known organisms. Set to path/tools/ncbi-taxonomy_20170222.tar.gz
strain_simulation_template: Path to a template.tree for the sgEvolver from the mauve suite. scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
number_of_samples: Must match the number of abundance profiles.

community0

metadata: Path to the metadata.tsv
id_to_genome_file: Path to the genome_to_ids.tsv
id_to_gff_file: Optional, set it to nothing.
genomes_total: Total number of simulated genomes.
genomes_real: Number of genomes used as input.
max_strains_per_otu: Max number of strains drawn from genomes belonging to single OTU.
ratio: Ratio between different communities. Set to 1.
mode: Either replicates or differential should be used for the project.
log_mu=1
log_sigma=2
gauss_mu=1: Used with timeseries, so set to 1.
gauss_sigma=1: Used with timeseries, so set to 1.
view: Set to False as we don't need to see the distribution of genomes.
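
A minimal sketch of how a few of these settings could sit in an INI file, plus a check for the rule that number_of_samples must match the number of abundance profiles. Everything here is illustrative: the values are placeholders, the section name capitalization ([Main]) is an assumption, and `check_sample_count` is my own helper, not part of CAMISIM. The parser accepts both the bracketed list form from the notes and a plain comma-separated value.

```python
import configparser

# Placeholder values; key names follow the notes above.
EXAMPLE = """
[Main]
seed = 42
phase = 0
max_processors = 4
gsa = True
pooled_gsa = True
anonymous = False
compress = 1

[CommunityDesign]
distribution_file_paths = ['path/abundance0.tsv', 'path/abundance1.tsv', 'path/abundance2.tsv']
number_of_samples = 3
"""

def check_sample_count(config):
    """Return True when number_of_samples equals the number of abundance files."""
    design = config["CommunityDesign"]
    raw = design["distribution_file_paths"].strip("[] ")
    paths = [p.strip(" '\"") for p in raw.split(",") if p.strip()]
    return len(paths) == design.getint("number_of_samples")

config = configparser.ConfigParser()
config.read_string(EXAMPLE)
print(check_sample_count(config))  # True
```

Running a check like this before starting a simulation catches a mismatch early instead of partway through a long run.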

ndreey / ghost-magnet

Grasping CAMISIM #13