metaGmetapop / metapop

A pipeline for the macro- and micro-diversity analyses and visualization of metagenomic-derived populations
MIT License
38 stars 10 forks source link

MetaPop

A pipeline for the macro- and micro-diversity analyses and visualization of metagenomic-derived populations

Please email gregory.392(at)osu.edu and kenji.gerhardt(at)gmail.com for issues.

2023: We're currently doing a major overhaul of the pipeline into Nextflow, so updates on this version have been stalled

Description

MetaPop is a pipeline designed to facilitate the processing of sets of short read data mapped to reference genomes with the twin aims of calculating sample-level diversity metrics such as abundance, population diversity, and similarity across multiple samples, and assessing within-species diversity through the assessment of nucleotide polymorphisms and amino acid substitutions. To further facilitate understanding, the pipeline also produces graphical summaries of its results.

Although it can be run in a single command, MetaPop is divided into 4 modules: (1) Preprocessing, (2) Microdiversity, (3) Macrodiversity, and (4) Visualization.

The preprocessing module is mandatory for the other modules, but also independently allows for a user to filter their reads for length, quality of alignment as measured through percent identity, and to calculate the depth and breadth of coverage on genomes within each sample. More details on these processes are available in section 4 of this document.

The macrodiversity module will calculate normalized abundances of species across all of the samples in a dataset and will calculate and visualize the results of a wide range of diversity indices. Included indices are available in section 4 of this document.

The microdiversity module identifies single nucleotide polymorphism (SNP) loci and performs quality assessments of these loci to ensure their accuracy. Where variant loci occur on predicted genes, MetaPop assigns the loci to all genes on which they fall, and updates the base calls of both the original reference genomes and gene calls to an all-sample consensus at each variant locus. Next, codons containing multiple SNP loci are assessed for co-occurrence to determine amino acid behavior more accurately, with more detail available in section 4. Finally, the result of these preparatory steps are used to calculate within-genome diversity measures per sample, details of which are available in section 4.

The visualization module contains a component designed to summarize preprocessing and a component designed to visualize microdiversity results. The preprocessing component includes a report of the counts of reads retained and removed in each sample, and a per-sample assessment of the depth and breadth of coverage of each genome within the sample. The microdiversity component contains plots of the depth of coverage over the length of each genome, nucleotide diversity statistics measured on each gene, and interpretations of selective pressure for each genome of each sample. It also produces a summary of codon usage bias observed in each genome to provide a starting point for assessing horizontally transferred genes.

Installation

MetaPop is a combination of python and R scripts which makes additional calls to a variety of external tools and command line utilities. It is expected to be run from within a unix environment. A unix environment can be accessed through a natively unix system, a virtual machine, or an appropriate platform such as the windows ubuntu terminal.

In order to run MetaPop, a user will require the following software:
Python 3.6+
R
Samtools 1.9
BCFTools
Prodigal

The following Python packages are required:
Pysam
Numpy

Additionally, the following R packages are required:
doParallel
data.table
stringr
ggplot2
ggrepel
cowplot
vegan
compositions
pheatmap
RColorBrewer
bit64
gggenes

MetaPop has been packaged with conda and pip. To use Metapop, we recommend first creating a conda environment. To set up a conda environment, the user must first install anaconda or miniconda (https://docs.anaconda.com/anaconda/install/) on your unix system. Then the user can create a conda environment.

conda create --name metapop python=3.7
conda activate metapop
conda install -y -c conda-forge mamba
mamba install -y bcftools samtools prodigal numpy pysam r-ggrepel r-base r-data.table r-ggplot2 r-rcolorbrewer r-doparallel r-cowplot r-bit64 r-gggenes r-stringr r-vegan r-compositions r-pheatmap -c bioconda -c conda-forge -c r
pip install metapop

MetaPop can also be used on iVirus, as part of Cyverse (https://de.cyverse.org/apps/agave/MetaPop-1.0.0)

Usage

MetaPop’s function is contained within a python script designed to be run from the command line. Usage should always start with metapop, with arguments that follow:

metapop --input_samples [BAM file directory] --reference [path to reference genome file] --norm [library counts normalization file] [OPTIONS]

Entering this command on the command line will process the mapped reads in the directory specified by -dir, with the file the reads were mapped against specified by -assem, and a tab-delimited file containing sample names and library read counts with -ct.

Arguments:

Mandatory arguments:
--input_samples : Directory containing mapped read files in BAM format (files do not need to be sorted and indexed)
--reference : Absolute path to assembled contigs
--norm : Absolute path to counts normalization file in tsv format (only mandatory if performing macrodiversity calculations)

To convert SAM to BAM files, you can use the following command once you activate your conda environment:
samtools view -S -b sample.sam > sample.bam

ct file example
(EXCLUDE .bam suffix & table headers below)
BAM File Names
Number Reads or Bps in Read Library
readlibrary1 52000000
readlibrary2 43000000
readlibrary3 73000000

Program options:
--preprocess_only : Flag indicating to filter reads for %ID, length, and depth of coverage and stop.
--no_micro --no_macro : 2 Flags indicating to only perform preprocess
--micro_only : Flag indicating to perform only macrodiversity calculations. Assumes preprocess has been done.
--macro_only : Flag indicating to perform only microdiversity calculations. Assumes preprocess has been done.
--viz_only : Flag indicating to only produce visualizations. Assumes preprocess has been done.
--no_micro : Flag indicating to skip microdiversity and only perform preprocess and macrodiversity
--no_macro: Flag indicating to skip macrodiversity and only perform preprocess and microdiversity
--no_viz : Flag indicating to not attempt to visualize results.
--threads INT : Number of threads to parallelize processes for. Default 4.

Preprocessing Arguments:
--genes : absolute path to prodigal FASTA format gene file for assembled contigs. May be left absent if you want metapop to generate this file. Please note if you supply your own gene file, the header for each gene in the fasta format neeeds to be in prodigal format.

            Prodigal FASTA header format example:
            >contig1234_1 # 2 # 181 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.628
            >contig name_gene number # start position # end position # template(1) or antisense(-1) strand # NA

--id_min INT : reads below this percent identity (mismatch/alignment length) are removed. Use -global to calculate as (mismatch/read length). Default 95.
--min_len INT : reads with alignments shorter than this are removed. Default 30.
--min_cov INT : contigs with breadth of coverage (#bases covered/contig length) less than this are removed from microdiversity. Default 20.
--min_dep INT : contigs with truncated average depth of coverage (mean of the 10th - 90th percentile depths of coverage) less than this are removed from microdiversity. Default 10.
--trunc INT : sets the percentiles at which depths of coverage will be truncated for depth. Default 10.
--global FLAG : Flag indicating to calculate percent identity using read length instead of alignment length.

Variant Calling Arguments:
--ref_sample STRING : prefix of sample to be used as the SNP reference point for all other samples. A prefix is the set of characters before .bam extension in a sample, e.g. a_file.bam has prefix a_file.
--min_obs INT : Minimum number of observations of a variant allele at a particular locus for it to be called a SNP. Default 4
--min_pct INT : Minimum percent of the population a variant allele at a particular locus must represent for it to be called a SNP. Default 1
--min_qual INT : Minimum PHRED score for a base to be used in initial variant calling. Default 20.

Microdiversity Arguments:
--subsample_size INT : SNP loci will be subsampled proportionallydown to this depth of coverage for microdiversity calculations. Default 10.

Macrodiversity Arguments:
--subsample_size FLAG : Assumes that each contig supplied is a complete bacterial genome. Lowers the threshold of detection from 70% coverage to 20% coverage.
--minimum_bases_for_detection INT : Minimum number of positions required to be covered for a contig to be considered detected. Default 5000.

Visualization arguments:
--plot_all FLAG : Metapop will print all contigs from microdiversity results. This will likely take a long time. Default prints top 3 genomes by highest % of genes under positive selection in each sample.
--snp_scale [local, global, both] : Metapop will print microdiversity results using SNPs assessed at the local (per sample) or global (across all samples) levels, or both. Defaults to local.

Test Data

A test dataset and the expected results are available on iVirus for download: https://datacommons.cyverse.org/browse/iplant/home/shared/iVirus/MetaPop_Testdata

Function Details

Several sections of MetaPop’s code perform functions which are difficult to briefly describe. In order to organize this README better, more complete descriptions of complex function behavior and their relevant user arguments are placed here instead of with their briefer descriptions in other sections.

Preprocessing Module:
The preprocessing component of MetaPop is aimed at filtering datasets for high quality reads, then identifying the genomes which are well-covered using only these reads.

The first filter applied is for alignment length. Reads which themselves are short, or which align only over small portions of their length are more likely to be spurious. MetaPop removes reads with alignments shorter than 30 bp by default, which can be modified with the -min option.

Percent identity is then calculated on the remaining reads as the number of base pairs within a read which match the reference genome divided by the length of the alignment (or the entire length of the read if the -global flag is set). By default, reads mapping at less than 95% identity will be removed, but this value can be changed with the -id argument.

Depth and breadth of coverage per genome are then calculated using only reads passing these initial two filters. Breadth of coverage is simply the percent bases in a genome covered by at least 1 read. Depth, however, is calculated as Truncated Average Depth (TAD) which is more complex.

To calculate TAD, depth is counted for every position in the genome (including zeroes), and then these counts are sorted. The bottom and top 10% of depths are removed (e.g. in a 100 bp genome, the lowest 10 and highest 10 depths would be removed) and the average is taken over the middle 80%. This approach provides for robustness against spurious read recruitment to highly conserved regions of the genome and toward genomes with low breadth of coverage. The quantiles at which reads are truncated (default 10) may be set at different values using the -trunc argument.

Macrodiversity Indices:
The macrodiversity module calculates the following indices of sample-level diversity statistics from abundance data created during preprocessing:

Alpha diversity:
-Raw abundance counts
-Abundance counts normalized across samples of different sizes
-Chao diversity index
-ACE index
-Shannon’s H
-Simpson’s index
-Inverse Simpson’s
-Fisher alpha diversity
-Peilou’s J

Beta diversity:
-Bray-Curtis dissimilarity
-Center Log Ratio Euclidean distances
-Jaccard distances

SNP Linking:
During the calling of SNPs, MetaPop assigns SNPs to their respective genes, and notes their positions within each gene in terms of the codons to which they belong and their positions in the codons. While the positions within the codons are used to ensure that gene and SNP calls are behaving as expected and producing more SNPs in the 3rd positions of codons when compared the first or second positions, MetaPop also notes when 2 or more SNPs occur on the same codon.

When two or more SNPs are located on the same codon of the same gene, MetaPop directly examines the reads in each sample to determine the precise behavior of that codon in each sample. Because even short reads are likely to fully contain individual codons, MetaPop is able to determine if the SNPs occurring on the same codon are occurring simultaneously on the same reads (indicating that the variant codon is the composite of these two SNP loci), or whether the SNPs are mutually exclusive (indicating two distinct variants), or whether they occasionally co-occur and occasional occur separately.

The amino acids resulting from the codons examined in this process are retained for the later calculation of the ratio of synonymous and non-synonymous mutations. This renders these calculations reflective of the actual behavior of the variant sites, rather than approximations.

Microdiversity Indices:
The microdiversity module of MetaPop utilizes the SNP calls and linked SNP data generated earlier to quantify nucleotide diversity within each genome and to assess evidence of selective pressures being applied to genes on each genome.

Microdiversity calculations are divided into local and global scales, with the local scale representing the set of variant loci identified independently within each sample, and the global scale representing the set of variant loci identified over a genome across all samples.

Microdiversity indices include:
-Pi nucleotide diversity
-Watterson’s Theta nucleotide diversity
-Tajima’s D
-pN/pS ratio of non-synonymous to synonymous mutations
-Codon usage biases of genes on each genome
-Fixation index (Fst) between samples where the same genome is observed at sufficient depth of coverage