lmcai / Coalescent_simulation_and_gene_flow_detection

R and Python scripts used to simulate gene trees under coalescent model and detect gene flow using triple frequency
GNU General Public License v3.0
10 stars 3 forks source link

Coalescent simulation and gene flow detection

This folder contains scripts used in Cai et al (2019) to simulate gene trees under coalescent model with bootstrap and subsequently detect gene flow using triple frequency in gene trees.

Citation: Liming Cai, Zhenxiang Xi, Emily Moriarty Lemmon, Alan R. Lemmon, Austin Mast, Christopher E. Buddenhagen, Liang Liu, Charles C. Davis. 2019. The Perfect Storm: Gene Tree Estimation Error, Incomplete Lineage Sorting, and Ancient Gene Flow Explain the Most Recalcitrant Ancient Angiosperm Clade, Malpighiales. Systematic Biology, syaa083, https://doi.org/10.1093/sysbio/syaa083

License: GPL https://www.gnu.org/licenses/gpl-3.0.html

Usage contact: daybreak.chua@gmail.com

Dependencies

R 3.1+; R package phybase and phytools.

Python 2 or 3; python library ete3

Input and output

Input:

  1. One rooted best-estimated species tree with branch lengths measured in coalescent units (can be estimated from MPEST/ASTRAL on ML gene trees, newick format);

  2. Rooted bootstrap species trees with branch lengths measured in coalescent units (can be obtained by running MPEST/ASTRAL on bootstrap gene trees, newick format);

  3. Rooted gene trees, newick format.

Common error: Phybase has stringent tree format requirements. These input trees have to be rooted. They should not have node labels (e.g., BP support). All branch must have positive branch length. Please note that the species tree generated by ASTRAL can have 0 in branch length. You can replace 0 with small numbers such as 0.00000001. Otherwise Phybase cannot simulate gene trees based on non-bifurcating species trees.

Output:

Output files will be in the working folder:

  1. geneTr_sim/BP*.sim.genetrees: Simulated gene tree sets under multispecies coalescent model with the same amount of missing data as empirical gene trees;

  2. [prefix].trp.csv: Triple frequencies in the empirical gene trees;

  3. geneTr_sim/BP*.trp.csv: Triple frequencies in the simulated gene trees. This is the expectation of triple frequency distribution under coalescent model accounting for missing data and estimation error;

  4. unbalanced.trp.tsv: Triples with significantly unbalanced minor frequencies; 'significant' = if the differences between empirical minor frequencies is larger than the largest differences found in simulation;

  5. unbalancedTriplet.sum.tre: Species tree with node labels reflecting raw numbers of unbalanced triples that are associated with it.

  6. unbalancedTriplet.perc.tre: Species tree with node labels reflecting percentage of unbalanced triples that are associated with it.

How to

Place your input files in the same folder as these scripts;

Follow three steps in simulator_tripletCounter_tripletMapper.sh to simulate gene trees, summarize triplet frequency distribution, and map unbalanced triplets to the species tree;

sh simulator_tripletCounter_tripletMapper.sh [path_to_species_tree] [path_to_bootstrap_species_trees] [path_to_gene_trees]

Counting triple frequencies is the most time consuming step. It can be run in parallel for bootstrap replicates with the python script triple_frequency_counter.py.

Troubleshoot

If you encountered the following errors:

Loading required package: ape
Loading required package: Matrix
Attaching package: ‘phybase’
The following object is masked from ‘package:ape’:
    node.height
Loading required package: maps
Error in if (z[i]) { : missing value where TRUE/FALSE needed
Calls: read.tree
Execution halted

This is most likely due to the incorrect input tree format in the initial simulation step in phybase. Please make sure your species trees DO NOT have node labels, DO NOT MISS any branch lengths, and DO NOT contain 0 length branches. If some branches are inferred to be 0, you can replace it with a really small value such as 1e-5.

Dissecting the relative importance of gene tree estimation error, ILS, and gene flow

All of the three above mentioned factors can generate gene tree variations. Using a regression model, their relative contribution to the overall gene tree variation can be estimated.

We will use the regression method implemented in the R package relaimpo to assess relative importance in linear models.

The input is a matrix of dependent variable (gene tree variation) and three independent variable (gene tree error, ILS, and gene flow) across all internal nodes. Each row corresponds to these values for one internal node. The more internal nodes you have, the more powerful the regression analysis is. It is advised that there should be at least 10 observations per variable, so we ideally want at least 30 internal nodes.

Input

  1. The dependent variable — Gene tree variation

It quantifies gene tree topological variation across the tree. The easiest way to calculate this value is to calculate gene concordance factor (gCF) using bootstrap gene trees in IQTREE (because it accommodates missing taxa). For exampled, if you have 20 genes, you will have 20 x 100 bootstrap gene trees assuming you conducted 100 replicates. You can combine these 2000 gene trees into one file called BSgenetree.trees and calculate gCF in IQTREE.

iqtree -t species.tre --gcf BSgenetree.trees --prefix concord
  1. Gene tree estimation error

It quantifies the anticipated level of analytical error for each node in the species tree, which is largely affected by the length of the internal branch. Its calculation requires alignment simulation, gene tree estimation, and bipartition summarization. Briefly, using the species tree as the reference tree to simulate alignments using substitution parameters inferred from RAxML or IQTREE. Many tools can be used for simulation. Some timeless ones include SeqGen and bppseqgen. You can set the length of the simulated alignment to be equal to the mean/median of your empirical data. I simulate 100 alignments to achieve a stable estimation per node. Replicate number of 200 or more may have vanishing returns.

Next, we infer the gene trees for all simulated alignments using your favourite tree builder such as RAxML or IQTREE. Then we calculate how often each node in the species tree is recovered in the inferred gene trees sim_gene.trees using the bipartition in RAxML.

raxmlHPC -f b -t species.tre -z sim_gene.trees -m GTRGAMMA -n ERR
  1. ILS

It quantifies the level of incomplete lineage sorting in each node of the species tree. We will directly use the theta estimated for each node. To calculate theta, infer a species tree under the coalescent model first using MPEST or ASTRAL. The branch length is in coalescent unit. Then infer its branch length in mutation unit using RAxML or IQTREE using fixed topology. The value of theta for each node can be estimated by dividing the branch lengths estimated from RAxML/IQTREE (mutation units) by that estimated from MPEST/ASTRAL (coalescent units).

  1. Gene flow

It quantifies the level of potential gene flow for each node. We will use the reticulate index from the unbalancedTriplet.perc.tre directly. These values are sotred in the node labels.

Regression analysis in R

Format the data above into a matrix. Make sure each row corresponds to the same internal node. Use the example script ralaimpo.R to conduct importance decomposition. If you have different column names for variables, just make sure you change them accordingly when running the analysis.

An example input file can be found in example/relative_contribution_ILS_Err_Intro.csv. In this example, I scaled all the variables linearly to 1 to 100. This is unnecessary. You can totally use raw numbers.