smirarab / ASTRAL

Accurate Species TRee ALgorithm
Apache License 2.0

New Version:

Note: A new version of this software, implemented in C and using a different algorithm, is now available at https://github.com/chaoszhang/ASTER. This updated version has:

We encourage using the new code.

The main features of the Java version (this repository) are available in the C code linked above. Some of the less commonly used features may not be available. Please feel free to request new features by opening issues on the ASTER GitHub repository.

DESCRIPTION:

ASTRAL is a tool for estimating an unrooted species tree given a set of unrooted gene trees. ASTRAL is statistically consistent under the multi-species coalescent model (and thus is useful for handling incomplete lineage sorting, i.e., ILS). ASTRAL finds the species tree that has the maximum number of shared induced quartet trees with the set of gene trees, subject to the constraint that the set of bipartitions in the species tree comes from a predefined set of bipartitions. This predefined set is empirically decided by ASTRAL (but see tutorial on how to expand it). The current code corresponds to ASTRAL-III (see below for the publication).
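To make the optimization criterion concrete, the following is a brute-force sketch of the quartet score: the number of (quartet, gene tree) pairs on which a candidate species tree and a gene tree induce the same quartet topology. It is an illustration only and not ASTRAL's implementation. The nested-tuple tree representation and the function names `clades`, `quartet_topology`, and `quartet_score` are assumptions made for this example.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of internal-node leaf sets) for a nested-tuple tree.

    A leaf is a string; an internal node is a tuple of subtrees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def quartet_topology(clade_sets, quartet):
    """Return the split (e.g. {A,B} | {C,D}) the tree induces on the quartet.

    Returns None if the tree leaves the quartet unresolved.
    """
    for clade in clade_sets:
        side = clade & quartet
        if len(side) == 2:
            return frozenset([side, quartet - side])
    return None


def quartet_score(species_tree, gene_trees):
    """Count (quartet, gene tree) pairs that agree with the species tree."""
    taxa, species_clades = clades(species_tree)
    gene_info = [clades(gene_tree) for gene_tree in gene_trees]
    score = 0
    for names in combinations(sorted(taxa), 4):
        quartet = frozenset(names)
        species_split = quartet_topology(species_clades, quartet)
        if species_split is None:
            continue
        for gene_leaves, gene_clades in gene_info:
            # Skip gene trees that are missing any of the four taxa.
            if quartet <= gene_leaves and quartet_topology(gene_clades, quartet) == species_split:
                score += 1
    return score
```

This enumeration grows with the fourth power of the number of taxa, so it is only practical for very small examples.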

The algorithm was originally designed by Tandy Warnow and Siavash Mirarab. ASTRAL-III incorporates many ideas by Chao Zhang and Maryam Rabiee. The main code developers are Siavash Mirarab, Chao Zhang, Maryam Rabiee, and Erfan Sayyari.

Bug Reports:

Contact astral-users@googlegroups.com or post on the ASTRAL issues page.

Other branches

NOTE: Several new features of ASTRAL are not merged in this branch and are available in other branches or git pages. Please use those branches if you find these features useful.

Publications

Papers on the current version:

Papers on older versions:

Papers with relevance to ASTRAL:

These papers do not describe features in ASTRAL, but they are relevant and we encourage you to read them:

  1. ASTRAL-Pro: This paper extends the ASTRAL methodology to multiple-copy genes.
    • Zhang, Chao, Celine Scornavacca, Erin K Molloy, and Siavash Mirarab. “ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy.” Edited by Jeffrey Thorne. Molecular Biology and Evolution, September 4, 2020, msaa139. https://doi.org/10.1093/molbev/msaa139.
  2. ASTRAL-constrained: This paper shows how to impose user-defined constraints on ASTRAL.
    • Rabiee, Maryam, and Siavash Mirarab. “Forcing External Constraints on Tree Inference Using ASTRAL.” BMC Genomics 21, no. S2 (April 16, 2020): 218. https://doi.org/10.1186/s12864-020-6607-z.
  3. DiscoVista: This paper shows how quartet scores (more broadly, genome discordance) can be visualized in interpretable ways. The visualization of quartet scores, in particular, is closely tied to the ASTRAL method.
    • Sayyari, Erfan, James B. Whitfield, and Siavash Mirarab. 2018. “DiscoVista: Interpretable Visualizations of Gene Tree Discordance.” Molecular Phylogenetics and Evolution 122 (May): 110–15. doi:10.1016/j.ympev.2018.01.019.
  4. Fragmentary data: This paper made the case that fragmentary data (e.g., sequences with uncharacteristically large numbers of gaps) should be removed before inferring gene trees. It also showed that RAxML gene trees are preferable to FastTree trees.
    • Sayyari, Erfan, James B Whitfield, and Siavash Mirarab. 2017. “Fragmentary Gene Sequences Negatively Impact Gene Tree and Species Tree Reconstruction.” Molecular Biology and Evolution 34 (12): 3279–91. doi:10.1093/molbev/msx261.
  5. Missing data: This paper showed that excluding genes because they have missing data is often detrimental to accuracy.
    • Molloy, Erin K., and Tandy Warnow. 2018. “To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods.” Systematic Biology 67 (2): 285–303. doi:10.1093/sysbio/syx077.
  6. TreeShrink: This paper introduced a method for removing very long branches from gene trees in a statistically motivated way. These branches make gene trees less accurate.
    • Mai, Uyen, and Siavash Mirarab. 2018. “TreeShrink: Fast and Accurate Detection of Outlier Long Branches in Collections of Phylogenetic Trees.” BMC Genomics 19 (S5): 272. doi:10.1186/s12864-018-4620-2.
  7. Sample Complexity: This paper established the theoretical sample complexity (i.e., number of required genes) for ASTRAL.
    • Shekhar, Shubhanshu, Sebastien Roch, and Siavash Mirarab. 2018. “Species Tree Estimation Using ASTRAL: How Many Genes Are Enough?” IEEE/ACM Transactions on Computational Biology and Bioinformatics 15 (5): 1738–47. doi:10.1109/TCBB.2017.2757930.
  8. INSTRAL: This paper introduces an ASTRAL-based algorithm for adding new species onto an existing species tree, i.e., the phylogenetic placement problem for species trees.
    • Rabiee, Maryam, and Siavash Mirarab. 2018. “INSTRAL: Discordance-Aware Phylogenetic Placement Using Quartet Scores.” BioRxiv 432906. doi:10.1101/432906.
  9. BestML: This paper was published before ASTRAL but showed that using best ML gene trees is often preferable to using the consensus of running summary methods on bootstrapped gene trees.
    • Mirarab, Siavash, Md Shamsuzzoha Bayzid, and Tandy Warnow. 2016. “Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting.” Systematic Biology 65 (3). Oxford University Press: 366–80. doi:10.1093/sysbio/syu063.

Documentation

INSTALLATION:

EXECUTION:

ASTRAL currently has no GUI. You need to run it from the command line. In a terminal, go to the location where you have downloaded the software, and issue the following command:

  java -jar astral.5.7.8.jar

This will give you a list of options available in ASTRAL.

To find the species tree given a set of gene trees in a file called in.tree, use:

java -jar astral.5.7.8.jar -i in.tree

The results are written to standard output. To save the results to a file, use the -o option (strongly recommended):

java -jar astral.5.7.8.jar -i in.tree -o out.tre

To save the logs (also recommended), run:

java -jar astral.5.7.8.jar -i in.tree -o out.tre 2>out.log
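
When ASTRAL is called from a script rather than typed by hand, the same command can be assembled programmatically. The following is a minimal Python sketch that uses only the `-i` and `-o` options shown above and sends standard error to a log file, mirroring `2>out.log`. The function names `astral_command` and `run_astral` are hypothetical helpers for this example, and the sketch assumes `java` is on the PATH and the jar is in the working directory.

```python
import subprocess


def astral_command(jar, gene_trees, output):
    """Build the argument list for: java -jar <jar> -i <gene_trees> -o <output>."""
    return ["java", "-jar", jar, "-i", gene_trees, "-o", output]


def run_astral(jar, gene_trees, output, log):
    """Run ASTRAL and write its log (standard error) to the given file."""
    with open(log, "w") as log_file:
        # check=True raises CalledProcessError if ASTRAL exits with an error.
        return subprocess.run(
            astral_command(jar, gene_trees, output),
            stderr=log_file,
            check=True,
        )


# Example call (not executed here):
# run_astral("astral.5.7.8.jar", "in.tree", "out.tre", "out.log")
```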

Input:

species_name [number of individuals] individual_1 individual_2 ...

species_name:individual_1,individual_2,...

Note that when multiple individuals exist for the same species, your species name should be different from the individual names.
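
Lines in the second format shown above (`species_name:individual_1,individual_2,...`) can be generated from a dictionary. The following is a small Python sketch; the function name `mapping_lines` and the example names are hypothetical. It also enforces the note above that a species name must differ from the names of its individuals when a species has several individuals.

```python
def mapping_lines(species_to_individuals):
    """Format one 'species:ind_1,ind_2,...' line per species."""
    lines = []
    for species, individuals in species_to_individuals.items():
        if len(individuals) > 1 and species in individuals:
            raise ValueError(
                f"species name {species!r} must differ from its individual names"
            )
        lines.append(f"{species}:{','.join(individuals)}")
    return lines


# Example with made-up names:
# mapping_lines({"sp1": ["sp1_a", "sp1_b"], "sp2": ["sp2_a"]})
```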

Output:

The output is in Newick format and gives:

The ASTRAL tree leaves the branch length of terminal branches empty. Some tools for visualization and tree editing do not like this (e.g., ape). In FigTree, if you open the tree several times, it eventually opens up (at least on our machines). In ape, if you ask it to ignore branch lengths altogether, it works. In general, if your tool does not like the lack of terminal branch lengths, you can add a dummy branch length, as in this script.
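
As an illustration of adding a dummy terminal branch length, here is a minimal Python sketch (not the script referred to above). It appends a length to every leaf label that lacks one. The function name `add_dummy_lengths` and the default value `1.0` are arbitrary assumptions, and the regular expression assumes unquoted labels that contain no parentheses, commas, colons, or semicolons.

```python
import re

# A leaf label follows "(" or "," and is directly followed by "," or ")",
# which means it carries no ":length" of its own. Internal node labels
# follow ")" and are therefore left untouched.
LEAF_WITHOUT_LENGTH = re.compile(r"(?<=[(,])([^(),:;]+)(?=[,)])")


def add_dummy_lengths(newick, length="1.0"):
    """Append ':<length>' to every terminal label that has no branch length."""
    return LEAF_WITHOUT_LENGTH.sub(lambda m: f"{m.group(1)}:{length}", newick)
```

For example, `add_dummy_lengths("((A,B)0.98:0.5,(C,D)1:1.2,E);")` returns `((A:1.0,B:1.0)0.98:0.5,(C:1.0,D:1.0)1:1.2,E:1.0);`.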

Other features (local posterior, bootstrapping):

Please refer to the tutorial for all other features, including bootstrapping, branch annotation, and local posterior probability.

Memory:

For big datasets (say more than 1000 taxa), increasing the memory available to Java can result in speedups. Note that you should give Java only as much free memory as you have available on your machine. So, for example, if you have 8GB of free memory, you can invoke ASTRAL using the following command to make all the 8GB available to Java:

java -Xmx8000M -jar astral.5.7.8.jar -i in.tree
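
The heap option can also be derived from the amount of free memory. The sketch below follows the convention of the example above (8GB is written as `8000M`). The function name `heap_flag` is a hypothetical helper, and the free-memory figure is supplied by the caller rather than detected.

```python
def heap_flag(free_gb):
    """Return the Java heap option for the given free memory, e.g. 8 -> '-Xmx8000M'."""
    return f"-Xmx{int(free_gb * 1000)}M"


def astral_command_with_heap(free_gb, jar, gene_trees):
    """Build: java -Xmx<N>M -jar <jar> -i <gene_trees>."""
    return ["java", heap_flag(free_gb), "-jar", jar, "-i", gene_trees]
```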

Acknowledgment

The ASTRAL code uses bytecode and some reverse-engineered code from the PhyloNet package (with permission from the authors). Code is contributed by Siavash Mirarab, Maryam Rabiee, Chao Zhang, Erfan Sayyari, and John Yin.