smirarab / binning

Code for statistical binning and related scripts
6 stars 4 forks source link

Statistical binning refers to the method that we introduced in the following paper

The code provided here creates the supergene alignmetns (e.g. bins) given a set of bootstrapped gene trees, one per each gene.

Requirements

The current pipeline has been developed and used only on Linux. It should run on all nix platforms. The current pipeline uses a Condor cluster for running jobs on distributed systems in parallel. However, changing the pipeline to work with other clusters should be relatively straight-forward. Also, if you don't have a cluster, you can use the pipeline to build command files that can be run in serial.

The following are required.

Installation:

After installing the requirements, unzip the code directory to a location of choice and define BINNING_HOME environmental variable to the location where the files are extracted.

USAGE:

The assumption of the pipeline is that you have all your initial gene trees with support values drawn on them in a certain directory (we will call it genes_dir). Inside genes_dir, you need to have one folder per gene. Inside that folder, you need to have the alignments and the gene tree with support values. The name of each gene (gene_name) is just the name of its directory. The name of the alignment file should be gene_name.fasta. So, if your gene is called 300, your alignment should be called 300.fasta. The tree file can be named anything, However, all the trees should have the same exact name. So, for example, the tree for gene 300 could be under directory genes_dir/300/RAxML_bootstrap.final, and if so, the tree for gene 301 should be under genes_dir/301/RAxML_bootstrap.final. We will refer to the name of your tree file as tree_file_name.

Once you have your initial gene trees in this structure, to get the bins, you need to:

Step 1:

   $BINNING_HOME/makecondor.compatibility.sh [genes_dir] [support_threshold (e.g. 50)] [pairwise_output_dir] [tree_file_name]

This generates a condor file (condor.compat.[support]) that has all the commands required for calculating all pairwise compatibilities.

   $BINNING_HOME/makecommands.compatibility.sh [genes_dir] [support_threshold (e.g. 50)] [pairwise_output_dir] [tree_file_name]

This creates a bash file that includes all the commands that need to be run.

Step 2:

condor_submit condor.compat.[threshold]

Step 3:

Once your runs from previous step have finished, it is time to build the bin definitions. Run (be sure to replace 50 with support threshold you used in step 1):

   cd [pairwise_output_dir]; 
   ls| grep -v ge|sed -e "s/.50$//g" > genes   
   python $BINNING_HOME/cluster_genetrees.py genes 50

Once this finished, you have all your bins defined in text files called bin.0.txt, bin.1.txt, etc inside the [pairwise_output_dir] directory. You can look at these and examine bin sizes if you want (e.g. run wc -l [pairwise_output_dir]/bin.*.txt).

Step 4:

Now is time to actually concatenate all the gene alignments for each bin and to create the supergene alignments. Go to the place where you want to have your supergene alignments saved, and run:

   $BINNING_HOME/build.supergene.alignments.sh [pairwise_output_dir] [genes_dir] [supergenes_output_directory]

This will create the directory given with [supergenes_output_directory] and will put all the supergene alignments in there. For each supergene, it will also create a .part file that tells you what genes are put in the supergene alignment, and the boundary between those genes.
Note: For bins of size 1, if they exist, a warning is issued. This warning should be taken into accont. You already have a gene tree on these singelton bins, and therefore you might just want to make symlinks instead of rerunning the gene tree estimation step. Look at $BINNING_HOME/build.supergene.alignments.sh for commented lines that can be adjusted to create symlinks for your file structure.

Step 5:

Now you can use your favorite tree estimation software to estimate gene trees for each of these supergenes. The .part files can be used for a strongly recommended partitioned analysis (e.g. using RAxML).

Step 6:

Run your supergenes through your favorite summary method (e.g. ASTRAL, MP-EST, etc.). For doing this, you might want to 1) use MLBS, and 2) weight each bin by the number of genes put in that bin. See notes below for both of these extra steps.


Notes:

Acknowledgment

In our pipeline, we use few scripts developed by others:

  1. Our graph coloring code is a modification of a Java code by Shalin Shah.
  2. Our compatibility code is based on an older versions of phylonet with some reverse-engineering and code modifications to increase speed and fix bugs.