vlanore / diffsel

A C++ program to detect convergent evolution using Bayesian phylogenetic codon models.
Other
6 stars 2 forks source link

diffsel

Build Status codecov licence CeCILL

How to download and build

To get diffsel from a machine connected to the internet, type in a terminal:

git clone https://github.com/vlanore/diffsel.git

This should create a folder called diffsel (the diffsel root folder). You must go there before compiling diffsel:

cd diffsel

Then, to build diffsel simply run:

make

To check that everything ran well, look into the _build folder to see if executables are present:

ls _build

This should display a list of files. If you see the following files, then diffsel was correctly built:

How to format your data

Diffsel requires a tree and an alignment file to run.

Your alignment file must follow the Phylip format. In addition, the number of bases should be a multiple of 3 (as it will be interpreted as codons). For example, the following file would be a valid alignment for diffsel:

8 6
S0      TCCTGA
S1      AATAGT
S2      GGATTT
S3      AATTCA
S4      CGAAGG
S5      AACGCT
S6      ACGAGT
S7      AATATT

Your tree file must follow the newick format. In addition, nodes should be labelled with "conditions". Conditions are integer labels comprised between 0 and P-1 where P is the total number of conditions. The tree does not need to have branch lenghts. In addition, the leaves of the tree should have the same names as the sequences in your alignment file. For example, the following file would be a valid tree file for diffsel matching the alignment file above (with P=2):

((((((((S0:0,S1:1):0,(S2:0,S3:1):0):0,(S4:0,S5:1):0,(S6:0,S7:1):0):0):0,(S8:0,S9:1):0,(S10:0,S11:1):0):0,(S12:0,S13:1):0,(S14:0,S15:1):0):0):0):0

The datà folder in the diffsel root folder contains examples of data files usable with diffsel.

Remarks about conditions

Conditions correspond to known phenotypes in the input data. What diffsel does is look for differential fitness effects associated with these conditions.

For example, the cyperaceae dataset from Besnard et al. 2009 (available here: https://gitlab.in2p3.fr/pveber/reviewphiltrans/tree/master/example/real_data/besnard2009) has two conditions:

In most cases, you should follow this two-condition approach if you want to use diffsel with your own data. In the original diffsel paper (Parto et al. 2018) three conditions were used:

It seems that in some cases this three-condition approach can give better results but our testing indicates that this slighlty more complex approach is not necessary in most cases.

How to run diffsel

Basic usage for diffsel is (from the diffsel root folder):

_build/diffsel -t <path to tree file> -d <path to alignment file> [options...] <name of run>

The name of the run is a string which is used to create files to store the results of the run. Note that the name of the run must be at the very end of the command.

Useful options include:

For example, to run a diffsel chain called "myrun" on example data (from the data folder), with 3 conditions until 10k points have been written to disc (from the diffsel root folder):

_build/diffsel -t data/samhd1.tree -d data/samhd1.ali -ncond 3 -x 1 10000 myrun

Check MCMC convergence

To ensure that analysis results are correct, one should check that the MCMC chain has converged. You can do that either by using your MCMC convergence tool of choice using the <name of run>.trace files, or by using the provided tracecomp program.

Tracecomp requires two runs with identical parameters. You can invoke it by typing _build/tracecomp -x <burn-in> <name of run 1> <name of run2> from the diffsel root folder (assuming you ran your analysis from the diffsel root folder). If you don't know what burn-in to specify, 20% of the total number of iterations should be a decent default value.

For example, with the example analysis command above, assuming a second identical run named myrun2 was done, the command could be (with a burn-in at 20% of 10000 iterations):

_build/tracecomp -x 2000 myrun1 myrun2

and it would output something like this (example output from a very small run):

setting upper limit to : 11
tmp.trace    burnin : 5 sample size : 6
tmp2.trace   burnin : 5 sample size : 6
name                effsize rel_diff

#logprior           6       1.83775
lnL                 4       1.03644
length              5       1.25434
globent             5       0.632021
selvar1             6       1.11341
statent             5       1.03834
rrent               5       0.550811
diag                6       0

If all lines have a large effsize (>300) and a small rel_diff (<0.1) then your chain has converged well. In the above example, the chain has not converged (but it only had 11 iterations). For test runs, these conditions can be slightly relaxed while still giving decently reliable results (e.g., effsize>100 and rel_diff<0.15).

Important note: there is a known bug in tracecomp where it outputs wrong results when no burn-in is specified via command line. For now, please always use the -x <burn-in> option to manually specify a burn-in.

How to run post-analysis after a diffsel run

When it runs, diffsel produces a series of files with raw data about the run. These files are named <name of run>.<something>, for example myrun.run, myrun.param, myrun.chain, etc...

The diffsel repository contains an executable, called readdiffsel, which can analyze these raw files to see if convergent sites were detected. To run this exectuable, run the command _build/readdiffsel <name of run> from the diffsel root folder (assuming you ran your analysis from the diffsel root folder). For example, if your analysis was run using the example command above, the command for post-analysis is:

_build/readdiffsel myrun

This command produces two files per condition after the first. For example, in the myrun example with 3 conditions, it should have produced the files myrun_1.meandiffsel, myrun_1.signdiffsel, myrun_2.meandiffsel and myrun_2.signdiffsel.

Among the produced files, the interesting one is the *.signdiffsel file for the convergent condition. It contains the list of detected convergent effects. It looks like this:

0       17      0.909091        1.19692
1       3       0.0909091       -0.549043
1       10      0       -0.992412
1       12      0.909091        0.493234
1       17      1       1.29196
2       10      0.909091        1.19423

The columns are: site number, aminoacid number, "convergence probability", and effect strength. Effects are already filtered to keep only those with probability above 0.9 or below 0.1, i.e., significant effects.

"Convergence probability" measures how consistently the corresponding aminoacid has a fitness that is better than the other aminoacids at the same position. See the original diffsel paper for more details.

If the *.signdiffsel file is empty, it means that no effects were detected. You can lower the detection threshold in the src/ReadDiffsel.cpp file. Just change the line that says double cutoff = 0.9; to another value, recompile and rerun readdiffsel.

Another way to use diffsel: convergence pipeline

If you want to use diffsel but don't need to see the low-level MCMC-related files, you might prefer using our convergence detection pipeline. This pipeline is capable of running diffsel, along with other convergence detection methods, and produces a standardized output for all methods (a score between 0 and 1 for each method and for each site). This output might be simpler to understand and use than the post-analysis described above.

The pipeline and instructions are located at: https://gitlab.in2p3.fr/pveber/reviewphiltrans