This repository contains the code associated with the manuscript
Neher, Russell, Shraiman: "Predicting evolution from the shape of genealogical trees", accepted for publication in eLife.
The directory prediction_src contains the code base used for the fitness inference and prediction algorithms as well as classes to hold sequence data and trees adapted.
The directory flu contains the code specific to our analysis of historical influenza data, scripts that generate the figures, the influenza sequences and annotation, analysis results and figure files.
The directory toy_data contains the code to simulate adapting populations building on the FFPopSim library. In addition, it contains scripts to analyze this simulated data, the data itself and the resulting figures.
The script rank_sequences.py is a simple wrapper for the prediction tool that takes a multiple sequence alignment (MSA) and the name of the outgroup as input; the outgroup must be one of the sequences in the MSA. It produces a folder containing a ranking of sequences, the inferred ancestral sequences, the reconstructed tree, and optionally a PDF of the marked-up tree. This script ranks sequences by the local branching index (LBI) rather than by the full fitness inference.
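To illustrate what the LBI measures, here is a minimal sketch, not the implementation in prediction_src. It assumes the LBI of a node is the total tree length surrounding that node, with each piece of branch down-weighted by exp(-distance/tau), and computes it with one upward and one downward pass over the tree. The Node class and the function name local_branching_index are hypothetical, node names are assumed to be unique, and tau is given here in absolute branch-length units, whereas the script's --tau is relative to the average pairwise distance.

```python
import math


class Node:
    """Minimal tree node; branch_length is the length of the branch to the parent."""

    def __init__(self, name, branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []


def local_branching_index(root, tau):
    """Return {node name: LBI}, the tree length around each node
    discounted by exp(-distance / tau)."""
    up, lbi = {}, {}

    def branch_term(node):
        # decay: discount applied to everything beyond this branch
        # own:   integral of exp(-x / tau) along the branch itself
        decay = math.exp(-node.branch_length / tau)
        return decay, tau * (1.0 - decay)

    def pass_up(node):
        # Discounted tree length in the subtree of `node`, seen from its parent.
        below = sum(pass_up(child) for child in node.children)
        decay, own = branch_term(node)
        up[node.name] = own + decay * below
        return up[node.name]

    def pass_down(node, from_above):
        # from_above: discounted length of the rest of the tree, seen from `node`.
        total_up = sum(up[child.name] for child in node.children)
        lbi[node.name] = from_above + total_up
        for child in node.children:
            decay, own = branch_term(child)
            rest = from_above + total_up - up[child.name]
            pass_down(child, own + decay * rest)

    pass_up(root)
    pass_down(root, 0.0)
    return lbi


# Toy tree: a root with two leaves on branches of length 1.
toy = Node("root", 0.0, [Node("A", 1.0), Node("B", 1.0)])
scores = local_branching_index(toy, tau=1.0)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes in densely branching parts of the tree receive a high LBI. For very large tau the discounting vanishes and every node's value approaches the total tree length. The recursion is adequate for small trees; a large tree would need an iterative traversal.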
Built-in help and optional arguments:

    ./rank_sequences.py --help
    usage: rank_sequences.py [-h] --aln ALN --outgroup OUTGROUP
                             [--eps_branch EPS_BRANCH] [--tau TAU]
                             [--collapse [COLLAPSE]] [--plot [PLOT]]

    rank sequences in a multiple sequence aligment

    optional arguments:
      -h, --help            show this help message and exit
      --aln ALN             alignment of sequences to by ranked
      --outgroup OUTGROUP   name of outgroup sequence
      --eps_branch EPS_BRANCH
                            minimal branch length for inference
      --tau TAU             time scale for local tree length estimation (relative
                            to average pairwise distance)
      --collapse [COLLAPSE]
                            collapse internal branches with identical sequences
      --plot [PLOT]         plot trees

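Because the outgroup has to be present in the alignment, it can help to check this before launching the script. The following is a rough sketch under two assumptions that the text above does not confirm: that the alignment is a FASTA file, and that sequence names are the first whitespace-delimited token of each header line. The helper names fasta_names and rank_command are hypothetical; only the flags --aln, --outgroup and --tau are taken from the help text.

```python
def fasta_names(path):
    """Return the sequence names from the '>' header lines of a FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the name is the first token of the header.
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def rank_command(aln, outgroup, tau=None):
    """Build the argument list for rank_sequences.py after checking the outgroup."""
    if outgroup not in fasta_names(aln):
        raise ValueError("outgroup %r not found in %s" % (outgroup, aln))
    cmd = ["./rank_sequences.py", "--aln", aln, "--outgroup", outgroup]
    if tau is not None:
        cmd += ["--tau", str(tau)]
    return cmd
```

The returned list can be passed to subprocess.call from the repository directory. The function only assembles the command and does not run the script itself.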
The script infer_fitness.py also takes an alignment and an outgroup as arguments, but uses the full fitness inference to rank sequences and to calculate the mean and variance of the posterior. Note that plausible posterior distributions require that the parameter omega is well chosen. Also, the time conversion factor gamma might need to differ from gamma=1 for optimal results.

    ./infer_fitness.py --help
    usage: infer_fitness.py [-h] --aln ALN --outgroup OUTGROUP
                            [--eps_branch EPS_BRANCH] [--diffusion DIFFUSION]
                            [--gamma GAMMA] [--omega OMEGA]
                            [--collapse [COLLAPSE]] [--plot [PLOT]]

    rank sequences in a multiple sequence aligment

    optional arguments:
      -h, --help            show this help message and exit
      --aln ALN             alignment of sequences to by ranked
      --outgroup OUTGROUP   name of outgroup sequence
      --eps_branch EPS_BRANCH
                            minimal branch length for inference
      --diffusion DIFFUSION
                            fitness diffusion coefficient
      --gamma GAMMA         scale factor for time scale, choose high (>2) for
                            prediction, 1 for fitness inference
      --omega OMEGA         approximate sampling fraction diveded by the fitness
                            standard deviation
      --collapse [COLLAPSE]
                            collapse internal branches with identical sequences
      --plot [PLOT]         plot trees