An evolutionary context-integrated deep learning framework for protein engineering
ECNet (evolutionary context-integrated neural network) is a deep learning model that guides protein engineering by predicting protein fitness from the sequence. It integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. Please see our Nature Communications paper for details.
Clone the GitHub repository and add its directory to the Python path:
git clone https://github.com/luoyunan/ECNet.git
cd ECNet
export PYTHONPATH=$PWD:$PYTHONPATH
This package is tested with Python 3.7 and CUDA 10.1 on Ubuntu 18.04, with access to an Nvidia GeForce TITAN X GPU (12GB RAM) and Intel Xeon E5-2650 v3 CPU (2.30 GHz, 512G RAM). Please see requirements.txt for necessary python dependencies, all of which can be easily installed with pip or conda. Due to an issue of installing pytorch 1.4.0 with pip, please install pytorch with conda first.
conda install pytorch==1.4.0 cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt
wget https://www.dropbox.com/s/nkgubuwfwiyy0ze/data.tar.gz
tar xf data.tar.gz
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
--train data/RRM_single.tsv \
--fasta data/RRM.fasta \
--local_feature data/RRM.braw \
--output_dir ./output/RRM_CV \
--save_prediction \
--n_ensembles 2 \
--epochs 100
This example typically takes no more than 15 min to run in our tested environment. The output (printed to stdout) is the correlation between the predicted and ground-truth fitness values.
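To make the reported metric concrete, here is a minimal sketch of a rank correlation between predicted and ground-truth fitness values. It assumes Spearman correlation, which is a common choice for fitness prediction; the text above does not say which coefficient run_example.py prints. The function names and the toy values are hypothetical and are not part of ECNet.

```python
from math import sqrt


def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical values, for illustration only (not ECNet output).
truth = [1.0, 2.0, 0.06, 0.5]
pred = [0.9, 1.5, 0.1, 0.7]
print(spearman(truth, pred))  # 1.0, because both lists have the same ordering
```

A rank-based coefficient rewards a model for ordering variants correctly even when the predicted values are on a different scale from the measured ones.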
ECNet has two required input files: 1) a FASTA file of the wild-type sequence, and 2) a TSV file that describes the fitness values of variants. Optional input files include the output of CCMPred, used for extracting local features, and a separate test TSV file.
--fasta (required). A regular FASTA file of the wild-type sequence. This file should contain only one sequence.
--train (required). Each line has two columns, mutation and score, separated by a tab, describing the fitness value of a variant. The mutation column is a string in the format [ref][pos][alt], e.g., S100T, meaning that the 100th amino acid (index starting from 1) is mutated from S to T. If a variant has multiple mutations, ; is used to concatenate them. The score column is a numerical value that quantifies the variant's fitness. Example:
mutation score
M1S 1.0
F12I;L30K 2.0
G89A 0.06
Note: This file is supplied using the --train argument. If no separate test data is provided through the --test argument, this TSV file will be split into three sets (train, valid, and test) using the ratio specified by --split_ratio (three float numbers). If a separate test TSV file is provided, this TSV file will be split into two sets (train and valid) as specified by --split_ratio (two float numbers).
--local_feature (optional). A binary file generated by CCMPred using the -b option (note that to use the -b option you need to install CCMPred from its latest GitHub branch instead of the release; you may also need to install libmsgpack-dev. See instructions below). ECNet will extract local features from this file. If this file is not provided, please add the --no_local_feature flag when running run_example.py (or, equivalently, set use_local_features=False for the ECNet class) and ECNet won't use the local features. See below for instructions on generating this binary file using HHblits and CCMPred.
--test (optional). This file has the same format as the --train TSV file.
We suggest that users tune the hyperparameters for each new protein. Several hyperparameters are exposed as arguments, e.g., d_embed, d_model, d_h, n_layers, etc.
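The mutation column format used by the --train and --test TSV files ([ref][pos][alt], joined by ; for multiple mutations) can be parsed as in the following sketch. It is an illustration only and not ECNet's own parser: the names parse_variant and apply_variant and the toy wild-type sequence are hypothetical, and the check that the reference residue matches the wild type is an added assumption.

```python
import re

# One substitution in the [ref][pos][alt] format, e.g. "S100T".
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(mutation_field):
    """Split a value such as "F12I;L30K" into (ref, pos, alt) tuples."""
    mutations = []
    for token in mutation_field.split(";"):
        match = MUTATION_RE.match(token.strip())
        if match is None:
            raise ValueError(f"Unrecognized mutation: {token!r}")
        ref, pos, alt = match.groups()
        mutations.append((ref, int(pos), alt))
    return mutations


def apply_variant(wild_type, mutation_field):
    """Return the variant sequence after applying all mutations to the wild type."""
    residues = list(wild_type)
    for ref, pos, alt in parse_variant(mutation_field):
        # Positions in the TSV are 1-based; Python indices are 0-based.
        if not 1 <= pos <= len(residues):
            raise ValueError(f"Position {pos} is outside the sequence")
        if wild_type[pos - 1] != ref:
            raise ValueError(
                f"Expected {ref} at position {pos}, found {wild_type[pos - 1]}"
            )
        residues[pos - 1] = alt
    return "".join(residues)


# Hypothetical toy wild-type sequence, for illustration only.
wild_type = "MKTAYIAKQR"
print(parse_variant("M1S;A4G"))             # [('M', 1, 'S'), ('A', 4, 'G')]
print(apply_variant(wild_type, "M1S;A4G"))  # SKTGYIAKQR
```

Checking the reference residue against the wild-type sequence is a cheap way to catch off-by-one numbering errors before training.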
Install CCMPred from its latest GitHub branch, because in the release the -b option is not available. Also, as CCMPred uses msgpack to create the binary file, you may need to install libmsgpack-dev on your system if it is not available. For example, on Ubuntu, you can run sudo apt update and then sudo apt install libmsgpack-dev.
Prepare a FASTA file example.fasta of the wild-type sequence of the protein of interest.
Search for homologous sequences using hhblits in HHsuite. (There are multiple ways to search for homologous sequences and format the alignment. Below we describe one way that uses hhblits. Other ways are also feasible, e.g., using jackhmmer as described in the DeepSequence paper.)
hhblits -i example.fasta \
-d ${path_to_hhblits_database} \
-o example.hhr \
-oa3m example.a3m \
-n 3 \
-id 99 \
-cov 50 \
-cpu 8
Use the reformat.pl script from the hh-suite/scripts directory to get an alignment in FASTA format, and then the convert_alignment.py script from the CCMpred/scripts directory to get the PSICOV format:
${path_to_hh-suite}/scripts/reformat.pl example.a3m example.fas -r
python ${path_to_CCMpred}/scripts/convert_alignment.py example.fas fasta example.psc
Run CCMPred with the -b option to generate the binary file:
ccmpred example.psc example.mat -b example.braw -d 0
Pass --local_feature example.braw to provide the local features to ECNet.
The following example shows how to train ECNet on dataset A (passed via --train) and test it on another dataset B (passed via --test).
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
--train data/RRM_single.tsv \
--test data/RRM_double.tsv \
--fasta data/RRM.fasta \
--split_ratio 0.9 0.1 \
--local_feature data/RRM.braw \
--output_dir ./output/RRM \
--save_checkpoint \
--n_ensembles 2 \
--epochs 100
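The effect of --split_ratio 0.9 0.1 in the command above can be pictured with the following sketch, which shuffles a list of variants and cuts it into consecutive parts. This is an assumption about the general idea only: ECNet's actual splitting code, its shuffling, and its random seed may differ, and the requirement that the ratios sum to 1 is part of the sketch rather than something the README states. The function name split_by_ratio is hypothetical.

```python
import random


def split_by_ratio(items, ratios, seed=0):
    """Shuffle items and cut them into len(ratios) consecutive parts."""
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        if i == len(ratios) - 1:
            # The last part takes whatever remains, so rounding never drops an item.
            end = len(shuffled)
        else:
            end = start + round(ratio * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


# Hypothetical variant identifiers, for illustration only.
variants = [f"variant_{i}" for i in range(100)]

# Two ratios: a separate test file is supplied, so only train and valid are cut.
train, valid = split_by_ratio(variants, [0.9, 0.1])
print(len(train), len(valid))  # 90 10

# Three ratios: no separate test file, so train, valid, and test are cut.
train, valid, test = split_by_ratio(variants, [0.7, 0.1, 0.2])
print(len(train), len(valid), len(test))  # 70 10 20
```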
The model trained above (saved with --save_checkpoint) can be loaded through the --saved_model_dir argument to predict for the test dataset:
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
--test data/RRM_double.tsv \
--fasta data/RRM.fasta \
--local_feature data/RRM.braw \
--n_ensembles 2 \
--output_dir ./output/RRM \
--saved_model_dir ./output/RRM
Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat Commun 12, 5743 (2021). https://doi.org/10.1038/s41467-021-25976-8
@article{luo2021ecnet,
doi = {10.1038/s41467-021-25976-8},
url = {https://doi.org/10.1038/s41467-021-25976-8},
year = {2021},
month = sep,
publisher = {Springer Science and Business Media {LLC}},
volume = {12},
number = {1},
author = {Yunan Luo and Guangde Jiang and Tianhao Yu and Yang Liu and Lam Vo and Hantian Ding and Yufeng Su and Wesley Wei Qian and Huimin Zhao and Jian Peng},
title = {{ECNet} is an evolutionary context-integrated deep learning framework for protein engineering},
journal = {Nature Communications}
}
Please submit GitHub issues or contact Yunan Luo (luoyunan[at]gmail[dot]com) for any questions related to the source code.