luoyunan / ECNet

An evolutionary context-integrated deep learning framework for protein engineering
BSD 3-Clause "New" or "Revised" License

ECNet

Overview

ECNet (evolutionary context-integrated neural network) is a deep learning model that guides protein engineering by predicting protein fitness from the sequence. It integrates the local evolutionary context, which is derived from homologous sequences and explicitly models residue-residue epistasis for the protein of interest, with the global evolutionary context, which encodes rich semantic and structural features from the enormous protein sequence universe. Please see our Nature Communications paper for details.

Installation

Clone the GitHub repository and add its directory to the Python path:

git clone https://github.com/luoyunan/ECNet.git
cd ECNet
export PYTHONPATH=$PWD:$PYTHONPATH

Dependencies

This package is tested with Python 3.7 and CUDA 10.1 on Ubuntu 18.04, with access to an Nvidia GeForce TITAN X GPU (12 GB RAM) and an Intel Xeon E5-2650 v3 CPU (2.30 GHz, 512 GB RAM). Please see requirements.txt for the necessary Python dependencies, all of which can be installed with pip or conda. Because of an issue with installing PyTorch 1.4.0 through pip, please install PyTorch with conda first.

conda install pytorch==1.4.0 cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt
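
After installing, it can help to confirm which interpreter and PyTorch build are actually active. The following is a minimal sketch, not part of ECNet; the function name check_environment is hypothetical, and it relies only on the standard library plus torch.__version__ and torch.cuda.is_available() when PyTorch is present.

```python
import importlib.util
import platform


def check_environment():
    """Collect the interpreter and PyTorch details relevant to the tested setup."""
    report = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_available": None,
    }
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

If torch is reported as None, the conda step above has not taken effect in the current environment.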

Quick Example

  1. Download example data (~5.4MB) from Dropbox.
    wget https://www.dropbox.com/s/nkgubuwfwiyy0ze/data.tar.gz
    tar xf data.tar.gz
  2. Run the example script. The following script trains an ECNet model using the fitness data of the second RRM domain of Pab1 (source). The script randomly splits the data into 70% training, 10% validation, and 20% test sets.
    CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
        --train data/RRM_single.tsv \
        --fasta data/RRM.fasta \
        --local_feature data/RRM.braw \
        --output_dir ./output/RRM_CV \
        --save_prediction \
        --n_ensembles 2 \
        --epochs 100

    This example typically takes no more than 15 min to run in our tested environment. The output (printed to stdout) is the correlation between the predicted and ground-truth fitness values.
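
To make the reported metric concrete, here is a minimal sketch of computing a correlation between measured and predicted fitness values. This README does not say which correlation coefficient the script prints, so the sketch computes both Pearson and Spearman in plain Python. The function names and the numbers are illustrative assumptions, not ECNet code or results.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up numbers for illustration only.
measured = [1.0, 2.0, 0.06, 0.5]
predicted = [0.9, 1.7, 0.2, 0.1]
print(f"Pearson:  {pearson(measured, predicted):.3f}")
print(f"Spearman: {spearman(measured, predicted):.3f}")
```

Spearman depends only on the ordering of variants, which is often what matters when ranking candidates for screening; Pearson also reflects how well the predicted values scale linearly with the measurements.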

Running on your own data

ECNet has two required input files: 1) a FASTA file of the wild-type sequence, and 2) a TSV file that describes the fitness values of variants. Optional input files include the output of CCMpred, used for extracting local features, and a separate test TSV file.

  1. Sequence FASTA file (--fasta, required). A regular FASTA file of the wild-type sequence. This file should contain only one sequence.
  2. Fitness TSV file (--train, required). Each line has two tab-separated columns, mutation and score, describing the fitness value of a variant. The mutation column is a string in the format [ref][pos][alt], e.g., S100T, meaning that the 100th amino acid (index starting from 1) is mutated from S to T. If a variant has multiple mutations, ; is used to concatenate them. The score column is a numerical value that quantifies the variant's fitness. Example:
    mutation    score
    M1S         1.0
    F12I;L30K   2.0
    G89A        0.06

    Note: This file is supplied using the --train argument. If no separate test data is provided through the --test argument, this TSV file is split into three sets (train, valid, and test) using the ratios specified by --split_ratio (three floats). If a separate test TSV file is provided, this TSV file is split into two sets (train and valid) as specified by --split_ratio (two floats).

  3. Local features (--local_feature, optional). A binary file generated by CCMpred using the -b option. To use -b, CCMpred must be installed from its latest GitHub branch rather than the release, and libmsgpack-dev may also be required (see the instructions below). ECNet extracts local features from this file. If the file is not provided, add the --no_local_feature flag when running run_example.py (or, equivalently, set use_local_features=False for the ECNet class), and ECNet will not use local features. See below for instructions on generating this binary file using HHblits and CCMpred.
  4. Additional test TSV file (--test, optional). This file has the same format as the --train TSV file.
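
Before training, it can be useful to check the input files against the formats described above. The following is a minimal sketch, not ECNet's own parser: it reads a single-record FASTA string, parses [ref][pos][alt] mutations joined by ;, and applies them to the wild-type sequence with 1-based positions. All function names are hypothetical, the toy sequence is made up, and the sketch assumes the TSV begins with a "mutation score" header line and that residues are single uppercase letters.

```python
import re

# One substitution in [ref][pos][alt] form, e.g. S100T (assumes uppercase letters).
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def read_single_fasta(text):
    """Return the one sequence in a FASTA string; reject multi-record input."""
    headers = [line for line in text.splitlines() if line.startswith(">")]
    if len(headers) != 1:
        raise ValueError(f"expected exactly one FASTA record, found {len(headers)}")
    return "".join(
        line.strip() for line in text.splitlines() if line and not line.startswith(">")
    )


def parse_variant(field):
    """Split a mutation field such as 'F12I;L30K' into (ref, pos, alt) tuples."""
    mutations = []
    for token in field.split(";"):
        match = MUTATION.match(token.strip())
        if match is None:
            raise ValueError(f"not in [ref][pos][alt] format: {token!r}")
        ref, pos, alt = match.groups()
        mutations.append((ref, int(pos), alt))
    return mutations


def apply_variant(wild_type, mutations):
    """Apply 1-indexed substitutions, checking each reference residue first."""
    residues = list(wild_type)
    for ref, pos, alt in mutations:
        if not 1 <= pos <= len(residues):
            raise ValueError(f"position {pos} is outside the sequence")
        if wild_type[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}, found {wild_type[pos - 1]}")
        residues[pos - 1] = alt
    return "".join(residues)


def read_fitness_tsv(text):
    """Parse the two-column TSV into (mutations, score) pairs."""
    rows = text.strip().splitlines()
    if rows[0].split("\t") != ["mutation", "score"]:
        raise ValueError("expected a 'mutation<TAB>score' header line")
    records = []
    for row in rows[1:]:
        field, score = row.split("\t")
        records.append((parse_variant(field), float(score)))
    return records


# Toy wild type and variants, invented for this example.
wild_type = read_single_fasta(">toy\nMKTAYIAKQR\n")
for mutations, score in read_fitness_tsv("mutation\tscore\nM1S\t1.0\nK2R;A4G\t2.0\n"):
    print(apply_variant(wild_type, mutations), score)
```

Checking the reference residue against the wild-type sequence catches off-by-one errors in position numbering, which are easy to introduce when a dataset uses 0-based or construct-relative indices.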

We suggest that users tune hyperparameters for each new protein. Several hyperparameters are exposed as arguments, e.g., d_embed, d_model, d_h, n_layers, etc.
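
One simple way to organize such tuning is a grid over the exposed hyperparameters. The sketch below only enumerates configurations; it does not train anything. The candidate values are arbitrary placeholders rather than recommended settings, and how each configuration is passed to ECNet (command-line argument or constructor argument) should be checked against the code.

```python
import itertools

# Placeholder candidate values, chosen only for illustration.
search_space = {
    "d_embed": [20, 40],
    "d_model": [128, 256],
    "d_h": [64, 128],
    "n_layers": [1, 2],
}


def grid(space):
    """Yield one dict per combination of the candidate values."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(grid(search_space))
print(len(configs), "configurations, e.g.", configs[0])
```

Each configuration would then be compared on the validation split, keeping the test split untouched until a final configuration is chosen.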

Generate local features using HHblits and CCMpred

  1. Install HHsuite and CCMpred following their instructions. Note that CCMpred should be installed from the latest branch instead of the release; otherwise the -b option is not available. Because CCMpred uses msgpack to create the binary file, you may also need to install libmsgpack-dev on your system if it is not available. For example, on Ubuntu, you can run sudo apt update and then sudo apt install libmsgpack-dev.
  2. Prepare a FASTA file, example.fasta, containing the wild-type sequence of the protein of interest.
  3. Search for homologs of the wild-type sequence using hhblits in HHsuite. (There are multiple ways to search for homologous sequences and format the alignment. Below we describe one that uses hhblits. Other ways are also feasible, e.g., using jackhmmer as described in the DeepSequence paper.)
    hhblits -i example.fasta \
        -d ${path_to_hhblits_database} \
        -o example.hhr \
        -oa3m example.a3m \
        -n 3 \
        -id 99 \
        -cov 50 \
        -cpu 8
  4. Reformat the a3m output of hhblits to PSICOV format (solution modified from here). CCMpred requires the alignment to be in the "PSICOV" format. First use the reformat.pl script from the hh-suite/scripts directory to get an alignment in FASTA format, and then use convert_alignment.py from the CCMpred/scripts directory to get the PSICOV format:
    ${path_to_hh-suite}/scripts/reformat.pl example.a3m example.fas -r
    python ${path_to_CCMpred}/scripts/convert_alignment.py example.fas fasta example.psc
  5. Run CCMpred.
    ccmpred example.psc example.mat -b example.braw -d 0
  6. Use the argument --local_feature example.braw to provide the local features to ECNet.
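
To illustrate what the reformatting in step 4 produces, here is a minimal sketch of converting an aligned FASTA string into PSICOV-style text, where each aligned sequence sits on its own line with no header lines. It is a simplified stand-in for illustration, not the code of convert_alignment.py; the function names and the toy alignment are assumptions.

```python
def read_fasta_records(text):
    """Return the sequences of a FASTA string, joining wrapped lines."""
    sequences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequences.append("")  # start a new record
        elif not sequences:
            raise ValueError("sequence data found before the first '>' header")
        else:
            sequences[-1] += line
    return sequences


def fasta_to_psicov(text):
    """Convert an aligned FASTA string to PSICOV text: one aligned row per line."""
    sequences = read_fasta_records(text)
    if not sequences:
        raise ValueError("no sequences found")
    width = len(sequences[0])
    # Every row of an alignment must have the same number of columns.
    for index, sequence in enumerate(sequences):
        if len(sequence) != width:
            raise ValueError(f"row {index} has length {len(sequence)}, expected {width}")
    return "\n".join(sequences) + "\n"


# Toy alignment, invented for this example; the last record is line-wrapped.
aligned = ">wild_type\nMKTAY\n>homolog_1\nMK-AY\n>homolog_2\nMRTA\nF\n"
print(fasta_to_psicov(aligned), end="")
```

The equal-length check is worth keeping in any real conversion: rows of differing length usually mean that insertion columns were not removed consistently in the preceding reformatting step.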

Train on dataset A and test on dataset B

To train ECNet on dataset A and test it on a different dataset B, pass A via --train and B via --test. In this mode, A is split only into training and validation sets, so --split_ratio takes two values.
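
The difference between the two splitting modes can be sketched in a few lines. This is an illustration of ratio-based random splitting in general, not ECNet's implementation; the function name split_by_ratio, the seed handling, the requirement that ratios sum to 1, and the 0.9/0.1 ratio are all assumptions made for the example.

```python
import random


def split_by_ratio(records, ratios, seed=0):
    """Shuffle records and cut them into len(ratios) consecutive parts."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    parts, start, cumulative = [], 0, 0.0
    for ratio in ratios[:-1]:
        cumulative += ratio
        end = round(cumulative * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last part takes whatever remains
    return parts


dataset_a = [f"variant_{i}" for i in range(100)]  # stands in for the --train TSV rows
dataset_b = [f"held_out_{i}" for i in range(30)]  # stands in for the --test TSV rows

# With a separate test set, dataset A is divided only into train and validation.
train, valid = split_by_ratio(dataset_a, [0.9, 0.1])
test = dataset_b

# Without one, the same rows are divided three ways, e.g. 70/10/20.
train3, valid3, test3 = split_by_ratio(dataset_a, [0.7, 0.1, 0.2])
print(len(train), len(valid), len(test))
print(len(train3), len(valid3), len(test3))
```

Keeping dataset B entirely out of the split means the reported test performance reflects generalization from A to B, for instance from single mutants to higher-order mutants.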

Citation

Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat Commun 12, 5743 (2021). https://doi.org/10.1038/s41467-021-25976-8

@article{luo2021ecnet,
  doi = {10.1038/s41467-021-25976-8},
  url = {https://doi.org/10.1038/s41467-021-25976-8},
  year = {2021},
  month = sep,
  publisher = {Springer Science and Business Media {LLC}},
  volume = {12},
  number = {1},
  author = {Yunan Luo and Guangde Jiang and Tianhao Yu and Yang Liu and Lam Vo and Hantian Ding and Yufeng Su and Wesley Wei Qian and Huimin Zhao and Jian Peng},
  title = {{ECNet} is an evolutionary context-integrated deep learning framework for protein engineering},
  journal = {Nature Communications}
}

Contact

Please submit GitHub issues or contact Yunan Luo (luoyunan[at]gmail[dot]com) for any questions related to the source code.