Code and benchmark data associated with the paper "An integrative approach to protein sequence design through multiobjective optimization".
This package provides a demonstration of how evolutionary multiobjective optimization techniques can be used to coherently integrate multiple models into the computational protein sequence design process, by 1) directly embedding models into the mutation operator to bias sampling in the sequence space, and 2) explicitly approximating the Pareto front in a user-specified objective space. The main advantage of this approach is that it outperforms and obviates the need for post hoc filtering in a multiobjective protein design problem; we anticipate this approach to be broadly relevant for problems with complex design specifications that cannot be easily encapsulated by a single model or objective function.
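To make the two ideas above concrete, here is a minimal, self-contained sketch. It is not the package's implementation: `biased_point_mutation`, `pareto_front`, and the toy per-position probabilities are hypothetical names and data chosen for illustration, and all objectives are assumed to be minimized.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def biased_point_mutation(sequence, position_probs, rng):
    """Mutate one position, sampling the new residue from model-derived weights.

    position_probs[i] maps residues to (unnormalized) probabilities that some
    model assigns at position i; here it is only a stand-in dictionary.
    """
    pos = rng.randrange(len(sequence))
    probs = position_probs[pos]
    # Exclude the current residue so the call always changes the sequence.
    candidates = [aa for aa in AMINO_ACIDS if aa != sequence[pos]]
    weights = [probs.get(aa, 0.0) for aa in candidates]
    if sum(weights) <= 0.0:
        weights = [1.0] * len(candidates)  # fall back to uniform sampling
    new_residue = rng.choices(candidates, weights=weights, k=1)[0]
    return sequence[:pos] + new_residue + sequence[pos + 1:]


def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scores):
    """Return indices of the non-dominated score vectors (objectives minimized)."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


rng = random.Random(0)
parent = "MKTAY"
# Toy "model": every position prefers L or V (assumed data, not a real model).
probs = [{"L": 0.7, "V": 0.3} for _ in parent]
child = biased_point_mutation(parent, probs, rng)

scores = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(scores)  # (3.0, 3.0) is dominated by (2.0, 2.0)
```

The mutation step is where a model biases sampling, while the front is computed directly in objective space, so no candidate has to be discarded by a separate filtering pass afterwards.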
Clone the repo and run `pip install .` from the repo root directory to install the package. Take a look at the RfaH benchmark code in `RfaH_benchmark/` and the docstrings in `__init__.py`, which provides the primary user interface for setting up a simulation.
Besides `pip install`, the repo can also be packaged into a `.whl` file using `python -m build --wheel` from the root directory.
To use the AF2Rank objective function, the following dependency requirements need to be properly configured:

- `colabdesign` (version 1.1.1): `pip -q install git+https://github.com/sokrypton/ColabDesign.git@v1.1.1`; see the linked repo for more information.
- `alphafold_multimer_v2` model parameters. Newer versions of the multimer model parameters can be found through the `alphafold` GitHub repo, but these parameters have not been tested with the current code.
- A `TMscore` binary compiled from `TMscore.cpp`, e.g. with `g++ -static -O3 -ffast-math -lm -o TMscore TMscore.cpp`.

The path to the `TMscore` binary file will need to be passed as an argument to a `wrapper.ObjectiveAF2Rank` object.

To use the ESM models, install the `pgen`/`protein_gibbs_sampler` package, following the instructions therein. The ESM model parameters will be downloaded automatically the first time the model is called.
To parallelize calculations using MPI, install the `mpi4py` package.
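As a rough illustration of what an MPI-aware job script does with a communicator, the sketch below obtains `mpi4py.MPI.COMM_WORLD` when `mpi4py` is importable and otherwise falls back to a single-process stand-in. The helpers `get_comm`, `rank_and_size`, and `my_share`, and the round-robin split, are hypothetical and are not taken from this package; they only show how ranks can divide a population of candidates.

```python
def get_comm():
    """Return MPI.COMM_WORLD, or None when mpi4py is not installed."""
    try:
        from mpi4py import MPI
    except ImportError:
        return None
    return MPI.COMM_WORLD


def rank_and_size(comm):
    """Rank and world size; (0, 1) when running without MPI."""
    if comm is None:
        return 0, 1
    return comm.Get_rank(), comm.Get_size()


def my_share(items, rank, size):
    """Round-robin slice of the work assigned to this rank."""
    return items[rank::size]


comm = get_comm()
rank, size = rank_and_size(comm)
candidates = [f"candidate_{i}" for i in range(8)]  # placeholder work items
local = my_share(candidates, rank, size)
print(f"rank {rank} of {size} handles {len(local)} candidates")
```

Launched as `mpirun -np 4 python3 <job_script>.py`, each of the four processes would report its own share. In the package itself, the communicator is instead handed to the `comm` argument described in the parallelization notes.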
Note that this repo contains a vendorized version of ProteinMPNN (version 1.0.1) and AF2Rank.
Change the line `logger.setLevel(logging.WARN)` in `utils.get_logger()` to `logger.setLevel(logging.DEBUG)` to print out debugging information.
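The effect of that one-line change can be sketched with the standard `logging` module. The `get_logger` below is a hypothetical stand-in, not the actual `utils.get_logger()`, whose signature, handlers, and format may differ.

```python
import io
import logging


def get_logger(name, level=logging.WARN, stream=None):
    """Hypothetical stand-in for a logger factory with a configurable level."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
        logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger


# Capture output in memory so the two levels can be compared side by side.
quiet_buf, verbose_buf = io.StringIO(), io.StringIO()
quiet = get_logger("demo.quiet", logging.WARN, quiet_buf)
verbose = get_logger("demo.verbose", logging.DEBUG, verbose_buf)

quiet.debug("hidden at WARN level")
verbose.debug("shown at DEBUG level")
```

With the level at `logging.WARN` the debug record is dropped before it reaches the handler; at `logging.DEBUG` it is emitted.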
As long as `torch` and `jax` are properly configured, the code should automatically detect and utilize available GPUs. To force CPU computation, set the `device` argument to `cpu` for the relevant wrapper objects.
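Before launching a long run, it can help to check what `torch` and `jax` actually see. The snippet below is a small diagnostic sketch, independent of this package; `detect_accelerators` is a hypothetical helper name, and it simply skips any library that is not installed.

```python
def detect_accelerators():
    """Report what torch and jax can see, skipping libraries that are absent."""
    report = {}
    try:
        import torch
        report["torch_cuda"] = bool(torch.cuda.is_available())
    except ImportError:
        pass
    try:
        import jax
        # jax.devices() lists the devices of the default backend.
        report["jax_platforms"] = sorted({d.platform for d in jax.devices()})
    except ImportError:
        pass
    return report


report = detect_accelerators()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `torch_cuda` is `False` or `jax_platforms` lists only `cpu`, the libraries are not seeing a GPU, and the run will effectively proceed on CPU regardless of the `device` setting.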
The code supports three modes of parallelization:

- MPI: pass an `mpi4py` communicator object (e.g., `mpi4py.MPI.COMM_WORLD`) to the `comm` argument (and set the `cluster_parallelization` argument to False, if necessary), and then wrap the Python interpreter call with the MPI environment (e.g., `mpirun -np <population_size> python3 <job_script>.py`). In theory, MPI should be able to handle a hybrid compute environment with multiple available CPU and GPU cores, if it has been compiled properly, but this setup has not been tested.
- Job scheduler (SGE): set `cluster_parallelization` (and possibly `cluster_parallelize_metrics`) to True to enable this behavior. The user needs to modify the `utils.sge_write_submit_script()` function directly to configure the submission scripts in accordance with the available local compute environment. It should be possible to adapt the code to utilize other job schedulers, such as SLURM, by updating the appropriate commands in `utils.sge_write_submit_script()`, `utils.sge_submit_job()`, and `utils.cluster_manage_job()`.

Caveats
`temp_dir` argument.

The genetic algorithms in this repo have been benchmarked against three model systems: RfaH, PapD, and CaM. See the benchmarks folder for more information on the data and code for the benchmark analysis of each of these model systems.