

Controlled Amplitude of Present Epitopes (CAPE)

This is the code repository for the article: "Comparing a language model and a physics-based approach to modify MHC Class-I immune-visibility for the design of vaccines and therapeutics".

Protein therapeutics already have an arsenal of applications that include disrupting protein interactions, acting as potent vaccines, and replacing genetically deficient proteins. Therapeutics must avoid triggering unwanted immune-responses towards the therapeutic protein or viral vector proteins. In contrast, vaccines must support a robust immune-reaction targeting a broad range of pathogen variants. Therefore, computational methods modifying proteins' immunogenicity without disrupting function are needed. While many components of the immune-system can be involved in a reaction, we focus on cytotoxic T lymphocytes (CTLs). These target short peptides presented via the MHC-I pathway. To explore the limits of modifying the visibility of those peptides to CTLs within the distribution of naturally occurring sequences, we developed a novel machine learning technique, CAPE-XVAE. It combines a language model with reinforcement learning to modify a protein's immune-visibility. Our results show that CAPE-XVAE effectively modifies the visibility of the HIV Nef protein to CTLs. We contrast CAPE-XVAE with CAPE-Packer, a physics-based method we also developed. Compared to CAPE-Packer, the machine learning approach suggests sequences that draw upon local sequence similarities in the training set. This is beneficial for vaccine development, where the synthetic sequence should be representative of the real viral population. Additionally, the language model approach holds promise for preserving both known and unknown functional constraints, which are essential for the immune-modulation of therapeutic proteins. In contrast, CAPE-Packer emphasizes preserving the protein's overall fold and can reach greater extremes of immune-visibility, but falls short of capturing the sequence diversity of the viral variants available to learn from.

Below we describe how to install the software used to obtain our results.

Installation

General Requirements

Set up the container

If not indicated otherwise, commands should be run on the host system. In particular, lines starting with 'H:' need to be executed on the host and lines starting with 'C:' in the container.
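For example (these two commands are purely illustrative and not part of the setup):

H: docker ps          # run on the host
C: python --version   # run inside the container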

Clone the repository
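For example, on the host (assuming the repository is hosted at github.com/hcgasser/CAPE):

H: git clone https://github.com/hcgasser/CAPE.git
H: cd CAPE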

Create docker image
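A minimal sketch, assuming the Dockerfile sits in the repository root and using cape as the image tag (the project's actual tag may differ):

H: docker build -t cape .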

Finish the setup of the container:

Exit and restart the container
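As a sketch, assuming the container was started with docker run and given a name (substitute your own container name):

C: exit
H: docker start -ai <container name>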

Run Experiments

Below, we describe how to modify immune visibility with the CAPE system, using the HIV Nef protein as an example. Unless stated otherwise, the following commands should be run in the container. First, we set some standard values:

export DOMAIN="HIV_nef"
export MHC_Is="HLA-A*02:01+HLA-A*24:02+HLA-B*07:02+HLA-B*39:01+HLA-C*07:01+HLA-C*16:01"

Prepare data

Prepare the MHC Class I position weight matrix predictor

Run the following in the container to generate the MHC Class I position weight matrices:
MHC-I_rank_peptides.py --output ${PF}/data/input/immuno/mhc_1/MhcPredictorPwm --alleles ${MHC_Is} --peptides_per_length 1000000
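${MHC_Is} is the allele list exported above. ${PF} is not defined in this excerpt; from its use here and in later commands, it appears to point to the project folder inside the container.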

CAPE-XVAE

Before generating immune-modified sequences, we need to train the VAE model. We provide a model pretrained for generating HIV Nef sequences. When using it, you can skip the hyper-parameter search step and proceed directly to generating sequences. To use this pretrained model, you only need to set the following two environment variables:

export XVAE_MODEL_ID=mlp_1606474
export XVAE_CKPT_ID=mlp_1606474:last

Hyper-parameter search

Plot training metrics

Generate sequences

CAPE-XVAE can generate clean sequences (where impossible tokens and premature stop tokens are removed) as well as dirty sequences (where they are not). Here we also introduce the concept of the sequence hash file. To efficiently manage the multitude of sequences, structures, ... across various analyses, the CAPE system generates a sequence hash for each sequence. If OUTPUT_FOLDER_PATH is not provided, the generated sequences will be stored in /CAPE/artefacts/CAPE-XVAE/jobs/<job id>/generated/baseline (clean) and /CAPE/artefacts/CAPE-XVAE/jobs/<job id>/generated/dirty (dirty). Clean sequences will be referred to as baseline going forward. The dirty ones were only produced to check whether the system would regularly generate premature stop tokens and other impossible tokens.
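The exact hash function CAPE uses is not spelled out here; purely to illustrate the idea of a content-derived identifier, a hash for a made-up amino-acid string could be computed like this:

C: echo -n "MGGKWSKSSIVGW" | sha256sum | cut -c1-16   # illustrative only, not CAPE's actual scheme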

CAPE also uses Loch, a directory structure and library that stores all FASTA, PDB, functional and molecular-dynamics files associated with those sequences (hashes). Clean sequences will also be stored in this directory. To find sequences in Loch, we require their sequence hashes. These are stored in the file specified with SEQ_HASH_FILE_PATH. If this is not provided, the standard file path will be used (/CAPE/artefacts/CAPE-XVAE/${DOMAIN}.CAPE-XVAE.baseline.clean/dirty.seq_hash). In the container run:

Modify immune visibility

The following commands (run in the container) randomly take natural sequences (from the dataset used to train/validate/test the model) and run the immune-visibility modification process on them. The results will be saved in the Loch directory and the sequence hashes can be found in /CAPE/artefacts/CAPE-XVAE/${DOMAIN}.CAPE-XVAE.${PROFILE}.final.seq_hash.
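To inspect the resulting hashes (with ${PROFILE} set to the visibility profile you ran), the file can simply be listed:

C: cat /CAPE/artefacts/CAPE-XVAE/${DOMAIN}.CAPE-XVAE.${PROFILE}.final.seq_hash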

CAPE-Packer

Start CAPE-Packer server

Run CAPE-Packer client

Before the next steps, the PDB files need to be present in the structure_path. Among other methods, this can be achieved by running the following on the host system:
H: ./tools/run_alphafold.sh ${CAPE} HIV_nef ${CAPE}/artefacts/CAPE/loch

Run CAPE-Eval

Start a Notebook Server

In the container:
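A minimal sketch, assuming a standard Jupyter installation inside the container (the project may use different options):

C: jupyter notebook --ip 0.0.0.0 --no-browser --allow-root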

Produce 3D Structures, MD Simulations, Functional Predictions

export SEQ_HASH_FILE=<insert path to seq_hash file>
tools/run_for_seq_hashes.sh "${PF}/tools/MD/md.py --pdb ${LOCH}/structures/AF/pdb/#SEQ_HASH#_AF.pdb --output ${LOCH}/dynamics" $SEQ_HASH_FILE
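Judging by the placeholder, run_for_seq_hashes.sh appears to substitute each hash listed in the seq_hash file for #SEQ_HASH# and run the quoted command once per sequence. For example, to process the clean baseline sequences, SEQ_HASH_FILE could point at the standard path given above (reading clean/dirty as two alternative files):

export SEQ_HASH_FILE=/CAPE/artefacts/CAPE-XVAE/${DOMAIN}.CAPE-XVAE.baseline.clean.seq_hash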

Run evaluation notebook