This is the code repository for the article: "Comparing a language model and a physics-based approach to modify MHC Class-I immune-visibility for the design of vaccines and therapeutics".
Protein therapeutics already have an arsenal of applications that include disrupting protein interactions, acting as potent vaccines, and replacing genetically deficient proteins. Therapeutics must avoid triggering unwanted immune-responses towards the therapeutic protein or viral vector proteins. In contrast, vaccines must support a robust immune-reaction targeting a broad range of pathogen variants. Therefore, computational methods modifying proteins' immunogenicity without disrupting function are needed. While many components of the immune-system can be involved in a reaction, we focus on cytotoxic T lymphocytes (CTL). These target short peptides presented via the MHC-I pathway. To explore the limits of modifying the visibility of those peptides to CTL within the distribution of naturally occurring sequences, we developed a novel machine learning technique, CAPE-XVAE. It combines a language model with reinforcement learning to modify a protein's immune-visibility. Our results show that CAPE-XVAE effectively modifies the visibility of the HIV Nef protein to CTL. We contrast CAPE-XVAE to CAPE-Packer, a physics-based method we also developed. Compared to CAPE-Packer, the machine learning approach suggests sequences that draw upon local sequence similarities in the training set. This is beneficial for vaccine development, where the synthetic sequence should be representative of the real viral population. Additionally, the language model approach holds promise for preserving both known and unknown functional constraints, which are essential for the immune-modulation of therapeutic proteins. In contrast, CAPE-Packer emphasizes preserving the protein's overall fold and can reach greater extremes of immune-visibility, but falls short of capturing the sequence diversity of viral variants available to learn from.
Below we describe how to install the software used to obtain our results.
Install AlphaFold (https://github.com/google-deepmind/alphafold/). `$ALPHAFOLD_REPO` needs to point to the path of the AF repo and `$ALPHAFOLD_DATA` to the path of the database for the MSA. AlphaFold is then invoked via:

```
python3 ${ALPHAFOLD_REPO}/docker/run_docker.py --data_dir=$ALPHAFOLD_DATA ...
```

If not indicated otherwise, commands should be run on the host system. In particular, lines starting with 'H' need to be executed on the host, lines starting with 'C' in the container.
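For illustration only (these are not setup commands), the convention looks like this:

```
H: nvidia-smi         # runs on the host, e.g. to check GPU availability
C: python --version   # runs inside the container
```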
Clone the repository and build the Docker image:

```
git clone https://github.com/hcgasser/CAPE.git
export CAPE=<path of repo folder>
cd $CAPE
make -B image
```
Start the container (you need to set a container password, e.g. 'cape_pwd'):

```
docker run --name cape_container --gpus all -it -p 9000:9000 -v ${CAPE}:/CAPE cape
```
Several external dependencies need to be set up inside the container; see `${CAPE}/external/README.md` for how to do this:

```
vim ${CAPE}/external/README.md
```

Then set up the conda environments:

```
./setup/setup_conda.sh
```
Afterwards, leave the container:

```
exit
```

To restart the container later:

```
docker start -i cape_container
```

To save the container state as an image:

```
docker commit cape_container cape_image
```
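Should you need an additional shell in the running container, standard Docker tooling applies (this is generic Docker usage, not CAPE-specific):

```
docker exec -it cape_container bash
```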
Below, we describe how to modify immune-visibility with the CAPE system for the example of the HIV Nef protein. If not stated differently, the following commands should be run in the container.
First, we set some standard values:

```
export DOMAIN="HIV_nef"
export MHC_Is="HLA-A*02:01+HLA-A*24:02+HLA-B*07:02+HLA-B*39:01+HLA-C*07:01+HLA-C*16:01"
```
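The MHC Class I alleles are concatenated with '+'. To list them individually, a plain-shell one-liner (illustration only) is:

```
echo "$MHC_Is" | tr '+' '\n'
```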
Download the data as `HIV-1_nef.fasta` (Source: https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html; Virus: HIV-1; Subtype: Any subtype; Genomic region: "Nef CDS"; ~56k sequences; Download options: unaligned, amino acids). The file `${CAPE}/configs/CAPE/dhparams/HIV_nef.yaml` describes how the system should deal with this raw data. `HIV-1_nef.fasta` needs to be placed in `${CAPE}/data/raw/HIV-1_nef.fasta`.
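For example, assuming the file was downloaded to `~/Downloads` (adjust the source path to your setup):

```
cp ~/Downloads/HIV-1_nef.fasta ${CAPE}/data/raw/HIV-1_nef.fasta
```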
Then convert the raw data into model input:

```
./cape.py --domain $DOMAIN --task data_raw_to_input
```

The resulting dataset is called "HIV_nef" and is referenced below via the `--DATA \"HIV_nef\"` parameter. To create the support set, run:

```
./cape.py --domain $DOMAIN --task data_create_support --DATA \"HIV_nef\"
```
Run the following in the container to generate the MHC Class I position weight matrices:

```
MHC-I_rank_peptides.py --output ${PF}/data/input/immuno/mhc_1/MhcPredictorPwm --alleles ${MHC_Is} --peptides_per_length 1000000
```
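The resulting PWMs should end up under the path the CAPE-Packer server expects later; a quick sanity check (the exact layout of the output folder is an assumption here):

```
ls ${PF}/data/input/immuno/mhc_1/MhcPredictorPwm/pwm
```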
Before generating immune-modified sequences, we need to train the VAE model. We provide a model that is pretrained for generating HIV Nef sequences; when using it, you can skip the hyper-parameter search step and go directly to generating sequences.
To use this pretrained model, you only need to set the following two environment variables:

```
export XVAE_MODEL_ID=mlp_1606474
export XVAE_CKPT_ID=mlp_1606474:last
```
Alternatively, to run the hyper-parameter search yourself:

```
for ((i=1; i<=10; i++)); do cape-xvae.py --task hyp --domain ${DOMAIN} --MODEL \"XVAE_nef_32\" --DATA \"HIV_nef\"; done
```

- The `--task` parameter determines which yaml file specifies the task to perform (can be found in `${PF}/configs/CAPE-XVAE/tasks/`).
- The `--MODEL` parameter determines which yaml file specifies the model hyper-parameters (can be found in `${PF}/configs/CAPE-XVAE/mhparams/`).
- The `--DATA` parameter determines the yaml file for the data (see "Prepare data").

To list the results of the hyper-parameter search, run:

```
cape-xvae.py --domain $DOMAIN --task hyp_ls
```
The last column of the first line of each entry states the job id; each job's artefacts are stored under `/CAPE/artefacts/CAPE-XVAE/jobs/<job id>`. Each job has a `last` checkpoint, so the corresponding checkpoint id would be `<job id>:<ckpt>` (e.g. `py_12:last`).
Set the checkpoint id and model id of the job you wish to use (e.g. `export XVAE_CKPT_ID=py_12:last` and `export XVAE_MODEL_ID=py_12`). To plot the training metrics, run:

```
cape-xvae.py --task training_metrics --MODEL_ID \"$XVAE_MODEL_ID\" --Y \"loss+loss_recon+loss_kl\"
```
The resulting plots can be found in the `figures` subfolder of the job.

CAPE-XVAE can generate clean sequences (where impossible tokens and premature stop tokens are removed) as well as dirty sequences (where this is not the case). Here we also introduce the concept of the sequence hash file: to efficiently manage the multitude of sequences, structures, etc. across the various analyses, the CAPE system generates a sequence hash for each sequence.
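As a minimal sketch of the idea (illustrative only; CAPE's actual hashing scheme may differ):

```
# derive a deterministic identifier from a (hypothetical) amino-acid sequence
echo -n "MGGKWSKSS" | sha256sum | cut -d' ' -f1
```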
If `OUTPUT_FOLDER_PATH` is not provided, the generated sequences will be stored in `/CAPE/artefacts/CAPE-XVAE/jobs/<job id>/generated/baseline` (clean) and `/CAPE/artefacts/CAPE-XVAE/jobs/<job id>/generated/dirty` (dirty). 'Clean' sequences will be referred to as baseline going forward; the dirty ones were only produced to check whether the system would regularly generate premature stop tokens and other impossible tokens.

CAPE also uses Loch, a directory and library that stores all fasta, pdb, functional, and molecular-dynamics files associated with those sequences (hashes). Clean sequences will also be stored in this directory. To find sequences in Loch, we require their sequence hashes. These will be stored in the file specified with `SEQ_HASH_FILE_PATH`. If this is not provided, the standard file path will be used (`/CAPE/artefacts/CAPE-XVAE/${DOMAIN}.CAPE-XVAE.baseline.clean/dirty.seq_hash`).
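For illustration, a hash taken from a seq-hash file can be used to locate the associated files in Loch (the PDB naming convention below is the one used by the commands later in this README):

```
SEQ_HASH=$(head -n 1 /CAPE/artefacts/CAPE-XVAE/${DOMAIN}.CAPE-XVAE.baseline.clean.seq_hash)
ls ${LOCH}/structures/AF/pdb/${SEQ_HASH}_AF.pdb   # structure file for this sequence, once predicted
```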
In the container run:
```
cape-xvae.py --task generate --domain $DOMAIN \
    --CLEAN True --N_SEQS 100 --CKPT_ID \"$XVAE_CKPT_ID\"

cape-xvae.py --task generate --domain $DOMAIN \
    --CLEAN False --N_SEQS 100 --CKPT_ID \"$XVAE_CKPT_ID\"
```
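The first call produces the clean (baseline) set, the second the dirty set. To count how many baseline sequences were generated (using the default seq-hash file path from above):

```
wc -l < /CAPE/artefacts/CAPE-XVAE/${DOMAIN}.CAPE-XVAE.baseline.clean.seq_hash
```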
The following commands (run in the container) randomly take a natural sequence (from the dataset used to train/validate/test the model) and run the immune-visibility modification process on it. The results will be saved in the Loch directory and the sequence hashes can be found in `/CAPE/artefacts/CAPE-XVAE/${DOMAIN}.CAPE-XVAE.${PROFILE}.final.seq_hash`.
First, set the profile to produce (one of `reduced`/`increased`/`inc-nat`):

```
export PROFILE=reduced
```

The following modifies a natural sequence towards the `$PROFILE` immune-visibility profile:
```
cape-xvae.py --task modify --domain $DOMAIN \
    --PROFILE \"${PROFILE}\" --MODPARAMS \"modparams\" --MHCs \"$MHC_Is\" \
    --SEQ \"natural\" --SEQ_ENC_STOCHASTIC True \
    --CKPT_ID \"${XVAE_CKPT_ID}\"
```
Run the following to perform 100 designs:

```
tools/run_CAPE-XVAE_modify.sh ${PROFILE} 100
```

For CAPE-Packer, start the server in a tmux session:

```
tmux new-session -t CAPE_Packer_Server
cape_packer_server.py --port 12345 --pwm_path ${PF}/data/input/immuno/mhc_1/MhcPredictorPwm/pwm --input_files_path ${PF}/data/input/HIV_nef_full
```

Detach from the tmux session with `Ctrl + b` then `d`.
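Optionally, check that the server is accepting connections (generic tooling, assumes netcat is installed):

```
nc -z localhost 12345 && echo "CAPE-Packer server is up"
```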
Before the next steps, the PDB files need to be present in the `structure_path`. Among other methods, this can be achieved by running the following on the host system:

```
H: ./tools/run_alphafold.sh ${CAPE} HIV_nef ${CAPE}/artefacts/CAPE/loch
```
Set the profile to produce (one of `baseline`/`reduced`/`increased`/`inc-nat`):

```
export PROFILE=reduced
```

The following runs CAPE-Packer to generate sequences with the `$PROFILE` immune-visibility profile:

```
cape_packer_from_seq_hashes.py --domain ${DOMAIN} \
    --profile ${PROFILE} --mhc_1_alleles $MHC_Is \
    --structure_path "${LOCH}/structures/AF/pdb" --structure_predictor AF \
    --seq_hashes "${PF}/artefacts/CAPE/${DOMAIN}.support.seq_hash" \
    --output_path "${PF}/artefacts/CAPE-Packer" \
    --rosetta_path $ROSETTA_PATH
```
- `--structure_path`: the path where the program can find the PDB files
- `--structure_predictor`: the name of the structure predictor used (the 3D structure filename is then `<seq hash from seq_hash file>_<structure predictor>.pdb`)
- `--seq_hashes`: the file containing the sequence hashes of the 3D structures
- `--output_path`: where to write the generated sequences
- `--rosetta_path`: path to the Rosetta system
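For example, for a hypothetical sequence hash `3fa4c9...` predicted with AlphaFold (`AF`), CAPE-Packer would look for:

```
${LOCH}/structures/AF/pdb/3fa4c9..._AF.pdb
```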
In the container, start a Jupyter notebook server in a tmux session:

```
tmux new-session -t NB_Server
cd ${PF}
jupyter server password
jupyter server --port 9000
```

Detach from the tmux session with `Ctrl + b` then `d`.
To predict the 3D structures of the sequences, run on the host:

```
H: ${CAPE}/tools/run_alphafold.sh
```

To predict their function, run in the container:

```
C: ${PF}/tools/run_transfun.sh
```

This will take a long time, so we suggest running it only for a sample. For each of the seq-hash files
```
${PF}/artefacts/CAPE/HIV_nef.support.seq_hash
${PF}/artefacts/CAPE-XVAE/HIV_nef.CAPE-XVAE.baseline.final.seq_hash
${PF}/artefacts/CAPE-XVAE/HIV_nef.CAPE-XVAE.reduced.final.seq_hash
${PF}/artefacts/CAPE-XVAE/HIV_nef.CAPE-XVAE.increased.final.seq_hash
${PF}/artefacts/CAPE-XVAE/HIV_nef.CAPE-XVAE.inc-nat.final.seq_hash
${PF}/artefacts/CAPE-Packer/HIV_nef.CAPE-Packer.baseline.final.seq_hash
${PF}/artefacts/CAPE-Packer/HIV_nef.CAPE-Packer.reduced.final.seq_hash
${PF}/artefacts/CAPE-Packer/HIV_nef.CAPE-Packer.increased.final.seq_hash
${PF}/artefacts/CAPE-Packer/HIV_nef.CAPE-Packer.inc-nat.final.seq_hash
```
run:

```
export SEQ_HASH_FILE=<insert path to seq_hash file>
tools/run_for_seq_hashes.sh "${PF}/tools/MD/md.py --pdb ${LOCH}/structures/AF/pdb/#SEQ_HASH#_AF.pdb --output ${LOCH}/dynamics" $SEQ_HASH_FILE
```
Finally, to evaluate the results, open `http://localhost:9000/tree?` in your browser, enter the password defined above, and run the notebook `CAPE-Eval/cape-eval.ipynb`.