This is the official repo of Surface-VQMAE: Vector-quantized Masked Auto-encoders on Molecular Surfaces (ICML 2024), where a novel self-supervised algorithm is proposed to employ the abundant unlabeled molecular surfaces. Note: there are some improvements and differences from the initial version described in the ICML paper. 👉 (1) We provide a PyMol-based mechanism to produce protein surfaces, in additional to the fast-sampling method in dMaSIF. 👉 (2) We modify the point cloud network and delete the calculation of tangent vectors, since those tangent vectors are not equivariant.
We use Python 3.7.11 with cuda-version 12.0. Please run the following command to install necessary packages.
pip install -r requirements.txt
cd ./src/utils/chamfer_dist
python setup.py install --user
A light-weighted pretrained model weight can be found in the folder weight/light.pt
.
We use PDB-REDO
as the pretraining resource. You can run to download the entire dataset:
rsync -avuP \
--include='*_final.cif' \
--include='*/' \
--exclude='*' \
rsync://rsync.pdb-redo.eu/pdb-redo
./PDB_REDO
For downstream task evaluation, we provide an example here, namely, the epitope prediction challenge.
Protein structures in the SAbDab
dataset (Data version: 2023.09.23) can be downloaded
here.
Extract all_structures.zip
into the data
folder. The data
folder contains a snapshot of the dataset index (sabdab_summary_all.tsv
).
You may replace the index with the latest version here.
python vae.py ./configs/train/vae.yml
python train.py ./configs/train/surfformer.yml
If you are interested in our work and recognize our contributions, please cite it! Also, please do not hesitate to email me (fangwu97@stanford.edu) immediately if you have any sort of questions or fancy points that deserve further communication.
@article{fangsurface,
title={Surface-VQMAE: Vector-quantized Masked Auto-encoders on Molecular Surfaces},
author={Fang, WU and Li, Stan Z}
}