
Implementation of PocketXMol, the pocket-interacting molecular generative foundation model.
MIT License

PocketXMol

This is a preliminary code release for the pocket-interacting foundation model "PocketXMol".

Please note that this code is a preview version that has not yet been cleaned up or refactored. It may be hard to read, but it should be functional.

Setup

Environment

To set up the environment on a Linux server, you can use Anaconda to create a new environment named pxm from the environment.yml file (for CUDA 11.7) with the following commands (this takes several minutes):

conda env create -f environment.yml
conda activate pxm

If you have a different CUDA version, you may need to modify the versions of the pytorch-related packages in the environment.yml file.
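For illustration, the kind of edit involved might look like the fragment below. The package names and versions here are hypothetical, so check the actual environment.yml for the real entries:

```yaml
# Hypothetical excerpt of environment.yml -- the real package pins may differ.
dependencies:
  - pytorch=1.13.1        # illustrative version
  - pytorch-cuda=11.8     # was 11.7; choose the build matching your CUDA driver
```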

Data and model weights

Processed test data and model weights for sampling

For sampling, the processed data and trained model weights are provided in the file data_test.tar.gz, available from Google Drive. Download and extract it using the command:

tar -zxvf data_test.tar.gz

After extraction, there will be a directory named data which contains:

Processed data for training

For training, demonstrative processed training data are provided in the file data_train_processed_reduced.tar.gz from Google Drive. The complete processed training data are too large (>500 GB), so we provide a reduced subset just to demonstrate the training process. Similarly, download and extract it using the command:

tar -zxvf data_train_processed_reduced.tar.gz

Extraction produces a directory named data_training containing reduced training sets for demonstrative training.

Raw data and processing steps

If you want to train the model with the full training data, please follow the instructions in the process/process_steps.md file to process the raw data for complete training.

Sample for data in test sets

We provide the configuration files for sampling in the test sets of individual tasks.

NOTE:

  • The batch size for sampling is defined in the configuration files and was verified on an 80 GB A100 GPU. If it is too large for your GPU memory, reduce batch_size in the configuration files or set it directly on the command line (e.g., --batch_size 100).
  • Typical running time for an individual test set is around 1 to 6 hours on a single A100 GPU.
  • After sampling, a new directory containing the generated results is created in the specified outdir. It is named {exp_name}_{timestamp}, where exp_name is derived from the name of the configuration file and timestamp is the time when the experiment starts. Within it, the SDF subdirectory contains the generated molecules, and the files gen_info.csv and log.txt contain the generation information.
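As a sketch of how the generated results might be inspected afterwards, the snippet below sorts the entries of a gen_info.csv by a self-confidence column. The column names used here (mol_id, filename, self_confidence) are illustrative assumptions, not the file's documented schema:

```python
import csv
import io

# Hypothetical contents of a gen_info.csv file; the actual columns written
# by PocketXMol may differ -- self_confidence here is an assumption.
sample_csv = """mol_id,filename,self_confidence
0,SDF/0.sdf,0.91
1,SDF/1.sdf,0.55
2,SDF/2.sdf,0.78
"""

def top_by_confidence(csv_text, n=2):
    """Return the n rows with the highest self-confidence score."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["self_confidence"]), reverse=True)
    return rows[:n]

best = top_by_confidence(sample_csv)
print([r["mol_id"] for r in best])  # -> ['0', '2']
```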

Small molecule docking

Sample docking poses for the 428 pairs of proteins and small molecules in the PoseBusters set.

python scripts/sample_drug3d.py \
    --config_task configs/sample/test/dock_poseboff/base.yml \
    --outdir outputs_test/dock_posebusters \
    --device cuda:0

The task configuration files are in configs/sample/test/dock_poseboff. Configuration files include:

Confidence scores

The self-confidence scores are in the gen_info.csv file produced during the sampling process. To calculate other confidence scores for the generated molecular poses, use the following command:

python scripts/believe.py \
    --exp_name base_pxm \
    --result_root outputs_test/dock_posebusters \
    --config configs/sample/confidence/tuned_cfd.yml \
    --device cuda:0

The parameters:

Ranking scores

To get the ranking scores for pose selection, after obtaining the confidence scores, use the following command:

python scripts/rank_pose.py \
    --exp_name base_pxm \
    --result_root outputs_test/dock_posebusters \
    --db poseboff

This produces the ranking.csv file, which contains the self_ranking and tuned_ranking columns as ranking scores.
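As an illustration of how ranking.csv could then be used for pose selection, the sketch below keeps the top-ranked pose per target. The self_ranking and tuned_ranking columns come from the description above, but the data_id and pose_id columns, and the assumption that lower rank values are better, are hypothetical:

```python
import csv
import io

# Hypothetical ranking.csv contents; data_id / pose_id are illustrative
# assumptions about how generated poses are grouped per target.
ranking_csv = """data_id,pose_id,self_ranking,tuned_ranking
1a2b,0,2,1
1a2b,1,1,2
3c4d,0,1,1
3c4d,1,2,2
"""

def best_pose_per_target(csv_text, score_col="tuned_ranking"):
    """Pick the pose with the lowest rank value for each target."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row["data_id"]
        if key not in best or float(row[score_col]) < float(best[key][score_col]):
            best[key] = row
    return best

selected = best_pose_per_target(ranking_csv)
print({k: v["pose_id"] for k, v in selected.items()})  # -> {'1a2b': '0', '3c4d': '0'}
```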

Peptide docking

Sample docking poses for the 79 pairs of proteins and peptides in the peptide docking test set.

python scripts/sample_pdb.py \
    --config_task configs/sample/test/dock_pepbdb/base.yml \
    --outdir outputs_test/dock_pepbdb \
    --device cuda:0

The task configuration files are in configs/sample/test/dock_pepbdb. Configuration files include:

Molecular conformation generation

Sample molecular conformations for the 199 molecules in the conformation test set.

python scripts/sample_drug3d.py \
    --config_task configs/sample/test/conf_geom/base.yml \
    --outdir outputs_test/conf_geom \
    --device cuda:0

Structure-based drug design (SBDD)

Sample drug-like molecules for the 100 protein pockets in the SBDD test set.

python scripts/sample_drug3d.py \
    --config_task configs/sample/test/sbdd_csd/base.yml \
    --outdir outputs_test/sbdd_csd \
    --device cuda:0

The task configuration files are in configs/sample/test/sbdd_csd. Configuration files include:

3D molecule generation

Generate drug-like molecules with sizes matching those of the GEOM-Drug validation set.

python scripts/sample_drug3d.py \
    --config_task configs/sample/test/denovo_geom/base.yml \
    --outdir outputs_test/denovo_geom \
    --device cuda:0

The task configuration files are in configs/sample/test/denovo_geom. Configuration files include:

Fragment linking

Design molecules by linking fragments for the 416 pairs of proteins and fragments in the fragment linking test set.

python scripts/sample_drug3d.py \
    --config_task configs/sample/test/linking_moad/known_connect.yml \
    --outdir outputs_test/linking_moad \
    --device cuda:0

The task configuration files are in configs/sample/test/linking_moad. Configuration files include:

PROTAC design

Design PROTAC molecules by linking fragments for the 43 fragment pairs in the PROTAC-DB test set.

python scripts/sample_drug3d.py \
    --config_task configs/sample/test/linking_protacdb/fixed_fragpos.yml \
    --outdir outputs_test/linking_protacdb \
    --device cuda:0

The task configuration files are in configs/sample/test/linking_protacdb. Configuration files include (all assume known connecting atoms of fragments):

Fragment growing

Design molecules by growing fragments for the 53 pairs of fragments and proteins in the fragment growing test set.

python scripts/sample_drug3d.py \
    --config_task configs/sample/test/growing_csd/base.yml \
    --outdir outputs_test/growing_csd \
    --device cuda:0

The task configuration file is configs/sample/test/growing_csd/base.yml.

De novo peptide design

Design peptides for the 35 protein pockets in the peptide design test set.

python scripts/sample_pdb.py \
    --config_task configs/sample/test/pepdesign_pepbdb/base.yml \
    --outdir outputs_test/pepdesign_pepbdb \
    --device cuda:0

The task configuration file is configs/sample/test/pepdesign_pepbdb/base.yml.

Peptide inverse folding

Design peptides for the 35 pairs of backbone structures and protein pockets in the peptide design test set.

python scripts/sample_pdb.py \
    --config_task configs/sample/test/pepinv_pepbdb/base.yml \
    --outdir outputs_test/pepinv_pepbdb \
    --device cuda:0

The task configuration file is configs/sample/test/pepinv_pepbdb/base.yml.

Sample for provided data

Here, we demonstrate some examples of sampling using the provided data in the data/examples directory.

Run the following command:

python scripts/sample_use.py \
    --config_task configs/sample/examples/dockmol.yml \
    --outdir outputs_examples \
    --device cuda:0

The configuration files are in configs/sample/examples, including:

More examples are on the way.

Configuration explanation

You can refer to these configuration files to adapt the model to your own data and tasks. Below is a brief explanation of the configuration format.

Typically there are five main blocks: sample, data, transforms, task, and noise. The first three blocks define the data and sampling parameters, while the last two define the task. In most cases, you only need to find a configuration file for your task as a template and modify the first three blocks.
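A hypothetical skeleton of such a configuration file, with the five blocks named above; the keys and comments are illustrative, not copied from the repository's actual files:

```yaml
sample:            # sampling parameters (e.g. number of samples)
  batch_size: 100  # illustrative value; see the notes on GPU memory above
data:              # paths to the input protein / molecule files
transforms:        # featurization applied to the inputs
task:              # task definition (docking, SBDD, linking, ...)
noise:             # noise settings for the generative process
```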

Train

Make sure to download and extract the training data data_train_processed_reduced.tar.gz as described in the Data and model weights section. Then run the following command to train the model on the reduced data:

python scripts/train_pl.py --config configs/train/train_pxm_reduced.yml --num_gpus 1

You can specify the number of GPUs to use with the num_gpus parameter. The training configuration file is configs/train/train_pxm_reduced.yml; adjust its batch_size parameter to fit your GPU memory.

If you want to train the model with the full training data, please follow the instructions in the Raw data and processing steps section to process the raw data for training. Then, modify data.dataset.root and data.dataset.assembly_path in the training configuration file to point to the full training data directory and run the training command as above.
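As a hedged sketch, the modified part of the training configuration might look like the fragment below; the two key paths come from the paragraph above, but the directory and file names are placeholders for wherever you stored the processed full data:

```yaml
# Illustrative paths only -- point these at your processed full training data.
data:
  dataset:
    root: ./data_full                        # full training data directory (placeholder)
    assembly_path: ./data_full/assembly.pkl  # assembly file (placeholder name)
```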