ketatam / DiffDock-PP

Implementation of DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models in PyTorch (ICLR 2023 - MLDD Workshop)
https://arxiv.org/abs/2304.03889

Script for docking based on pdb files #3

Closed. PatWalters closed this issue 1 year ago.

PatWalters commented 1 year ago

Hi,

Thanks for your paper and the code. Could you provide an example showing how to dock two proteins where both are pdb files?

Thanks!

ketatam commented 1 year ago

Hi! Thanks a lot for your interest in our paper and code.

Please check out the DB5Loader, which does exactly this. See the last point in https://github.com/ketatam/DiffDock-PP#inference and the related issue #1.

Let me know if you need further info.

PatWalters commented 1 year ago

Yes, I saw the note in the README, but the use of DB5Loader wasn't clear to me. Also, in the README, you say

and name you PDB files {PDB_ID}_l_b.pdb and {PDB_ID}_l_b.pdb

Did you mean to say

and name you PDB files {PDB_ID}_r_b.pdb and {PDB_ID}_l_b.pdb
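
For what it's worth, the DB5.5 convention appears to be r = receptor, l = ligand, b = bound, so I am assuming the pair should be {PDB_ID}_r_b.pdb plus {PDB_ID}_l_b.pdb. A quick sanity check I use (check_pair is my own helper, not part of the repo):

from pathlib import Path

def check_pair(data_dir: str, pdb_id: str) -> None:
    # Verify that both halves of a complex follow the DB5-style naming.
    for suffix in ("_l_b.pdb", "_r_b.pdb"):
        path = Path(data_dir) / f"{pdb_id}{suffix}"
        if not path.exists():
            raise FileNotFoundError(f"missing {path}")

check_pair("path/to/structures", "1A2K")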

I tried importing DB5Loader but got an error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 1
----> 1 from data_train_utils import DB5Loader

File ~/software/DiffDock-PP/src/data/data_train_utils.py:28
     24 warnings.filterwarnings("ignore",
     25     category=Bio.PDB.PDBExceptions.PDBConstructionWarning)
     26 from Bio.Data.IUPACData import protein_letters_3to1
---> 28 from utils import load_csv, printt
     29 from utils import compute_rmsd
     32 # -------- DATA LOADING -------

File ~/software/DiffDock-PP/src/data/utils.py:28
     24 warnings.filterwarnings("ignore",
     25     category=Bio.PDB.PDBExceptions.PDBConstructionWarning)
     26 from Bio.Data.IUPACData import protein_letters_3to1
---> 28 from utils import load_csv, printt
     29 from utils import compute_rmsd
     32 # -------- DATA LOADING -------

ImportError: cannot import name 'load_csv' from partially initialized module 'utils' (most likely due to a circular import) (/home/pwalters/software/DiffDock-PP/src/data/utils.py)
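
Digging into the traceback, this looks like a path issue rather than a code bug: src/data/utils.py itself does "from utils import load_csv", expecting "utils" to resolve to the top-level src/utils.py, which only happens when src/ (and not src/data/) sits at the front of sys.path. Running python src/main_inf.py gets this for free, because Python prepends the script's directory. A rough workaround sketch for interactive use (my own, assuming src/utils.py actually defines load_csv / printt / compute_rmsd):

import sys

# Put src/ first so "from utils import ..." resolves to src/utils.py
# instead of circularly re-importing src/data/utils.py.
sys.path.insert(0, "/home/pwalters/software/DiffDock-PP/src")  # adjust to your checkout

from data.data_train_utils import DB5Loader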

A simple example showing how to dock two pdb files would be helpful.

ntcockroft commented 1 year ago

I'm also interested in a simple example showing how to dock two pdb files. After spending some time on this, I am struggling to debug where I am making mistakes, since there seem to be a lot of moving parts. A simple example would be extremely helpful.

I think I have gotten the inference to run, but either I am not getting the correct output or I don't understand how to interpret it. I was expecting .pdb files of the docked structures and a list of scores. What I am getting instead is a .pkl file and this output in the terminal:

complex_rmsd_summarized: {'mean': nan, 'median': nan, 'std': nan, 'lt1': nan, 'lt2': nan, 'lt5': nan, 'lt10': nan}
interface_rmsd_summarized: {'mean': nan, 'median': nan, 'std': nan, 'lt1': nan, 'lt2': nan, 'lt5': nan, 'lt10': nan}
20:11:44 Finished 0-th sweep over the data
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 70.18it/s]
Total time spent processing 5 times: 0.01475977897644043
time_to_load_data: 24.983391284942627
Average CRMSD < 5: nan
Average CRMSD < 2: nan
20:11:44 Dumped data!!

I'll share what I have tried so far, in case it helps others troubleshoot. I built the conda environment as specified in the README. I initially tried importing DB5Loader into a python session, but after going down a rabbit hole of figuring out how to pass arguments to it and run inference after loading the data, I gave up on that approach.

Instead I created a directory called "my_test" in the downloaded project directory to serve as my working directory. I took the mentions of using inference.sh and DB5Loader as meaning I should modify config/db5_esm_inference.yaml and pass it to a modified src/inference.sh.

I copied config/db5_esm_inference.yaml to my_test/my_test.yaml and changed the data_file and data_path parameters to point to my files in the my_test working directory. The changed lines:

# file is parsed by inner-most keys only
data:
    dataset: db5
    data_file: my_test/my_test.csv
    data_path: my_test/
    resolution: residue
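
My reading of the "parsed by inner-most keys only" comment is that nested sections get flattened down to their leaf keys, so only the innermost names (data_file, data_path, ...) matter and must be unique. A small sketch of that interpretation (my own code, not the repo's actual parser):

import yaml

def innermost(node: dict, out: dict | None = None) -> dict:
    # Flatten nested config sections, keeping only leaf key/value pairs.
    out = {} if out is None else out
    for key, value in node.items():
        if isinstance(value, dict):
            innermost(value, out)
        else:
            out[key] = value
    return out

with open("my_test/my_test.yaml") as fh:
    config = innermost(yaml.safe_load(fh))
print(config["data_file"], config["data_path"])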

I edited src/inference.sh to point to my working directory files for my test run:

NUM_FOLDS=1  # number of seeds to try, default 5
SEED=0  # initial seed
CUDA=0  # will use GPUs from CUDA to CUDA + NUM_GPU - 1
NUM_GPU=1
BATCH_SIZE=1  # split across all GPUs
NUM_SAMPLES=40

NAME="my_test"  # change to name of config file
RUN_NAME="my_test_run0"
CONFIG="my_test/my_test.yaml"

SAVE_PATH="ckpts/${RUN_NAME}"
VISUALIZATION_PATH="visualization/${RUN_NAME}"
STORAGE_PATH="my_test/${RUN_NAME}.pkl"

FILTERING_PATH="checkpoints/confidence_model_dips/fold_0/"
SCORE_PATH="checkpoints/large_model_dips/fold_0/"

echo SCORE_MODEL_PATH: $SCORE_PATH
echo CONFIDENCE_MODEL_PATH: $FILTERING_PATH
echo SAVE_PATH: $SAVE_PATH

python src/main_inf.py \
    --mode "test" \
    --config_file $CONFIG \
    --run_name $RUN_NAME \
    --save_path $SAVE_PATH \
    --batch_size $BATCH_SIZE \
    --num_folds $NUM_FOLDS \
    --num_gpu $NUM_GPU \
    --gpu $CUDA --seed $SEED \
    --logger "wandb" \
    --project "DiffDock Tuning" \
    --visualize_n_val_graphs 25 \
    --visualization_path $VISUALIZATION_PATH \
    --filtering_model_path $FILTERING_PATH \
    --score_model_path $SCORE_PATH \
    --num_samples $NUM_SAMPLES \
    --prediction_storage $STORAGE_PATH \
    #--entity coarse-graining-mit \
    #--debug True # load small dataset

Prior to running src/inference.sh, the "my_test" directory contains:

1A2K_l_b.pdb  1ACB_l_b.pdb  my_test.csv
1A2K_r_b.pdb  1ACB_r_b.pdb  my_test.yaml

I obtained the 1A2K and 1ACB pdb files from the link provided at https://github.com/ketatam/DiffDock-PP#db55-data

my_test.csv contains:

path,split
/needs/full/path/to/DiffDock-PP/my_test/1A2K,test
/needs/full/path/to/DiffDock-PP/my_test/1ACB,test

Note: while relative paths worked for the other files, the paths to the pdb files must be absolute, and they must omit the suffixes (e.g., _r_b.pdb). Additionally, as mentioned in issue #4, you need at least 2 pairs or the run fails.
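
Since getting these paths right took some trial and error, here is a small script I use to generate the csv (my own helper, assuming the two complexes above):

import csv
from pathlib import Path

work_dir = Path("my_test").resolve()   # absolute paths are required
pdb_ids = ["1A2K", "1ACB"]             # at least 2 pairs, per issue #4

with open(work_dir / "my_test.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["path", "split"])
    for pdb_id in pdb_ids:
        # Path prefix only: no _l_b.pdb / _r_b.pdb suffix.
        writer.writerow([str(work_dir / pdb_id), "test"])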

Finally, I run sh src/inference.sh from the top-level project directory and get these output files (along with the terminal output at the very beginning of this post):

 my_test_cache_v2_b.pkl  my_test_run0.pkl my_test_esm_b.pkl
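
To see what actually landed in the run pickle, I just load it and print the top-level structure (the internal layout is undocumented, so I avoid guessing at specific fields):

import pickle

with open("my_test/my_test_run0.pkl", "rb") as fh:
    predictions = pickle.load(fh)

print(type(predictions))
if isinstance(predictions, dict):
    print(list(predictions.keys()))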

I also get a variety of runtime warnings (a sample below), which seem likely to be the source of the nan values in the output:

envs/diffdock_pp/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
DiffDock-PP/src/evaluation/compute_rmsd.py:69: RuntimeWarning: invalid value encountered in long_scalars
  'lt10': 100 * (rmsds_np < 10.0).sum() / len(rmsds_np)
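
The warnings suggest that the RMSD array summarized in src/evaluation/compute_rmsd.py ends up empty, in which case every statistic degenerates to nan. A minimal standalone reproduction of both warnings:

import numpy as np

rmsds_np = np.array([])                               # no complexes evaluated
print(np.mean(rmsds_np))                              # nan, "Mean of empty slice."
print(100 * (rmsds_np < 10.0).sum() / len(rmsds_np))  # nan, "invalid value encountered in long_scalars"
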
ketatam commented 1 year ago

Hi @PatWalters and @ntcockroft,

Sorry for the delayed reply, and thanks for pointing out the issues with running inference on .pdb files. Your detailed comments were very helpful in tracking the problems down.

I have fixed these issues (see the latest commit) and, as requested, added a simple example showing how to run inference on a single pdb pair (or more). See the script src/db5_inference.sh and the config file config/single_pair_inference.yaml. This script runs inference on a single pair, which is now located in datasets/single_pair_dataset/structures.

Please pull the latest code version and let me know if you still encounter any issues.

ketatam commented 1 year ago

@PatWalters to answer your specific concerns:

ketatam commented 1 year ago

@ntcockroft to answer your specific concerns:

ketatam commented 1 year ago

I will close this issue for now. Feel free to reopen it if you still encounter any issues.