Script for docking based on pdb files

PatWalters commented 1 year ago

Hi,

Thanks for your paper and the code. Could you provide an example showing how to dock two proteins where both are pdb files?

Thanks!

ketatam commented 1 year ago

Hi! Thanks a lot for your interest in our paper and code.

Please check out the DB5Loader, which does exactly this. See the last point in https://github.com/ketatam/DiffDock-PP#inference and the related issue #1.

Let me know if you need further info.

PatWalters commented 1 year ago

Yes, I saw the note in the README, but the use of DB5Loader wasn't clear to me. Also, in the README, you say

and name you PDB files {PDB_ID}_l_b.pdb and {PDB_ID}_l_b.pdb

Did you mean to say

and name you PDB files {PDB_ID}_r_b.pdb and {PDB_ID}_l_b.pdb

I tried importing DB5Loader but got an error

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 1
----> 1 from data_train_utils import DB5Loader

File ~/software/DiffDock-PP/src/data/data_train_utils.py:28
     24 warnings.filterwarnings("ignore",
     25     category=Bio.PDB.PDBExceptions.PDBConstructionWarning)
     26 from Bio.Data.IUPACData import protein_letters_3to1
---> 28 from utils import load_csv, printt
     29 from utils import compute_rmsd
     32 # -------- DATA LOADING -------

File ~/software/DiffDock-PP/src/data/utils.py:28
     24 warnings.filterwarnings("ignore",
     25     category=Bio.PDB.PDBExceptions.PDBConstructionWarning)
     26 from Bio.Data.IUPACData import protein_letters_3to1
---> 28 from utils import load_csv, printt
     29 from utils import compute_rmsd
     32 # -------- DATA LOADING -------

ImportError: cannot import name 'load_csv' from partially initialized module 'utils' (most likely due to a circular import) (/home/pwalters/software/DiffDock-PP/src/data/utils.py)

A simple example showing how to dock two pdb files would be helpful.

ntcockroft commented 1 year ago

I'm also interested in a simple example that shows how to dock two pdb files. After spending some time on this, I am struggling to debug where I am making mistakes, since there seem to be a lot of moving parts. A simple example would be extremely helpful.

I think I have gotten the inference to run, but either I am not getting the correct output or I don't understand how to interpret it. I was expecting to get some .pdb files of the docked structures and a list of scores. What I am getting is a .pkl file and this output in the terminal:

complex_rmsd_summarized: {'mean': nan, 'median': nan, 'std': nan, 'lt1': nan, 'lt2': nan, 'lt5': nan, 'lt10': nan}
interface_rmsd_summarized: {'mean': nan, 'median': nan, 'std': nan, 'lt1': nan, 'lt2': nan, 'lt5': nan, 'lt10': nan}
20:11:44 Finished 0-th sweep over the data
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 70.18it/s]
Total time spent processing 5 times: 0.01475977897644043
time_to_load_data: 24.983391284942627
Average CRMSD < 5: nan
Average CRMSD < 2: nan
20:11:44 Dumped data!!

I'll share what I have tried so far; maybe that will help others troubleshoot. I built the conda environment as specified in the README. I initially tried importing DB5Loader into a Python session, but after going down a rabbit hole of figuring out how to pass arguments to it and run inference after loading the data, I gave up on that approach.

Instead I created a directory called "my_test" in the downloaded project directory to serve as my working directory. I took the mentions of using inference.sh and DB5Loader as meaning I should modify config/db5_esm_inference.yaml and pass it to a modified src/inference.sh.

I copied config/db5_esm_inference.yaml to my_test/my_test.yaml and changed the data_file and data_path parameters to point to my files in the my_test working directory. The lines that changed:

 # file is parsed by inner-most keys only
 data:
     dataset: db5
     data_file: my_test/my_test.csv
     data_path: my_test/
     resolution: residue
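
The comment "file is parsed by inner-most keys only" suggests that section names such as data are dropped and only the leaf keys are used. The following is a minimal Python sketch of that idea, not the project's actual parser; the function name flatten_innermost and the choice to raise on duplicate keys are assumptions made for illustration.

```python
# Hypothetical illustration: flatten a nested config so that only the
# innermost keys survive, as the comment in the YAML file describes.


def flatten_innermost(config):
    """Return a flat dict keyed by the innermost keys of a nested dict."""
    flat = {}
    for key, value in config.items():
        # Section names such as "data" are dropped; only leaf keys are kept.
        items = flatten_innermost(value).items() if isinstance(value, dict) else [(key, value)]
        for inner_key, inner_value in items:
            if inner_key in flat:
                # Assumption: a repeated leaf key is treated as an error here.
                raise KeyError(f"duplicate innermost key: {inner_key}")
            flat[inner_key] = inner_value
    return flat


# The same values as the config excerpt above, written as a Python dict.
config = {
    "data": {
        "dataset": "db5",
        "data_file": "my_test/my_test.csv",
        "data_path": "my_test/",
        "resolution": "residue",
    }
}
flat_config = flatten_innermost(config)
print(flat_config["data_file"])
```

If the real parser behaves this way, two sections that reuse the same leaf key would collide, so leaf key names need to be unique across the file.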

I edited src/inference.sh to point to my working directory files for my test run:

NUM_FOLDS=1  # number of seeds to try, default 5
SEED=0  # initial seed
CUDA=0  # will use GPUs from CUDA to CUDA + NUM_GPU - 1
NUM_GPU=1
BATCH_SIZE=1  # split across all GPUs
NUM_SAMPLES=40

NAME="my_test"  # change to name of config file
RUN_NAME="my_test_run0"
CONFIG="my_test/my_test.yaml"

SAVE_PATH="ckpts/${RUN_NAME}"
VISUALIZATION_PATH="visualization/${RUN_NAME}"
STORAGE_PATH="my_test/${RUN_NAME}.pkl"

FILTERING_PATH="checkpoints/confidence_model_dips/fold_0/"
SCORE_PATH="checkpoints/large_model_dips/fold_0/"

echo SCORE_MODEL_PATH: $SCORE_PATH
echo CONFIDENCE_MODEL_PATH: $FILTERING_PATH
echo SAVE_PATH: $SAVE_PATH

python src/main_inf.py \
    --mode "test" \
    --config_file $CONFIG \
    --run_name $RUN_NAME \
    --save_path $SAVE_PATH \
    --batch_size $BATCH_SIZE \
    --num_folds $NUM_FOLDS \
    --num_gpu $NUM_GPU \
    --gpu $CUDA --seed $SEED \
    --logger "wandb" \
    --project "DiffDock Tuning" \
    --visualize_n_val_graphs 25 \
    --visualization_path $VISUALIZATION_PATH \
    --filtering_model_path $FILTERING_PATH \
    --score_model_path $SCORE_PATH \
    --num_samples $NUM_SAMPLES \
    --prediction_storage $STORAGE_PATH \
    #--entity coarse-graining-mit \
    #--debug True # load small dataset

Prior to running src/inference.sh the "my_test" directory contains:

1A2K_l_b.pdb  1ACB_l_b.pdb  my_test.csv
1A2K_r_b.pdb  1ACB_r_b.pdb  my_test.yaml

I obtained 1A2K and 1ACB pdb files from the provided link at: https://github.com/ketatam/DiffDock-PP#db55-data

my_test.csv contains:

path,split
/needs/full/path/to/DiffDock-PP/my_test/1A2K,test
/needs/full/path/to/DiffDock-PP/my_test/1ACB,test

Note: while other files have worked with relative paths, the paths to the pdb files need to be full paths and must not contain the suffixes (e.g., _r_b.pdb). Additionally, as mentioned in issue #4, you need to have 2 pairs or else the run fails.
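
The CSV rules described above (absolute path prefixes, no _r_b.pdb or _l_b.pdb suffix, at least two pairs) can be sketched in Python as follows. This is a hedged illustration, not part of the repository: write_pair_csv is a hypothetical helper, and the two-pair check only reflects the behaviour reported in this comment.

```python
import csv
from pathlib import Path


def write_pair_csv(structure_dir, csv_path, split="test"):
    """Write a path,split CSV listing every complete receptor/ligand pair."""
    structure_dir = Path(structure_dir).resolve()  # full paths are required
    rows = []
    for receptor in sorted(structure_dir.glob("*_r_b.pdb")):
        pdb_id = receptor.name[: -len("_r_b.pdb")]
        ligand = structure_dir / f"{pdb_id}_l_b.pdb"
        if ligand.exists():
            # Store the path prefix only, without the _r_b.pdb / _l_b.pdb suffix.
            rows.append((str(structure_dir / pdb_id), split))
    if len(rows) < 2:
        # Assumption taken from this thread: a single pair made the run fail.
        raise ValueError("at least two receptor/ligand pairs are needed")
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["path", "split"])
        writer.writerows(rows)
    return rows
```

Calling write_pair_csv("my_test", "my_test/my_test.csv") from the top-level project directory would produce a file of the same shape as the one shown above. Receptor files without a matching ligand file are skipped.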

Finally, I run sh src/inference.sh from the top level project directory and get these files output (along with the terminal output at the very beginning of this post).

 my_test_cache_v2_b.pkl  my_test_run0.pkl my_test_esm_b.pkl

I also get a variety of runtime warnings (a sample is below), which seem likely to be causing the nan values in the output:

envs/diffdock_pp/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
DiffDock-PP/src/evaluation/compute_rmsd.py:69: RuntimeWarning: invalid value encountered in long_scalars
  'lt10': 100 * (rmsds_np < 10.0).sum() / len(rmsds_np)
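
Both warnings are what NumPy emits when statistics are computed on an empty array: the mean of an empty slice is nan, and the percentage becomes 0 divided by 0. The sketch below reproduces that situation with the standard library only and guards against it explicitly. It is an assumption-laden illustration, not the code in compute_rmsd.py; the name summarize_rmsds is hypothetical, and the field names are copied from the terminal output above.

```python
import math
import statistics


def summarize_rmsds(rmsds):
    """Summarize RMSD values, returning NaN for every field if the list is empty."""
    keys = ("mean", "median", "std", "lt1", "lt2", "lt5", "lt10")
    if not rmsds:
        # No predictions were scored, so every statistic is undefined.
        return {key: math.nan for key in keys}
    summary = {
        "mean": statistics.fmean(rmsds),
        "median": statistics.median(rmsds),
        "std": statistics.pstdev(rmsds),  # population std, like NumPy's default
    }
    for threshold in (1, 2, 5, 10):
        hits = sum(1 for value in rmsds if value < threshold)
        # Percentage of values strictly below the threshold.
        summary[f"lt{threshold}"] = 100 * hits / len(rmsds)
    return summary


empty_summary = summarize_rmsds([])
filled_summary = summarize_rmsds([0.5, 3.0, 12.0])
print(empty_summary)
print(filled_summary)
```

A summary in which every field is nan therefore points to an empty list of RMSD values rather than to bad docking results.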

ketatam commented 1 year ago

Hi @PatWalters and @ntcockroft,

Sorry for the delayed reply, and thanks for pointing out the issues with running inference on .pdb files. Your extensive comments were very helpful in solving them.

I have fixed these issues (see the latest commit) and, as requested, I have added a simple example that shows how to run inference on one or more pairs of pdb files. See the script src/db5_inference.sh and the config file config/single_pair_inference.yaml. This script allows you to run inference on a single pair that is now located in datasets/single_pair_dataset/structures.

Please pull the latest code version and let me know if you still encounter any issues.

ketatam commented 1 year ago

@PatWalters to answer your specific concerns:

Yes, you are right: the files should be {PDB_ID}_l_b.pdb and {PDB_ID}_r_b.pdb. I have updated the README.md accordingly.
The import error you encountered occurs because there were two files named utils located on different levels (one in src and one in src/data). When you ran the command, it referenced the wrong one. I have renamed the utils file in src/data, so this issue should be fixed now.
I have added a simple example showing how to dock two pdb files in src/db5_inference.sh.
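
The name clash described above can be reproduced in isolation. The following Python sketch builds a throwaway src and src/data layout with two utils.py files and asks the standard import machinery which file it would pick for a given search order. The helper resolve_module and the file contents are hypothetical; only importlib.machinery.PathFinder.find_spec is a real standard-library call.

```python
import importlib
import importlib.machinery
import tempfile
from pathlib import Path


def resolve_module(name, search_dirs):
    """Return the file that the first matching directory provides for `name`."""
    importlib.invalidate_caches()
    spec = importlib.machinery.PathFinder.find_spec(name, [str(d) for d in search_dirs])
    return Path(spec.origin) if spec is not None else None


with tempfile.TemporaryDirectory() as root:
    src = Path(root) / "src"
    data = src / "data"
    data.mkdir(parents=True)
    # Two modules with the same name on different levels (contents are made up).
    (src / "utils.py").write_text("def load_csv():\n    return 'from src'\n")
    (data / "utils.py").write_text("# data helpers, no load_csv here\n")

    # With src/data searched first, "utils" resolves to the file in src/data.
    found_from_data = resolve_module("utils", [data, src]).relative_to(root).as_posix()
    # With src searched first, the same name resolves to the file in src.
    found_from_src = resolve_module("utils", [src, data]).relative_to(root).as_posix()

print(found_from_data)  # src/data/utils.py
print(found_from_src)   # src/utils.py
```

Because the first directory on the search path wins, giving the two files distinct names removes the ambiguity regardless of where the interpreter is started.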

ketatam commented 1 year ago

@ntcockroft to answer your specific concerns:

As mentioned above, I have added a simple example showing how to dock two pdb files in src/db5_inference.sh.
The nan values appeared because you were running inference without the confidence model. This behaviour can be disabled by setting the flag run_inference_without_confidence_model to False in the config file. It was implemented only for evaluation and debugging purposes.
If you would like to get the .pdb files of the docked structures, you should enable the visualization mechanism as described in the README.md. It will give you the structure of the complex at every step of the reverse diffusion process. Note that visualization is not enabled by default because it adds overhead in compute time and memory and is not always wanted. Let me know if you encounter any issues.
The logic of the evaluation is implemented here and can be updated depending on your desired evaluation.
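
The layout of the .pkl file written to the prediction storage path is not described in this thread, so a first step is simply to look at its top-level object. The sketch below is a generic Python helper for that, under the assumption that the file is an ordinary pickle; describe_pickle is a hypothetical name, and loading the real file may require running inside the project's conda environment so that any pickled classes can be imported.

```python
import pickle
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and return a short description of its top-level object."""
    with Path(path).open("rb") as handle:
        obj = pickle.load(handle)  # only unpickle files you created or trust
    description = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        # List the keys so the stored fields can be explored one by one.
        description["keys"] = [str(key) for key in obj]
    elif isinstance(obj, (list, tuple)):
        description["length"] = len(obj)
    return description
```

For the run described earlier, the call would be describe_pickle("my_test/my_test_run0.pkl").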

ketatam commented 1 year ago

I will close this issue for now. Feel free to reopen it if you still encounter any issues.

ketatam / DiffDock-PP

Script for docking based on pdb files #3