Closed PatWalters closed 1 year ago
Hi! Thanks a lot for your interest in our paper and code.
Please check out the `DB5Loader`, which does exactly this. See the last point in https://github.com/ketatam/DiffDock-PP#inference and the related issue #1.
Let me know if you need further info.
Yes, I saw the note in the README, but the use of `DB5Loader` wasn't clear to me. Also, in the README, you say

> and name you PDB files {PDB_ID}_l_b.pdb and {PDB_ID}_l_b.pdb

Did you mean to say

> and name your PDB files {PDB_ID}_r_b.pdb and {PDB_ID}_l_b.pdb
I tried importing `DB5Loader` but got an error:

```
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 1
----> 1 from data_train_utils import DB5Loader

File ~/software/DiffDock-PP/src/data/data_train_utils.py:28
     24 warnings.filterwarnings("ignore",
     25     category=Bio.PDB.PDBExceptions.PDBConstructionWarning)
     26 from Bio.Data.IUPACData import protein_letters_3to1
---> 28 from utils import load_csv, printt
     29 from utils import compute_rmsd
     32 # -------- DATA LOADING -------

File ~/software/DiffDock-PP/src/data/utils.py:28
     24 warnings.filterwarnings("ignore",
     25     category=Bio.PDB.PDBExceptions.PDBConstructionWarning)
     26 from Bio.Data.IUPACData import protein_letters_3to1
---> 28 from utils import load_csv, printt
     29 from utils import compute_rmsd
     32 # -------- DATA LOADING -------

ImportError: cannot import name 'load_csv' from partially initialized module 'utils' (most likely due to a circular import) (/home/pwalters/software/DiffDock-PP/src/data/utils.py)
```
A simple example showing how to dock two pdb files would be helpful.
I'm also interested in a simple example that would show how to dock two pdb files. After spending some time on this, I am struggling to debug where I am making mistakes, since there seem to be a lot of moving parts. A simple example would be extremely helpful.
I think I have gotten the inference to run, but either I am not getting the correct output or I don't understand how to interpret it. I was expecting to get some .pdb files of the docked structures and a list of scores. What I am getting is a .pkl file and this output in the terminal:
```
complex_rmsd_summarized: {'mean': nan, 'median': nan, 'std': nan, 'lt1': nan, 'lt2': nan, 'lt5': nan, 'lt10': nan}
interface_rmsd_summarized: {'mean': nan, 'median': nan, 'std': nan, 'lt1': nan, 'lt2': nan, 'lt5': nan, 'lt10': nan}
20:11:44 Finished 0-th sweep over the data
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 70.18it/s]
Total time spent processing 5 times: 0.01475977897644043
time_to_load_data: 24.983391284942627
Average CRMSD < 5: nan
Average CRMSD < 2: nan
20:11:44 Dumped data!!
```
I'll share what I have tried so far, and maybe that will help others troubleshoot. I built the conda environment as specified in the README. I initially tried importing `DB5Loader` into a Python session, but after going down a rabbit hole of figuring out how to pass arguments to it and run inference after loading the data, I gave up on that approach.

Instead, I created a directory called `my_test` in the downloaded project directory to serve as my working directory. I took the mentions of using `inference.sh` and `DB5Loader` to mean I should modify `config/db5_esm_inference.yaml` and pass it to a modified `src/inference.sh`.
I copied `config/db5_esm_inference.yaml` to `my_test/my_test.yaml` and changed the `data_file` and `data_path` parameters to point to my files in the `my_test` working directory. The lines that changed:
```yaml
# file is parsed by inner-most keys only
data:
  dataset: db5
  data_file: my_test/my_test.csv
  data_path: my_test/
  resolution: residue
```
I edited `src/inference.sh` to point to my working directory files for my test run:
```shell
NUM_FOLDS=1  # number of seeds to try, default 5
SEED=0       # initial seed
CUDA=0       # will use GPUs from CUDA to CUDA + NUM_GPU - 1
NUM_GPU=1
BATCH_SIZE=1  # split across all GPUs
NUM_SAMPLES=40
NAME="my_test"  # change to name of config file
RUN_NAME="my_test_run0"
CONFIG="my_test/my_test.yaml"
SAVE_PATH="ckpts/${RUN_NAME}"
VISUALIZATION_PATH="visualization/${RUN_NAME}"
STORAGE_PATH="my_test/${RUN_NAME}.pkl"
FILTERING_PATH="checkpoints/confidence_model_dips/fold_0/"
SCORE_PATH="checkpoints/large_model_dips/fold_0/"

echo SCORE_MODEL_PATH: $SCORE_PATH
echo CONFIDENCE_MODEL_PATH: $SCORE_PATH
echo SAVE_PATH: $SAVE_PATH

python src/main_inf.py \
    --mode "test" \
    --config_file $CONFIG \
    --run_name $RUN_NAME \
    --save_path $SAVE_PATH \
    --batch_size $BATCH_SIZE \
    --num_folds $NUM_FOLDS \
    --num_gpu $NUM_GPU \
    --gpu $CUDA --seed $SEED \
    --logger "wandb" \
    --project "DiffDock Tuning" \
    --visualize_n_val_graphs 25 \
    --visualization_path $VISUALIZATION_PATH \
    --filtering_model_path $FILTERING_PATH \
    --score_model_path $SCORE_PATH \
    --num_samples $NUM_SAMPLES \
    --prediction_storage $STORAGE_PATH \
    #--entity coarse-graining-mit \
    #--debug True  # load small dataset
```
Prior to running `src/inference.sh`, the `my_test` directory contains:
```
1A2K_l_b.pdb  1ACB_l_b.pdb  my_test.csv
1A2K_r_b.pdb  1ACB_r_b.pdb  my_test.yaml
```
I obtained the 1A2K and 1ACB pdb files from the link provided at https://github.com/ketatam/DiffDock-PP#db55-data.
`my_test.csv` contains:

```
path,split
/needs/full/path/to/DiffDock-PP/my_test/1A2K,test
/needs/full/path/to/DiffDock-PP/my_test/1ACB,test
```
Note: while other files have worked with relative paths, the paths to the pdb files need to be full (absolute) paths, and they must not include the suffixes (e.g., `_r_b.pdb`). Additionally, as mentioned in issue #4, you need to have 2 pairs or else the run fails.
Finally, I run `sh src/inference.sh` from the top-level project directory and get these output files (along with the terminal output at the very beginning of this post):
```
my_test_cache_v2_b.pkl  my_test_run0.pkl  my_test_esm_b.pkl
```
I also get a variety of runtime warnings (a sample of them below), which seem likely to be the source of the nan values:
```
envs/diffdock_pp/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
DiffDock-PP/src/evaluation/compute_rmsd.py:69: RuntimeWarning: invalid value encountered in long_scalars
  'lt10': 100 * (rmsds_np < 10.0).sum() / len(rmsds_np)
```
Hi @PatWalters and @ntcockroft,
Sorry for the delayed reply, and thanks for pointing out the issues with running inference on .pdb files. Your extensive comments were very helpful in solving them.
I have fixed these issues (see the latest commit) and, as requested, I have added a simple example showing how to run inference on one (or more) pdb files. See the script `src/db5_inference.sh` and the config file `config/single_pair_inference.yaml`. This script allows you to run inference on a single pair that is now located in `datasets/single_pair_dataset/structures`.
Please pull the latest code version and let me know if you still encounter any issues.
@PatWalters to answer your specific concerns:

- There are two `utils` files located on different levels (one in `src` and one in `src/data`). When you ran the command, it referenced the wrong one. I have renamed the `utils` file in `src/data`, and this issue should be fixed now.
- For a simple example, see `src/db5_inference.sh`.
@ntcockroft to answer your specific concerns:

- For a simple example, see `src/db5_inference.sh`.
- You can set `run_inference_without_confidence_model` to `False` in the config file. This was just implemented for evaluation and debugging reasons.

I will close this issue for now. Feel free to reopen it if you still encounter any issue.
Hi,
Thanks for your paper and the code. Could you provide an example showing how to dock two proteins where both are pdb files?
Thanks!