How should we preprocess .pdbqt files before using RTMScore?

zzzzzx-1115 commented 2 years ago

We are now considering using RTMScore to rank the different results given by AutoDock Vina (the dataset is PDBbind). Unfortunately, Vina always outputs .pdbqt files, which cannot be used directly as input to RTMScore.

So could you please advise us on what to do? We have already used Open Babel to convert .pdbqt to .sdf, but we ran into a lot of confusing bugs.

Thank you!

sc8668 commented 2 years ago

In this study we also used Open Babel to convert .pdbqt to .sdf, and molecules that failed conversion were simply skipped. Another strategy is to record the 3D coordinates of the docking poses and then update the coordinates of the input molecules (.sdf or .mol2) with the newly generated ones. I hope these suggestions help!
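For readers who want to try the second strategy, here is a minimal sketch of recording the pose coordinates straight from a Vina output file. It assumes the .pdbqt file uses the standard PDB fixed-width columns for x, y, z, that poses are delimited by `MODEL`/`ENDMDL` records, and that the AutoDock atom type is the last whitespace-separated token on each atom line. The function names `read_pdbqt_poses` and `heavy_atom_coords` are hypothetical and are not part of RTMScore.

```python
def read_pdbqt_poses(text):
    """Return one list of (atom_name, autodock_type, (x, y, z)) per pose."""
    poses, current = [], []
    for line in text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            name = line[12:16].strip()
            # Assumption: coordinates sit in the usual PDB columns 31-54.
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            # Assumption: the AutoDock atom type is the last token on the line.
            ad_type = line.split()[-1]
            current.append((name, ad_type, xyz))
        elif record == "ENDMDL":
            poses.append(current)
            current = []
    if current:
        # File with a single pose and no MODEL/ENDMDL records.
        poses.append(current)
    return poses


def heavy_atom_coords(pose):
    """Drop atoms whose AutoDock type marks a hydrogen (H, HD, HS)."""
    return [xyz for _, ad_type, xyz in pose if ad_type not in {"H", "HD", "HS"}]
```

Note that the atom order in a .pdbqt file may differ from the order in the original .sdf or .mol2 file, so the coordinates generally need to be mapped back onto the input molecule by atom identity rather than by position in the file.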

zzzzzx-1115 commented 2 years ago

That strategy sounds really cool! We have run several examples by replacing the coordinates of the reference molecule with the newly generated ones, but we still seem unable to pick out the pose that best matches the reference, which should have received the highest score. Our setup is as follows:

We first use rdkit.Chem.RemoveHs() to remove the hydrogen atoms from both the generated molecule and the reference molecule (because the former has no implicit hydrogen atoms described in its .sdf file but the latter has), then update the coordinates in the reference file according to those in the generated one, and finally use rdkit.Chem.AddHs(addCoords=True) to add the hydrogen atoms back. But the results are not very satisfying. Is the last step necessary? And is our approach in line with your suggestion?
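The three steps described above could look roughly like the sketch below. It is not taken from the thread: the helper name `transfer_pose` is hypothetical, and the atom mapping via `GetSubstructMatch` is an assumption added here because the thread does not say how atoms in the two files were paired. The mapping can fail if the converted pose has different bond orders from the reference, in which case the function raises instead of guessing.

```python
from rdkit import Chem


def transfer_pose(reference, pose):
    """Copy heavy-atom coordinates of `pose` onto `reference`, then re-add Hs.

    Both arguments are RDKit molecules that carry a 3D conformer.
    """
    ref_heavy = Chem.RemoveHs(reference)   # returns a new molecule
    pose_heavy = Chem.RemoveHs(pose)

    # Assumption: the two molecules share the same heavy-atom graph.
    # match[i] is the pose atom index corresponding to reference atom i.
    match = pose_heavy.GetSubstructMatch(ref_heavy)
    if (len(match) != ref_heavy.GetNumAtoms()
            or pose_heavy.GetNumAtoms() != ref_heavy.GetNumAtoms()):
        raise ValueError("could not map reference atoms onto the docked pose")

    ref_conf = ref_heavy.GetConformer()
    pose_conf = pose_heavy.GetConformer()
    for ref_idx, pose_idx in enumerate(match):
        ref_conf.SetAtomPosition(ref_idx, pose_conf.GetAtomPosition(pose_idx))

    # Place hydrogens geometrically around the updated heavy atoms.
    return Chem.AddHs(ref_heavy, addCoords=True)
```

Mapping by graph match rather than by file order guards against the case where the docking preparation step reordered the atoms.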

sc8668 commented 2 years ago

As I understand it, you have successfully rescored the molecules with RTMScore, but rescoring with RTMScore alone did not give satisfactory results. Note that our method exhibits excellent docking and screening power rather than scoring and ranking power. Additionally, the performance is evaluated in terms of overall statistics, so it is common to see poor performance on some targets.

zzzzzx-1115 commented 2 years ago

Sorry, I did not explain the background clearly, so you probably misunderstood our purpose.

We feed a protein-ligand pair (e.g. 1t7j_protein.pdb and 1t7j_ligand.mol2) into AutoDock Vina and get different binding poses (ligand_out_1t7j.sdf, converted from .pdbqt) for this one pair only. In this scenario RTMScore is supposed to give the highest score to the pose that best matches the real one among all the output poses, but we failed to achieve that.

I would appreciate it if you could spare some time to help us check what led to our failure. The attachment is the example mentioned above (1t7j), and the problem is that the highest-ranked pose (the 331st in ligand_out_1t7j.sdf) is obviously worse than the 15th one (we visualized them in PyMOL, by the way). 1t7j.zip
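A visual comparison like the one above can be backed up with a number. The sketch below computes a plain in-place RMSD between each pose and the crystal ligand and sorts the poses by it. It assumes that every pose lists its heavy atoms in the same order as the reference and applies no symmetry correction, so it may overestimate the RMSD for symmetric ligands. No superposition is done, since docked poses already sit in the protein's coordinate frame. The names `rmsd` and `rank_by_rmsd` are hypothetical helpers, not RTMScore functions.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(total / len(coords_a))


def rank_by_rmsd(reference, poses):
    """Return (pose_number, rmsd) pairs, closest pose first (1-based numbering)."""
    scored = [(i, rmsd(reference, pose)) for i, pose in enumerate(poses, start=1)]
    return sorted(scored, key=lambda item: item[1])
```

Comparing this ordering with the ordering by RTMScore shows whether the top-scored pose is also among the near-native ones.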

sc8668 commented 2 years ago

The following results were generated from your files with the command "python rtmscore.py -p 1t7j_protein.pdb -l ligand_out_1t7j.sdf -m ../trained_models/rtmscore_model1.pth -o xxxqq -c 10.0 -rl 1t7j_ligand.mol2 -gen_pocket", and with it we can successfully identify the near-native poses. Could something be going wrong in your setup?

xxxqq.csv
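When rescoring many complexes, the command quoted above can be assembled programmatically. The sketch below only builds the argument list from the flags that appear in that command (-p, -l, -m, -o, -c, -rl, -gen_pocket). The function name `build_rtmscore_command` is hypothetical, and the script location is assumed to be the current working directory.

```python
import subprocess


def build_rtmscore_command(protein, poses, reference_ligand, model,
                           out_prefix, cutoff=10.0):
    """Build the rtmscore.py argument list for one protein-ligand pair."""
    return [
        "python", "rtmscore.py",
        "-p", protein,            # receptor structure
        "-l", poses,              # docked poses (.sdf)
        "-m", model,              # trained model weights
        "-o", out_prefix,         # output prefix, as in the quoted command
        "-c", str(cutoff),        # pocket cutoff, 10.0 in the quoted command
        "-rl", reference_ligand,  # reference ligand used to define the pocket
        "-gen_pocket",
    ]


def run_rtmscore(*args, **kwargs):
    """Run the command and raise if rtmscore.py exits with an error."""
    subprocess.run(build_rtmscore_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.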

zzzzzx-1115 commented 2 years ago

Right, we finally found that using RDKit to read the molecule files with sanitization disabled leads to this unexpected behavior. Thanks to your patient replies, everything is OK now.

I am very grateful for your help!
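For anyone hitting the same problem, a minimal sketch of reading the converted poses with sanitization left on might look like this. The helper name `load_poses` is hypothetical. Records that RDKit cannot parse or sanitize come back as `None` from the supplier and are collected here rather than silently passed on, which mirrors the maintainer's note that molecules failing conversion were skipped.

```python
from rdkit import Chem


def load_poses(sdf_path):
    """Read all poses from an SDF file with sanitization enabled.

    Returns (poses, failed), where `failed` holds the 0-based record
    indices that RDKit could not parse or sanitize.
    """
    supplier = Chem.SDMolSupplier(sdf_path, sanitize=True, removeHs=False)
    poses, failed = [], []
    for index, mol in enumerate(supplier):
        if mol is None:
            failed.append(index)
        else:
            poses.append(mol)
    return poses, failed
```

Keeping `removeHs=False` preserves any hydrogens written in the file; whether that is the right choice for a given workflow is a judgment call and not something the thread settles.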

sc8668 / RTMScore

How should we preprocess .pdbqt files before using RTMScore? #8