t7morgen / misato-dataset

GNU Lesser General Public License v2.1
Some questions on Misato and Misato-binding #17

Open EDAPINENUT opened 1 week ago

EDAPINENUT commented 1 week ago

Thanks for open-sourcing your code and for the fantastic work.

I have a few questions I’d like to ask.

Firstly, the protein-ligand complexes in MISATO likely come primarily from the PDBbind database. However, the Ligand Binding Affinity (LBA) dataset contains fewer experimentally measured binding affinities than misato-affinity/data/affinity_data.csv provides. How were these additional binding affinities obtained? Were they calculated or model-predicted?

Secondly, the protein-ligand complexes in MISATO-MD are represented in PDB format, which omits bonding information for small molecules. How can these be restored as valid SDF files? I found the corresponding molecules in the QM dataset, but the bonding was incorrect, and I could not restore them to SDF format. Could an official script be provided to convert the MISATO-MD files into separate PDB and SDF files?

Additionally, I noticed that some hydrogen atoms are missing in the protein's PDB structure, which leaves unsaturated bonds and prevents further calculations. Is there a way to refine these structures so that all bonds are saturated and subsequent quantum calculations become feasible?

t7morgen commented 1 week ago

Thank you very much for your questions.

1) The binding affinities come from PDBbind version 2021. Maybe LBA uses a different version? These are all experimental binding affinities; none were calculated.

2) For the bonding information, you could download the corresponding topology files (parameter_restart_files_MD.tar.gz) and extract this information using parmed. Additionally, bonds should be listed in the CONECT records at the end of the PDB files.

2.2) A script for converting h5 to PDB can be found here: https://github.com/t7morgen/misato-dataset/blob/master/src/data/processing/h5_to_traj.py

Concerning the missing hydrogen atoms on the protein side, could you please give the corresponding PDB ID and residue number? The protonation was performed using the AMBER tleap program.
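For the second route mentioned above (reading bonds from the connectivity records of a PDB file), a minimal sketch in plain Python might look like the following. It is not part of the MISATO code base: the function name `parse_conect_bonds` and the example records are hypothetical, and it assumes the standard fixed-width PDB layout, where a `CONECT` line holds the source atom serial in columns 7-11 and up to four bonded serials in columns 12-31.

```python
def parse_conect_bonds(pdb_text):
    """Return a sorted list of unique (serial_a, serial_b) bonds, a < b.

    Hypothetical helper: reads only the CONECT records of a PDB file
    and ignores every other record type.
    """
    bonds = set()
    for line in pdb_text.splitlines():
        if not line.startswith("CONECT"):
            continue
        # Fixed-width fields of 5 characters, starting at column 7 (index 6)
        # and ending at column 31, which covers the covalent bond partners.
        fields = [line[i:i + 5].strip() for i in range(6, min(len(line), 31), 5)]
        serials = [int(f) for f in fields if f]
        if len(serials) < 2:
            continue
        source = serials[0]
        for partner in serials[1:]:
            if partner != source:
                # Store each bond once, regardless of listing direction.
                bonds.add((min(source, partner), max(source, partner)))
    return sorted(bonds)


# Made-up records for a four-atom fragment (not taken from MISATO).
EXAMPLE = "\n".join([
    "CONECT    1    2    3",
    "CONECT    2    1",
    "CONECT    3    1    4",
    "CONECT    4    3",
])

print(parse_conect_bonds(EXAMPLE))
```

Note that `CONECT` records give connectivity only. Bond orders and formal charges, which a chemically valid SDF needs, would still have to come from another source such as the topology files or a cheminformatics toolkit.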