openforcefield / protein-ligand-benchmark

Protein-Ligand Benchmark Dataset for Free Energy Calculations
MIT License

Clashing binding poses #24

Closed msuruzhon closed 2 years ago

msuruzhon commented 2 years ago

Hello,

I have been looking into the CDK2 test set and ran into numerical issues caused by severe steric clashes involving some of the ligands (1h1s, 28 and 29). I have attached an example picture for reference:

image

One of the protein hydrogen atoms is in very close proximity to one of the ligand oxygen atoms. Is this a known issue with this part of the test set? It might make sense to change the input coordinates because as it stands some of the systems are not possible to run "out of the box". Any suggestions will be appreciated.

Many thanks.

ppxasjsm commented 2 years ago

@jchodera, @dfhahn any ideas what might be going on here? Has this input not been used for a whole benchmark run?

davidlmobley commented 2 years ago

I agree that looks funky. This may be what was actually run, though; in general there’s a lot of inherited stuff here that is only gradually getting filtered out. For example, in some of the earlier benchmarking work from others that this built on, there were missing loops and residues in some of the protein structures, varied handling of water across targets, etc. In other words: all kinds of issues. HOWEVER, free energy calculations often gave reasonable results anyway. We (as a community) are beginning to get some of those problems removed, curated out, etc., but there’s likely still a lot more to be done.

davidlmobley commented 2 years ago

I'll leave it to @dfhahn and @ldamore to comment specifically.

dfhahn commented 2 years ago

Hi @msuruzhon, @ppxasjsm, thanks for raising this issue. It is indeed not an ideal starting pose, and it presumably originates from aligning the core of the ligands to the crystal structure ligands. This was the structure used for the previous benchmark runs, which is why it is included here as well. An energy minimization, at least in the Gromacs/pmx workflow, resolved the clash. As @davidlmobley pointed out, there are still many inherited issues in this set which need to be removed.

ppxasjsm commented 2 years ago

I am struggling to see how this is a benchmark dataset, then, if we can't use the inputs as benchmarks. I can understand that there may be some inherited issues, but a dataset with steric clashes that don't easily resolve in a minimization doesn't really seem sensible to push in the first place. Why not use the minimized/equilibrated structures that work with Gromacs? Is there anyone from OpenFF working on this at the moment? Does OpenFF not run automated benchmarks at the moment?

davidlmobley commented 2 years ago

I believe we normally start with an energy minimization.

ppxasjsm commented 2 years ago

I agree, but what if the minimisation doesn’t resolve the clashes?

davidlmobley commented 2 years ago

That would be a problem, but all of these ARE successfully used for our binding free energy benchmarking. I'm also surprised by the clash, and I suppose it could be resolved by depositing the minimized structure instead, but is that what we want? I'm not sure.

ppxasjsm commented 2 years ago

I see! Is all the information needed to reproduce your benchmarks successfully in the repo? What would be the approach to propose alternative input used, that worked in a different set of benchmarks. Would it be helpful to have the input and successfully run protocols and final outputs (not trajectories) available in this case?

dfhahn commented 2 years ago

All steric clashes which are present were easily resolved by energy minimization. But I agree it would be more sensible to provide the minimized structures, although that could lead to less well-aligned ligand sets and could (presumably only slightly) break compatibility with previously run benchmarks.

> What would be the approach to propose alternative input used, that worked in a different set of benchmarks.

I guess we want to have only one input, not alternative ones, as this might be confusing. I would suggest creating PRs with better structures, which will go into the next releases. Then you can point to the release used when reporting results.

> Would it be helpful to have the input and successfully run protocols and final outputs (not trajectories) available in this case?

What do you mean by having successfully run protocols? Just name it, or link to a repo with the protocol? Having the output could be an option, but does it add value? Will people use it for something?

bcossins commented 2 years ago

Hi, just to follow on from Miro's post. We have found that some clashes for CDK2 were not resolvable, and we tried minimising with a few different protocols, including GMX-2021 on a CPU. The image shows a clash that goes through a lysine side-chain. We are using standard, well-tested settings for minimisation.

There were various other clashes for a few other systems that caused us to want to adjust the inputs to our minimisations. This seems like it would make reproducibility and good comparisons more difficult. Removing these clashes will make these files more usable, as at the moment anyone who encounters the same problems as us would have to make up their own alternative inputs.

image

davidlmobley commented 2 years ago

Propagating this to #binding-benchmarks on OpenFF Slack; I'm thinking maybe we should also have minimized and/or cleaned up structures here and deprecate those with clashes... The important thing is to document, I think.

dotsdl commented 2 years ago

Closing this out may be dependent on resolving #20 first, but we can still find a solution to the clashing problem in the meantime before writing out new sets of structure files in a PR.

IAlibay commented 2 years ago

Note: To be updated with more information

Affected systems

Using a distance cutoff of 0.5 Å, the following systems have at least one clash for at least one of their ligands:

MCL1

BACE_hunt

PFKFB3
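
For anyone who wants to repeat this kind of check on their own inputs, here is a minimal sketch of a distance-cutoff clash test. It is not necessarily the script used to produce the list above: the function name `find_clashes`, the brute-force pairwise loop, and the toy coordinates are all assumptions made for illustration. Coordinates are taken to be in Å, and an atom pair is reported as a clash when its distance is below the cutoff (0.5 Å by default, matching the value quoted above).

```python
import math


def find_clashes(protein_xyz, ligand_xyz, cutoff=0.5):
    """Return (protein_index, ligand_index, distance) for every
    protein/ligand atom pair closer than `cutoff` (same units as the
    coordinates, assumed here to be angstroms)."""
    clashes = []
    for i, p in enumerate(protein_xyz):
        for j, l in enumerate(ligand_xyz):
            d = math.dist(p, l)  # Euclidean distance (Python 3.8+)
            if d < cutoff:
                clashes.append((i, j, d))
    return clashes


# Toy coordinates in angstroms, made up purely for illustration;
# they do not come from any structure in the benchmark set.
protein = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
ligand = [(0.3, 0.0, 0.0), (9.0, 0.0, 0.0)]

clashes = find_clashes(protein, ligand)
for prot_idx, lig_idx, dist in clashes:
    print(f"protein atom {prot_idx} / ligand atom {lig_idx}: {dist:.2f} A")
```

The double loop scales with the product of the atom counts, which is fine for a single binding site but slow for whole solvated systems; a neighbour search from a structure-analysis library would be the usual choice there. Note also that a 0.5 Å cutoff only flags near-overlapping atoms, so a looser cutoff would report more pairs.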