michellab / BioSimSpace

Code and resources for the EPSRC BioSimSpace project.
https://biosimspace.org
GNU General Public License v3.0
77 stars 19 forks source link

Poor MCS mapping behaviour #30

Closed lohedges closed 5 years ago

lohedges commented 6 years ago

From discussions with @ppxasjsm it is apparent that BioSimSpace is failing to obtain good mappings and alignments for ligands with markedly different binding poses. An example set of Cathespin ligands can be found here, along with a description of the issue.

The problem arises when you don't have a crystal structure for a ligand, then dock to achieve a binding pose. If the binding pose is markedly different to the reference ligand (the one which you want to map and align to) then the mapping and alignment is bad, even if the ligands are structurally very similar.

As an example, consider the Cat_4 and Cat_5 ligands shown below. From the 2D representation of the molecules it is clear that there is a large overlap in the molecular structure, which should be picked up by the maximum common substructure (MCS) mapping.

cathespin2d

There is a known binding pose for Cat_5, whereas the pose for Cat_4 has been obtained via docking. The VMD representations below show the 3D structure for the two poses, which are clearly very different.

cats_montage

We want to map atoms from the docked Cat_4 pose to those in Cat_5, then align Cat_4 to Cat_5 based on this mapping. Despite the similarity in the structure, BioSImSpace struggles to find a good mapping, matching only 11 atoms if run for the default timeout, and 31 if this is extended to 5 minutes. (Cat_4 has 97 atoms, Cat_5 has 95.) Aligning the molecules based on this mapping gives a poor result. (Since the alignment only performs translations and rotations of the entire ligand, this is perhaps not too surprising given the large difference in binding pose.)

An alternative solution is the use of a mapping scheme that allows the flexible transformation of a target molecule onto a reference molecule. This can be achieved, for example, by using the FKCOMBU command-line utility. (This works by performing a pair-wise MCS mapping but allowing deformations of the target structure.) Using this tool with the default options gives an excellent alignment in around 3 seconds! A comparison of the alignments obtained by BioSimSpace and FKCOMBU is show below.

alignment_montage

Do we want to support flexible transformations such as this? If so, it might be possible to wrap FKCOMBU as part of BioSimSpace.Align. (It's unlicensed as far as I can tell. You just need to fill in a simple form to get a download link.) One wrinkle is that FKCOMBU does the mapping and alignment in one go, whereas they can be decoupled in BioSimSpace. (You can get the mapping by itself, then align based on this mapping using an alignment protocol, which could allow for flexibility.) However, it should be easy enough to infer the mapping based upon the alignment (by coordinate matching) if we do want to decouple them.

Perhaps all of this is beyond the scope of BioSImSpace and we would assume that the user has already prepared their ligand as best as they can before attempting a mapping, e.g. by using FKCOMBU themselves. That said, I'm interested as to why we are struggling to get a good MCS mapping in the first place. I'll try to investigate further and post an update.

lohedges commented 6 years ago

For reference, a paper on the FKCOMBU method can be found here.

lohedges commented 6 years ago

We can also use the PKCOMBU program just to get the MCS mapping (rather than using FKCOMBU to do the mapping and flexible alignment). This gives a good mapping near instantaneously (37 matched atom pairs). I'll look into the logic to try to figure out why their MCS matching is working so much better than ours.

Here's a reference for the KCOMBU method.

lohedges commented 6 years ago

As another test I took the well aligned Cat_4 structure generated by FKCOMBU, then used BioSimSpace to get an atom mapping between this and Cat_5. This should test whether the poor mapping is a result of the difference in the ligand geometry, or the actual matching algorithm (our MCS vs the build-up MCS of KCOMBU).

Running the BioSimSpace atom matcher for the default timeout I get exactly the same result as before (using the original docked Cat_4 structure). This implies that the difference is because down to the matching algorithm, rather than the binding pose (as should be expected, since the connectivity is the same in both cases).

jmichel80 commented 6 years ago

Hi,

I’m on leave now so will just give a brief response for now. Our bar is to be at least as good as FESetup. So if FESetup gives acceptable mappings but BioSimSpace doesn’t we have to do better. Whether by improving the BSS mapping code or by using a third party tool is not important, provided this is transparent to the user, and we can easily support and distribute the needed code. Ideally our own code so we have more control on the implementation details, but if it is too time consuming to fix our MCSS we should move on with better third party tools. If we do a topological MCSS followed by some operation on coordinates we should still be able to find large MCSS for congeneric series.

Sent from my iPhone

On 4 Oct 2018, at 13:10, Lester Hedges notifications@github.com<mailto:notifications@github.com> wrote:

We can also use the PKCOMBUhttp://strcomp.protein.osaka-u.ac.jp/kcombu/doc/README_pkcombu.html program just to get the MCS mapping (rather than using FKCOMBU to do the mapping and flexible alignment). This gives a good mapping near instantaneously (37 matched atom pairs). I'll look into the logic to try to figure out why their MCS matching is working so much better than ours.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/michellab/BioSimSpace/issues/30#issuecomment-426994673, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ALLd5H82rqLoYR7gPXmD4Q4Tw6_w4xc_ks5uhfqsgaJpZM4XH3Yr.

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

lohedges commented 6 years ago

Okay, I think I've figured this out. The problem is that both molecules contain a large number of hydrogen atoms, which are now included as part of the MCS search. e.g. what we discovered when mapping methane and methanol. This leads to a massive increase in the search space, so the MCS routine becomes incredibly inefficient and struggles to find a good match.

If I revert to the original MCS scheme, where we can match light against light but not light against heavy, then I get a good mapping between the docked Cat_4 and the Cat_5 binding pose (46 atom pairs matched). Below is the new BioSimSpace alignment based on this mapping. (Obviously not as good as the flexible one, but far better than before.)

tmp

Perhaps we want to adjust our atom matching scheme to try both approaches, then take the best mapping of the two, i.e. the one with most matched atom pairs. This would allow for the possibility of matching light and heavy atoms (as in methane/methanol) but avoids the issues that we are seeing here. (This could still be problematic if we know that a hydrogen definitely maps to a heavy atom, but perhaps we could add this as a prematch.)

I'll also check the MCS code to make sure there isn't a bug in the logic for the heavy/light matching modifications that I introduced in Edinburgh (which were based on Julien's tests).

lohedges commented 6 years ago

I've added a temporary fix with commit #3c55bb0. This performs two MCS matches: one includes light atoms but doesn't allow matches between heavy and light atoms; the other allows light atoms and matches between heavy and light. The mapping that is returned is the one that produces the largest number of atom pair matches.

Ideally, these two different mappings should be generated concurrently, e.g. using multiprocessing or concurrent.futures within Python. However, the Sire objects needed for the MCS match (as well as the function itself) are not picklable so the two function calls are performed serially at present. When I get the chance I'll update findmcs.cpp so that the parallelism can be handled with TBB.

I'll leave this issue open for now as it's a useful reference and we'll no doubt come across other problems as we explore alignments for a wider range of molecules.

jmichel80 commented 6 years ago

might be useful for @JenkeScheen to take a look at http://strcomp.protein.osaka-u.ac.jp/kcombu/doc/README_fkcombu.html

lohedges commented 5 years ago

Closing since the robustness of the MCS mapping has been vastly improved by using a RDKit prematch. We'll use other threads to discuess specific issues/fixes for other ligand sets as they come up.