michellab / BioSimSpace

Code and resources for the EPSRC BioSimSpace project.
https://biosimspace.org
GNU General Public License v3.0
77 stars 19 forks source link

Molecular sanitization #124

Closed lohedges closed 3 years ago

lohedges commented 4 years ago

In figuring out the reason for some MCS mapping discrepancies between BioSimSpace and Cresset I have discovered that we do not sanitize molecules before performing the MCS match. This is because I followed the exact approach used by FESetup, where no sanitization is performed, e.g. the following which is taken from mutate/util.py:

mol1 = rdkit.Chem.MolFromMol2Block(mol2str_1, sanitize = False, removeHs = False)

The Cresset FEP pipeline uses a modified version of the open source LOMAP to generate its MCS mapping. This too uses RDKit for the actual grunt work, however molecules are sanitized on load. Another difference is the use of completeRingsOnly=False when calling rdFMCS (we set to True).

Should we be sanitizing the molecules? Is there are reason why FESetup chose not to do this? (Perhaps it was unreliable at the time, or we want to apply our own chemical intuition to the generated MCS mapping.)

If we do want to sanitize, then it's a trivial change. However, I'll need to update all of the MCS tests since the mappings that are generated are slightly different.

jmichel80 commented 4 years ago

I haven't tested myself the effect of sanitizing the input. The partial ring match has proven useful for the Cresset implementation but could potentially be problematic in some instances.

We should expose these keywords but keep defaults to the current behaviour to facilitate testing as the best practices haven't been conclusively worked out.

It would be interesting to check if one protocol appears consistently better in terms of e.g. size of the MCSS.

Sent from my iPhone

On 1 Oct 2019, at 17:34, Lester Hedges notifications@github.com<mailto:notifications@github.com> wrote:

In figuring out the reason for some MCS mapping discrepancies between BioSimSpace and Cresset I have discovered that we do not sanitizehttp://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization molecules before performing the MCS match. This is because I followed the exact approach used by FESetup, where no sanitization is performed, e.g. the following which is taken from mutate/util.pyhttps://github.com/CCPBioSim/fesetup/blob/master/mutate/util.py:

mol1 = rdkit.Chem.MolFromMol2Block(mol2str_1, sanitize = False, removeHs = False)

The Cresset FEP pipeline uses a modified version of the open source LOMAPhttps://github.com/MobleyLab/Lomap to generate its MCS mapping. This too uses RDKit for the actual grunt work, however molecules are sanitized on load. Another difference is the use of completeRingsOnly=False when calling rdFMCS (we set to True).

Should we be sanitizing the molecules? Is there are reason why FESetup chose not to do this? (Perhaps it was unreliable at the time, or we want to apply our own chemical intuition to the generated MCS mapping.)

If we do want to sanitize, then it's a trivial change. However, I'll need to update all of the MCS tests since the mappings that are generated are slightly different.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/michellab/BioSimSpace/issues/124?email_source=notifications&email_token=ACZN3ZEG7G2WDJTMT25ZBLTQMN32ZA5CNFSM4I4L6O7KYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HO42YFA, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACZN3ZDIHG4A5LGZAQ5O5HLQMN32ZANCNFSM4I4L6O7A.

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

lohedges commented 4 years ago

Thanks, @jmichel80.

We should expose these keywords but keep defaults to the current behaviour to facilitate testing as the best practices haven't been conclusively worked out.

I agree with this, although the MCS matching is so unreliable that we have multiple levels of fall back, which means that certain options aren't supported by all of the MCS engines. For example, we always use RDKit first, since it is the most robust, but fall back on Sire if the mapping doesn't contain the user pre-match. Sire doesn't have any sanitization functionality (that I'm aware of) so the keyword would have to be ignored here.

It would be interesting to check if one protocol appears consistently better in terms of e.g. size of the MCSS.

I'm not yet aware of any situation where sanitized input produces a larger MCS mapping. In some cases it is significantly smaller than the one we currently generate. For example, from discussions with @ptosco where we tried to debug the p38 ligand pair mapping issue at the bottom of this thread, we found that sanitization generates a 31 atom pair mapping, whereas non-sanitized input generates a 41 atom pair mapping. (Here the two ligands have 41 and 43 atoms.) Importantly, the smaller sanitized mapping can be merged, where the larger one that is currently generated results in a error, regardless of any ring breaking options that are specified during the merge.

I guess I was raising this question since I want to know if our mappings always make physical sense, i.e. we don't want to simply rely on size being a proxy for a good mapping, I'd rather take a smaller mapping if the larger one is nonsense. One problem with sanitization is that it looks like it could add atoms to a molecule, which would mean that the mapping might not be compatible with the original molecule that the user passed in.

lohedges commented 3 years ago

Added a cross-reference to the issue above. This shows a case where the sanitzed LOMAP input generates a larger mapping than BioSimSpace.