lewisacidic / scikit-chem

A high level cheminformatics package for the Scientific Python stack, built on RDKit.
http://scikit-chem.readthedocs.io/en/latest/index.html

Native Standardiser. #55

Open lewisacidic opened 8 years ago

lewisacidic commented 8 years ago

A native standardis(z)er would be a great addition to the library, as currently the only way to standardise molecules is using the ChemAxon Standardizer wrapper.

The implementation should provide a similar API to the current Standardizer, namely by inheriting from TransformFilter. It should be configurable in code, like the rest, which should also make it configurable with YAML and JSON.

>>> std = skchem.standardizers.Standardizer(remove_fragments=True, disconnect_metals=True, neutralize=True)
>>> m = skchem.Mol.from_smiles('CC.CCC', name='ethane_n_propane')
>>> std.transform(m).to_smiles()
'CCC'

>>> std.transform([m, skchem.Mol.from_smiles('CCO[Na]', name='sodium_ethoxide')])
batch
ethane_n_propane    CCC
sodium_ethoxide     CCO
Name: structure, dtype: object
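As a rough illustration of the "configurable in code, so configurable with JSON" idea, the sketch below round-trips the constructor options through JSON. `StandardizerConfig` is a hypothetical stand-in and not part of scikit-chem; only the three option names are taken from the example above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StandardizerConfig:
    """Hypothetical holder for the constructor options used in the example above."""

    remove_fragments: bool = True
    disconnect_metals: bool = True
    neutralize: bool = True

    def to_json(self) -> str:
        # The options are plain booleans, so they serialise directly.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "StandardizerConfig":
        # Unknown keys raise TypeError, which surfaces typos in a config file.
        return cls(**json.loads(text))


config = StandardizerConfig(neutralize=False)
restored = StandardizerConfig.from_json(config.to_json())
print(restored == config)
```

The same dictionary could be dumped to YAML instead, assuming a YAML library is available.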

Standardisation may be thought of as a series of elementary operations applied to molecules. These could be implemented as mini transformers, and the Standardizer could then just be a Pipeline (this would probably require work on the Pipeline class!).

from skchem.standardizers import FragmentRemover, MetalDisconnector ...
from skchem.pipeline import Pipeline

std = Pipeline([
    FragmentRemover(remove_smallest=True),
    MetalDisconnector(keep_grignards=True),
    ...
])

There are two issues with this: 1) it makes for a painfully verbose API; 2) there is a predetermined 'most sensible'/'correct' order in which to perform the transforms (for example, it's probably better to remove fragments before tautomerizing, as tautomerizing fragments that will be discarded is wasted effort).

Perhaps it would be best to have a Standardizer object (that could possibly inherit from Pipeline) that in turn creates the smaller objects and keeps sensible defaults.

class Standardizer(Pipeline):
    def __init__(self, remove_fragments=True, disconnect_metals=True...):
        # add in the 'sensible order'
        if remove_fragments:
            self.objects.append(FragmentRemover())
        # etc.

This makes it harder to have fine-grained control over these smaller objects, though (maybe we want to 'keep_grignards' or something), so perhaps we could pass the actual transformer when we want that control:

class Standardizer(Pipeline):
    def __init__(self, remove_fragments=True, disconnect_metals=True...):
        if remove_fragments:
            # True means 'use the default'; a Transformer instance is used as given
            if not isinstance(remove_fragments, Transformer):
                remove_fragments = FragmentRemover()
            self.objects.append(remove_fragments)
        # etc.

This would (with luck) serialise to JSON and YAML for free, and be easily configurable in a manner consistent with the rest of the library.
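A minimal, self-contained sketch of the "boolean or transformer" pattern described above follows. Everything in it is an assumption made for illustration: the classes are toy stand-ins that operate on SMILES strings with plain string handling, not real scikit-chem or RDKit code, and the `keep_grignards` flag is stored but has no effect here.

```python
class Transformer:
    """Hypothetical minimal base class standing in for the library's transformer."""

    def transform(self, smiles):
        raise NotImplementedError


class FragmentRemover(Transformer):
    """Toy version: keep the fragment with the longest SMILES string."""

    def transform(self, smiles):
        return max(smiles.split("."), key=len)


class MetalDisconnector(Transformer):
    """Toy version: strip a bracketed sodium atom. Real chemistry needs a toolkit."""

    def __init__(self, keep_grignards=False):
        self.keep_grignards = keep_grignards  # stored only; unused in this toy

    def transform(self, smiles):
        return smiles.replace("[Na]", "")


class Pipeline(Transformer):
    """Apply each contained transformer in order."""

    def __init__(self, objects=None):
        self.objects = list(objects or [])

    def transform(self, smiles):
        for obj in self.objects:
            smiles = obj.transform(smiles)
        return smiles


class Standardizer(Pipeline):
    def __init__(self, remove_fragments=True, disconnect_metals=True):
        super().__init__()
        # Steps are appended in a fixed 'sensible' order, whatever the caller passes.
        defaults = [
            (remove_fragments, FragmentRemover),
            (disconnect_metals, MetalDisconnector),
        ]
        for option, default_cls in defaults:
            if not option:
                continue  # False (or None) disables the step
            # True means "use the default"; a Transformer instance is used as given.
            step = option if isinstance(option, Transformer) else default_cls()
            self.objects.append(step)


std = Standardizer(disconnect_metals=MetalDisconnector(keep_grignards=True))
print(std.transform("CC.CCC"))
print(std.transform("CCO[Na]"))
```

The caller keeps the short keyword API for the common case and only constructs a step explicitly when it needs non-default settings, while the order of steps stays under the Standardizer's control.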

lewisacidic commented 8 years ago

A list of ChemAxon Standardizer 'Actions' can be found here:

https://docs.chemaxon.com/display/docs/Standardizer+Actions

A list of the features is below. As the features are developed, we can tick or cross off (if they are unnecessary, impractical or impossible). I bolded the most desirable features in my eyes.

lewisacidic commented 8 years ago

Projects that provide similar functionality include @mcs07's MolVS (in fact, MolVS is close to implementing much of the functionality. Matt, would you mind if we used any of the code?). Others are listed in the MolVS README.

lewisacidic commented 8 years ago

There is also @flatkinson's https://github.com/flatkinson/standardiser, which I am told is being actively used in the eTox project.

Both of these projects look good and battle-tested. Perhaps we should write a wrapper rather than reimplement the functionality for now?

mwojcikowski commented 8 years ago

I think most of them are trivial in RDKit. There are a few, though, like "Tautomerize", which are far from easy (although I think Paolo Tosco might have done something in that direction, judging from the last UGM presentation).

Shouldn't SanitizeMol = Mesomerise? I think so.

mwojcikowski commented 8 years ago

If you want, I'm happy to help with this one. I'm assembling a list of RDKit functions (or short implementation comments).

[Still updated]

lewisacidic commented 8 years ago

Hi @mwojcikowski, thanks a lot for this! @michaellampe is currently looking into this (his branch is here). Unfortunately I'm too busy with my PhD to have much input at the moment, so perhaps you could both discuss and work on it?

lewisacidic commented 8 years ago

I also had a chat with @mcs07 at the recent Cambridge Cheminformatics Network Meeting. He is hoping to continue working on MolVS when he gets some free time (he is also super busy with his PhD!). One extra feature he mentioned being interested in, which ChemAxon doesn't appear to provide, is ring opening/closing (e.g. linear vs cyclic glucose).

He also suggested looking at @russodanielp's fork of MolVS, which shows some recent work, specifically around pipelining.

russodanielp commented 8 years ago

Hi @mwojcikowski and @richlewis42. I started working on the pipeline and got it working for my purposes. I still need to clean up a bit of the code.

I am also involved in a few PhD research projects, but would be happy to contribute to this project or add to MolVS in my free time.

mcs07 commented 8 years ago

It also looks like the Avalon Struct Checker may soon be properly integrated into RDKit (https://github.com/rdkit/rdkit/pull/1054), which might be useful for many standardization tasks.