lewisacidic / scikit-chem

A high level cheminformatics package for the Scientific Python stack, built on RDKit.
http://scikit-chem.readthedocs.io/en/latest/index.html

Descriptors #56

Open lewisacidic opened 8 years ago

lewisacidic commented 8 years ago

scikit-chem currently provides only the descriptors implemented in RDKit. The easy-to-use core API in scikit-chem should make it quick and simple to implement more.

This could tie in with #46. I really like the idea of scikit-chem eventually being to cheminformatics what scikit-learn is to machine learning: a good place to get easy, open access to a large core of the functionality in an area, with good documentation of the underlying theory.

Specifically for descriptors, I am looking at implementing a number of descriptors somewhere between what ChemoPy and DRAGON provide.

To enumerate the DRAGON list (create a separate issue for each type), or strike out if it is out of scope:

Each descriptor set should ideally have

We should implement the descriptors as functions for now.

These need to be plugged into the pipelining architecture somewhere along the line, so they will probably end up as (static?) methods on classes (so they can be used directly), which are called from the class's Descriptorizer._transform_mol (but let's not call it Descriptorizer...). The classes should dictate the index for the columns that they provide, but _transform_mol should yield results as np.arrays so they can be stacked faster.
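A rough sketch of how that layout might look is below. Everything in it is hypothetical: `ToyMol` stands in for `skchem.Mol`, and `ConstitutionalDescriptors`, `columns` and `transform` are placeholder names, not existing scikit-chem API. The only point is that the descriptors stay usable directly as static methods, the class owns the column index, and `_transform_mol` returns a bare NumPy array.

```python
import numpy as np


class ToyMol:
    """Hypothetical stand-in for skchem.Mol: just a list of element symbols."""

    def __init__(self, symbols):
        self.symbols = list(symbols)


class ConstitutionalDescriptors:
    """Hypothetical descriptor set; the class name is only a placeholder."""

    # The class dictates the column index for the values it provides.
    columns = ('n_atoms', 'n_heavy_atoms')

    @staticmethod
    def n_atoms(mol):
        # Usable directly: ConstitutionalDescriptors.n_atoms(mol)
        return len(mol.symbols)

    @staticmethod
    def n_heavy_atoms(mol):
        return sum(1 for symbol in mol.symbols if symbol != 'H')

    def _transform_mol(self, mol):
        # Return a bare array (no index) so that rows can be stacked cheaply.
        values = [getattr(self, name)(mol) for name in self.columns]
        return np.array(values, dtype=float)

    def transform(self, mols):
        # One row per molecule, one column per descriptor.
        return np.vstack([self._transform_mol(mol) for mol in mols])


ethane = ToyMol(['C', 'C'] + ['H'] * 6)
methane = ToyMol(['C'] + ['H'] * 4)
features = ConstitutionalDescriptors().transform([ethane, methane])
```

The column labels would only be attached once, after stacking, rather than once per molecule.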

Caching

A (the?) key issue with descriptors (especially in ChemoPy, from what I can see) is that there are quantities that are required by many diverse descriptors. These should be cached in a good implementation. For example, RDKit caches them as m._{}, such as m._gasteigerCharges.

Whilst this works, it dirties the object, takes up unnecessary memory, and makes writing code quite difficult, as you need to call everything your descriptor depends on in case the quantities haven't been cached.

I propose a method similar in function, but perhaps cleaner in execution. We could declare requirements for cached descriptors, and have the value injected into the function.

Example usage is:

>>> from .caching import cache

>>> @cache
... def atomic_masses(mol):
...     return mol.atoms.atomic_mass

>>> @cache.inject(atomic_masses)
... def molecular_weight(mol, a_mass):
...     return sum(a_mass)

>>> m = skchem.Mol.from_smiles('CC').add_hs()
>>> molecular_weight(m)
30...

>>> m.cache['atomic_masses']
array([12, 12, 1, 1, 1, 1, 1, 1])

The cache puts the properties into the m.cache dictionary on the Mol, which could be emptied once we are finished calculating descriptors.

If a cached function takes options, we could save a nested dictionary keyed by the option values:


>>> @cache
... def atomic_masses(mol, heavy=False):
...     return mol.atoms.atomic_mass + (1 if heavy else 0)

>>> atomic_masses(mol, heavy=False); atomic_masses(mol, heavy=True);
>>> mol.cache['atomic_masses']
{False: array([12, 12, 1, 1, 1, 1, 1, 1]), True: array([13, 13, 2, 2, 2, 2, 2, 2])}

>>> @cache.inject(atomic_masses)
... def molecular_mass(mol, a_mass, heavy=False):
...     return sum(a_mass)

>>> molecular_mass(mol)
30...

>>> molecular_mass(mol, heavy=True)
38...

The caching mechanism forwards the keyword args to the injected properties.
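To make the proposal more concrete, here is one way the decorator could be sketched in plain Python. This is an illustration under assumptions, not an existing scikit-chem module: `_Cache` and `ToyMol` are hypothetical, masses are plain lists rather than arrays, the nested dictionary is keyed by a tuple of option values (so `(False,)` rather than `False`), and every injected dependency is assumed to accept the same keyword arguments as the descriptor that requests it.

```python
import functools
import inspect


class _Cache:
    """Hypothetical sketch of the proposed `cache` object."""

    def __call__(self, func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(mol, **kwargs):
            # Fill in defaults so f(mol) and f(mol, heavy=False) share a key.
            bound = signature.bind(mol, **kwargs)
            bound.apply_defaults()
            # Drop the first argument (the molecule); the rest are options.
            key = tuple(list(bound.arguments.values())[1:])
            store = mol.cache.setdefault(func.__name__, {})
            if key not in store:
                store[key] = func(mol, **kwargs)
            return store[key]

        return wrapper

    def inject(self, *dependencies):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(mol, **kwargs):
                # Forward the keyword arguments to each injected dependency.
                injected = [dep(mol, **kwargs) for dep in dependencies]
                return func(mol, *injected, **kwargs)

            return wrapper

        return decorator


cache = _Cache()


class ToyMol:
    """Hypothetical stand-in for skchem.Mol with a `cache` dictionary."""

    def __init__(self, masses):
        self.masses = list(masses)
        self.cache = {}


@cache
def atomic_masses(mol, heavy=False):
    return [mass + (1 if heavy else 0) for mass in mol.masses]


@cache.inject(atomic_masses)
def molecular_mass(mol, a_mass, heavy=False):
    return sum(a_mass)


ethane = ToyMol([12, 12, 1, 1, 1, 1, 1, 1])
light = molecular_mass(ethane)
heavy = molecular_mass(ethane, heavy=True)
```

After these two calls, `ethane.cache['atomic_masses']` holds one entry per option value, and `ethane.cache.clear()` would empty it once the descriptors have been calculated.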