Descriptors - Githubissues

scikit-chem currently only provides descriptors implemented in RDKit. The easy-to-use core API in scikit-chem should make it straightforward to implement more descriptors quickly.

This could tie in with #46. I really like the idea of scikit-chem eventually being to cheminformatics what scikit-learn is to machine learning: a good place to get easy, open access to a large core of the functionality in an area, with good documentation of the underlying theory.

Specifically for descriptors, I am looking at implementing a number of descriptors somewhere between that of ChemoPy and that of DRAGON.

The DRAGON descriptor types are enumerated below. Create a separate issue for each type, or strike it out if it is out of scope:

[ ] Constitutional
[ ] Ring descriptors
[ ] Topological indices
[ ] Walk and path counts
[ ] Connectivity indices
[ ] Information indices
[ ] 2D matrix based
[ ] 2D autocorrelations
[ ] Burden eigenvalues
[ ] P-VSA like (MOE)
[ ] ETA indices
[ ] Edge adjacency indices
[ ] Geometrical descriptors
[ ] 3D matrix-based descriptors
[ ] 3D autocorrelations
[ ] RDF descriptors
[ ] MoRSE descriptors
[ ] WHIM descriptors
[ ] Molecular profiles
[ ] Functional groups count
[ ] Atom-centered fragments (I think this is substructures...)
[ ] Atom-type E-state indices
[ ] CATS 2D
[ ] 2D atom pairs
[ ] Charge descriptors
[ ] Molecular properties
[ ] Drug-like indices
[ ] CATS 3D

Implementation details

Each descriptor set should ideally have:

Mathematical definitions
Example usage in docstring (with values from the source, so the testing framework picks this up).
References to original works, preferably including DOIs.
Implement as a function

We should implement the descriptors as functions for now.
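
As an illustration of a descriptor written as a plain function with the docstring elements listed above, here is a minimal sketch of the Wiener index (a topological index). It is not existing scikit-chem code: the function name is hypothetical, and the molecule is assumed to be given as an adjacency dict of heavy atoms rather than a skchem.Mol. The example value for n-butane was worked out by hand, and the DOI is left to be filled in from the publisher record.

```python
from collections import deque


def wiener_index(adjacency):
    r"""Wiener index of a hydrogen-suppressed molecular graph.

    .. math:: W = \frac{1}{2} \sum_{i} \sum_{j} d_{ij}

    where :math:`d_{ij}` is the number of bonds on the shortest path
    between heavy atoms i and j.

    Parameters
    ----------
    adjacency : dict
        Maps each atom index to an iterable of neighbouring atom indices
        (an assumed stand-in for a real molecule object).

    Examples
    --------
    n-Butane, C-C-C-C:

    >>> wiener_index({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})
    10

    References
    ----------
    Wiener, H. J. Am. Chem. Soc. 1947, 69, 17-20.
    """
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    # Every pair was counted twice, once from each end.
    return total // 2


butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
```

Because the example lives in the docstring, a doctest run would check the value automatically.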

These need to be plugged into the pipelining architecture somewhere along the line, so they will probably end up as (static?) methods on classes (so they can be used directly), which are called from the class's Descriptorizer._transform_mol (but let's not call it Descriptorizer...). They should dictate the index for the columns that are provided, but _transform_mol should yield results as np.arrays so they can be stacked faster.
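
One possible shape for this, as a rough sketch only: the class name, the `columns` attribute and the `transform` method are all hypothetical, and molecules are assumed to be plain lists of element symbols rather than skchem.Mol objects. The static methods stay usable on their own, the class declares the column names, and `_transform_mol` returns a bare NumPy array so rows can be stacked cheaply.

```python
import numpy as np


class ConstitutionalDescriptors:
    """Hypothetical descriptor set; a molecule here is a list of element symbols."""

    @staticmethod
    def atom_count(mol):
        return len(mol)

    @staticmethod
    def heavy_atom_count(mol):
        return sum(1 for symbol in mol if symbol != 'H')

    # The descriptor set dictates the column labels of the output.
    columns = ('atom_count', 'heavy_atom_count')

    def _transform_mol(self, mol):
        # One row per molecule, as a plain array (no labels attached yet).
        return np.array([getattr(self, name)(mol) for name in self.columns],
                        dtype=float)

    def transform(self, mols):
        # Stack the per-molecule rows; labels can be applied once at the end.
        return np.vstack([self._transform_mol(mol) for mol in mols])


ethane = ['C', 'C'] + ['H'] * 6
water = ['O', 'H', 'H']
features = ConstitutionalDescriptors().transform([ethane, water])
```

Attaching `columns` only after stacking keeps the per-molecule step free of any labelling overhead.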

Caching

A (the?) key issue with descriptors (especially in ChemoPy, from what I can see) is that some quantities are required by many diverse descriptors. A good implementation should cache these. For example, RDKit caches them as m._{}, such as m._gasteigerCharges.

Whilst this works, it dirties the object, takes up unnecessary memory, and makes writing code quite difficult, as you need to call everything your descriptor depends on in case the quantities haven't been cached.

I propose a method similar in function, but perhaps cleaner in execution. We could declare requirements for cached descriptors, and have the value injected into the function.

Example usage is:

>>> import skchem
>>> from .caching import cache

>>> @cache
... def atomic_masses(mol):
...     return mol.atoms.atomic_mass

>>> @cache.inject(atomic_masses)
... def molecular_weight(mol, a_mass):
...     return sum(a_mass)

>>> m = skchem.Mol.from_smiles('CC').add_hs()
>>> molecular_weight(m)
30...

>>> m.cache['atomic_masses']
array([12, 12, 1, 1, 1, 1, 1, 1])

The cache puts the properties into the m.cache dictionary on the Mol, which could be emptied once we are finished calculating descriptors.

If there are options, we could save a nested dictionary keyed by the option value:

>>> @cache
... def atomic_masses(mol, heavy=False):
...     return mol.atoms.atomic_mass + (1 if heavy else 0)

>>> mol = skchem.Mol.from_smiles('CC').add_hs()
>>> _ = atomic_masses(mol, heavy=False); _ = atomic_masses(mol, heavy=True)
>>> mol.cache['atomic_masses']
{False: array([12, 12, 1, 1, 1, 1, 1, 1]), True: array([13, 13, 2, 2, 2, 2, 2, 2])}

>>> @cache.inject(atomic_masses)
... def molecular_mass(mol, a_mass, heavy=False):
...     return sum(a_mass)

>>> molecular_mass(mol)
30...

>>> molecular_mass(mol, heavy=True)
38...

The caching mechanism forwards the keyword arguments to the injected properties.
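
A minimal sketch of how such a decorator could work is given below. It is not the scikit-chem implementation, only one way to get the behaviour described: the `Mol` class is a hypothetical stand-in holding a list of masses and a `cache` dict, options are assumed to be hashable and passed by keyword, and the nested dictionary is keyed by a tuple of option values (so `(False,)` rather than `False`) to cope with more than one option.

```python
import functools
import inspect


class Mol:
    """Hypothetical stand-in for skchem.Mol: atomic masses plus a cache dict."""

    def __init__(self, masses):
        self.masses = list(masses)
        self.cache = {}


class cache:
    """Memoise func(mol, **options) in mol.cache[func.__name__]."""

    def __init__(self, func):
        functools.update_wrapper(self, func)
        self.func = func
        # Every parameter after `mol` is treated as an option.
        self.options = list(inspect.signature(func).parameters)[1:]

    def __call__(self, mol, **kwargs):
        # Bind against the signature so defaults become part of the key.
        bound = inspect.signature(self.func).bind(mol, **kwargs)
        bound.apply_defaults()
        key = tuple(bound.arguments[name] for name in self.options)
        store = mol.cache.setdefault(self.func.__name__, {})
        if key not in store:
            store[key] = self.func(mol, **kwargs)
        return store[key]

    @staticmethod
    def inject(*dependencies):
        """Pass the cached value of each dependency as a positional argument."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(mol, **kwargs):
                injected = []
                for dep in dependencies:
                    # Forward only the keyword arguments the dependency accepts.
                    accepted = {k: v for k, v in kwargs.items()
                                if k in dep.options}
                    injected.append(dep(mol, **accepted))
                return func(mol, *injected, **kwargs)
            return wrapper
        return decorator


@cache
def atomic_masses(mol, heavy=False):
    return [m + (1 if heavy else 0) for m in mol.masses]


@cache.inject(atomic_masses)
def molecular_mass(mol, a_mass, heavy=False):
    return sum(a_mass)


ethane = Mol([12, 12, 1, 1, 1, 1, 1, 1])
light = molecular_mass(ethane)              # 30
heavy = molecular_mass(ethane, heavy=True)  # 38
```

Emptying the cache once descriptors have been calculated is then just `ethane.cache.clear()`.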

lewisacidic / scikit-chem

Descriptors #56
