pckroon / pysmiles

A lightweight python-only library for reading and writing SMILES strings
Apache License 2.0

Molecular Formula and Molecular Weight #16

Open liquidcarbon opened 3 years ago

liquidcarbon commented 3 years ago

Hello! I found your library very helpful in parsing SMILES.

Would you be interested in adding molecular formula (MF) and molecular weight (MW) as additional attributes?

Something along these lines:

from collections import defaultdict
from pysmiles import read_smiles

AW = {
    'C': 12.0107,
    'H': 1.00794,
    # etc.
}

class MolecularFormula:
    def __init__(self, smiles: str):
        self.smiles = smiles
        self.mf = defaultdict(lambda: 0)

        try:
            mol = read_smiles(
                smiles,
                explicit_hydrogen=False,
                reinterpret_aromatic=False,
            )
            nodes = mol.nodes()

            for i in range(mol.number_of_nodes()):
                self.mf[nodes[i]['element']] += 1
                self.mf['H'] += nodes[i]['hcount']

            self.mw = 0
            for k, v in self.mf.items():
                self.mw += AW[k] * v

            self.mw = round(self.mw, 2)
        except Exception as e:
            # log or raise
            self.mw = 0

    def __repr__(self):
        return ''.join([str(k)+str(v) for k,v in self.mf.items()])

pckroon commented 3 years ago

Happy to hear you find the library helpful :)

I'm not sure a MolecularFormula class is worth adding to the library. If anything, I'd make a Molecule class (a subclass of nx.Graph) and give it a molecular_formula attribute or property, but I'm not sure that's worth the hassle and complication. Generating the MF should be pretty straightforward anyway: collections.Counter(nx.get_node_attributes(mol, 'element').values()) and sum(nx.get_node_attributes(mol, 'hcount').values()) will get you 90% of the way.
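As a rough sketch of the remaining 10%, the two results can be merged and formatted. The function names `formula_counts` and `hill_formula` are hypothetical and not part of pysmiles; the inputs are hand-written node-to-value dicts shaped like what `nx.get_node_attributes` returns, so the example runs without a parsed molecule. Hill ordering (C, then H, then the rest alphabetically) is an assumption about the desired output format.

```python
from collections import Counter


def formula_counts(elements, hcounts):
    """Merge per-node element symbols with implicit hydrogen counts.

    Both arguments are node -> value mappings, shaped like the dicts from
    nx.get_node_attributes(mol, 'element') and
    nx.get_node_attributes(mol, 'hcount').
    """
    counts = Counter(elements.values())
    hydrogens = sum(hcounts.values())
    if hydrogens:  # avoid creating an 'H' entry with a zero count
        counts['H'] += hydrogens
    return counts


def hill_formula(counts):
    """Format counts in Hill order (assumed convention, see lead-in)."""
    if 'C' in counts:
        rest = sorted(sym for sym in counts if sym not in ('C', 'H'))
        order = ['C'] + (['H'] if 'H' in counts else []) + rest
    else:
        order = sorted(counts)  # no carbon: purely alphabetical
    return ''.join(
        sym + (str(counts[sym]) if counts[sym] > 1 else '') for sym in order
    )


# Hand-written attributes for ethanol (CCO), standing in for a parsed molecule.
elements = {0: 'C', 1: 'C', 2: 'O'}
hcounts = {0: 3, 1: 2, 2: 1}
print(hill_formula(formula_counts(elements, hcounts)))  # C2H6O
```

Explicit hydrogen nodes, if present, are already counted through their `element` attribute, so they need no special handling here.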

Adding a function that calculates a molecular_weight would not be too much work, and may be valuable to many people. However, I'd first have to find a periodic table library that's easy to install. There's no point in maintaining one myself as well...

liquidcarbon commented 3 years ago

Thanks for the tip on nx.get_node_attributes! My implementation appears to be about 25% faster than going through nx. For the periodic table, you only need a dictionary of atomic weights (if you ignore isotopes, which I would). You can build one like so:

import pandas as pd

ELEMENTS_URL = \
'https://raw.githubusercontent.com/bokeh/bokeh/branch-2.4/bokeh/sampledata/_data/elements.csv'
df = pd.read_csv(ELEMENTS_URL)
df = df[~df['atomic mass'].str.contains(r'\[')]  # ignore radioactive elements
AW = df.set_index('symbol')['atomic mass'].astype(float).to_dict()

pckroon commented 3 years ago

> My implementation appears to be about 25% faster than through nx.

That makes sense: my suggestion loops over the molecule twice, once for the elements and once for the hcount, rather than getting both at the same time.
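For comparison, a minimal single-pass version might look like the sketch below. The function name `count_atoms_single_pass` is hypothetical, and the input is assumed to be an iterable of per-node attribute dicts carrying `element` and `hcount` keys, as in the snippets earlier in this thread.

```python
from collections import Counter


def count_atoms_single_pass(node_attrs):
    """Tally elements and implicit hydrogens in one loop over the nodes."""
    counts = Counter()
    for attrs in node_attrs:
        counts[attrs['element']] += 1
        hydrogens = attrs.get('hcount', 0)
        if hydrogens:  # skip zero so no empty 'H' entry is created
            counts['H'] += hydrogens
    return counts


# Hand-written attributes for ethanol (CCO), standing in for a parsed molecule.
nodes = [
    {'element': 'C', 'hcount': 3},
    {'element': 'C', 'hcount': 2},
    {'element': 'O', 'hcount': 1},
]
print(dict(count_atoms_single_pass(nodes)))  # {'C': 2, 'H': 6, 'O': 1}
```

With a graph from `read_smiles`, the input could presumably be supplied as `(attrs for _, attrs in mol.nodes(data=True))`. Whether the speed difference matters in practice would need measuring on real molecules.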

I find pulling data from a network connection rather impolite for a library though, so I'd much rather add a dependency on a lightweight periodic table module.

liquidcarbon commented 3 years ago

Of course, I'm not suggesting executing it every time someone imports the library. This is just a way to retrieve the data once. I hard-coded the resulting dictionary into my module that does the MW calculation.
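A hard-coded table along those lines could be sketched as follows. `ATOMIC_WEIGHTS` and `molecular_weight` are hypothetical names, and the table is a deliberately small illustrative subset: the C and H values are the ones from the first snippet in this thread, while N and O are filled in as assumed standard values. Unlike the original class, which falls back to a weight of 0 on any error, this sketch raises when an element is missing from the table.

```python
# Illustrative subset only; a real module would list every supported element.
ATOMIC_WEIGHTS = {
    'H': 1.00794,
    'C': 12.0107,
    'N': 14.0067,   # assumed value, not from the thread
    'O': 15.9994,   # assumed value, not from the thread
}


def molecular_weight(counts, weights=ATOMIC_WEIGHTS):
    """Sum atomic weights for an element -> count mapping."""
    missing = sorted(set(counts) - set(weights))
    if missing:
        # Fail loudly instead of silently reporting a weight of 0.
        raise KeyError(f"no atomic weight for: {', '.join(missing)}")
    return sum(weights[symbol] * n for symbol, n in counts.items())


# Ethanol, C2H6O
print(round(molecular_weight({'C': 2, 'H': 6, 'O': 1}), 2))  # 46.07
```

Keeping the table as a plain module-level dict means no network access and no extra dependency at import time.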