rdkit / rdkit

The official sources for the RDKit library
BSD 3-Clause "New" or "Revised" License
2.53k stars 854 forks source link

MolFromSmiles and MolFromMolBlock produce a different molece #6815

Open drkeoni opened 9 months ago

drkeoni commented 9 months ago

Describe the bug When I create a molecule from SMILES and then AddHs, the 2D molecule behaves as expected. When that same molecule is created from a Mol block, AddHs places all hydrogens at the origin. I would like both molecules to be in the same starting state.

This is similar in spirit to #6349 but the specific test case is fairly different.

To Reproduce

import rdkit.Chem as Chem
# c1ccccc1 works as well to show this bug...
m_from_smiles = Chem.MolFromSmiles("C1C=CC=CC=1")
mol_block = Chem.MolToMolBlock(m_from_smiles)
m_from_mol_block = Chem.MolFromMolBlock(mol_block)
print(Chem.MolToMolBlock(Chem.AddHs(m_from_smiles)))
print(Chem.MolToMolBlock(Chem.AddHs(m_from_mol_block)))

In the first print out you'll see something like


     RDKit          2D

 12 12  0  0  0  0  0  0  0  0999 V2000
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7500   -1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7500   -1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7500    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7500    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000   -2.5981    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5000   -2.5981    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5000    2.5981    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    2.5981    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0
  2  3  1  0
  3  4  2  0
  4  5  1  0
  5  6  2  0
  6  1  1  0
  1  7  1  0
  2  8  1  0
  3  9  1  0
  4 10  1  0
  5 11  1  0
  6 12  1  0
M  END

In the second print out you'll see


     RDKit          2D

 12 12  0  0  0  0  0  0  0  0999 V2000
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7500   -1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7500   -1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7500    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7500    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0
  2  3  1  0
  3  4  2  0
  4  5  1  0
  5  6  2  0
  6  1  1  0
  1  7  1  0
  2  8  1  0
  3  9  1  0
  4 10  1  0
  5 11  1  0
  6 12  1  0
M  END

Expected behavior My issue isn't with AddHs; it's with the starting state of two molecules. I'd like to discover what operations or flags to set to put both of these molecules into the same state. Or if this is deemed a bug have it fixed in new code :)

Configuration (please complete the following information):

bp-kelley commented 9 months ago

When you start with a smiles string, there are no coordinates so when you add Hs it has to generate coordinates (the RDKit does this by default when generating a molblock)

If the molecule already has coordinates, which you get from making a molblock, addHs doesn't add coordinates by default and rdkit thinks the structure already has coords.

So in one case you have

smiles -> mol block(gen coords) -> addHs -> molblock (thinks coords are already generated and doesn't fiddle with the hydrogens)
smiles -> addHs -> molbock (gen coords, even for hs)

This is why the hydrogens don't have coords. But this is a long winded way of saying you need addCoords=True in AddHs if you already have coordinates.

print(Chem.MolToMolBlock(Chem.AddHs(m_from_mol_block, addCoords=True)))
bp-kelley commented 9 months ago

I tend to think that AddHs should add coordinates if the molecule already has them as this follow the path of least surprise, but this is a change to current behavior.

drkeoni commented 9 months ago

I understand the explanation, thanks for helping me see this!

I'd still like advice on how to have the result of "creation from SMILES" and the result of "creation from Mol block" provide the same starting point. It sounds to me like I might need to add coordinates to the SMILES-produced compound before calling AddHs? My goal here isn't to change how AddHs is called (that was an example); my goal is to have both methods of creation start from the same starting state for molecular operations (except of course the exact coordinates would be different).

(in my example I'm demonstrating that round-tripping is not what one might think; MolFromMolBlock(MolToMolBlock(m)) does not provide the same result as m)

greglandrum commented 8 months ago

I understand the explanation, thanks for helping me see this!

I'd still like advice on how to have the result of "creation from SMILES" and the result of "creation from Mol block" provide the same starting point. It sounds to me like I might need to add coordinates to the SMILES-produced compound before calling AddHs?

Since the coordinate generation from SMILES will, in general, produce different coords than what you find in the molecule from the mol block, the only way to get to the "same starting point" is to remove the coordinates from the molecule that came from the mol block.

My goal here isn't to change how AddHs is called (that was an example); my goal is to have both methods of creation start from the same starting state for molecular operations (except of course the exact coordinates would be different).

The coordinates themselves can actually matter, particularly for the assignment of stereochemistry.

greglandrum commented 8 months ago

I tend to think that AddHs should add coordinates if the molecule already has them as this follow the path of least surprise, but this is a change to current behavior.

I agree with this. We should change the default for addCoords to true and then make sure it doesn't try to add coordinates to anything that doesn't have a conformer.