Writing/reading an SDF file containing multiple conformers gives different behavior on OpenEye and RDKit backends

jchodera commented 4 years ago

Describe the bug Writing and then reading an SDF file containing multiple conformers gives different behavior on OpenMM and RDKit backends:

To Reproduce

from openforcefield.utils.toolkits import GLOBAL_TOOLKIT_REGISTRY
from openforcefield.topology import Molecule
smiles = 'Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C'

# OpenEye
print(GLOBAL_TOOLKIT_REGISTRY._toolkits[0])

molecule = Molecule.from_smiles(smiles)
molecule.generate_conformers(n_conformers=10)
molecule.to_file('imatinib.sdf', file_format='SDF')
loaded_molecules = Molecule.from_file('imatinib.sdf')
print(len(loaded_molecules))

# RDKit
GLOBAL_TOOLKIT_REGISTRY._toolkits.pop(0)
print(GLOBAL_TOOLKIT_REGISTRY._toolkits[0])
molecule = Molecule.from_smiles(smiles)
molecule.generate_conformers(n_conformers=10)
molecule.to_file('imatinib.sdf', file_format='SDF')
loaded_molecules = Molecule.from_file('imatinib.sdf')
print(len(loaded_molecules))

With the openeye toolkit, the first one succeeds, while the second one fails Output

<openforcefield.utils.toolkits.OpenEyeToolkitWrapper object at 0x1168ab810>
10
<openforcefield.utils.toolkits.RDKitToolkitWrapper object at 0x1168ab8d0>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-2cdddb211e30> in <module>
     20 molecule.to_file('imatinib.sdf', file_format='SDF')
     21 loaded_molecules = Molecule.from_file('imatinib.sdf')
---> 22 print(len(loaded_molecules))

TypeError: object of type 'Molecule' has no len()

Computing environment (please complete the following information):

Operating system: osx

Output of running conda list

 choderaj  (e) base  ~  github  choderalab  perses  conda list | grep open
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                h3d69b6c_7    conda-forge
llvm-openmp               9.0.1                h28b9765_2    conda-forge
openmoltools              0.0.0.dev0               pypi_0    pypi
openssl                   1.1.1d               h0b31af3_0    conda-forge
pyopenssl                 19.1.0                   py37_0

Additional context cc https://github.com/openforcefield/openforcefield/pull/533

j-wags commented 4 years ago

Tagging #281 -- I've made some major updates to SDF I/O there, and will see if that resolves this issue

jthorton commented 4 years ago

Just want to check what we expect the behaviour to be, should we write all conformers to file or just one? I think #281 will make the performance the same when we have properties in the SDF file as they will both only write the first conformer but I am not sure what will happen with multiple conformers and no properties.

jchodera commented 4 years ago

I think we need to write all conformers to the file if there are multiple conformers.

But the important thing is that read/write behavior should match between toolkit backends so that code doesn't have to be tailored per toolkit.

j-wags commented 4 years ago

We're going to have to deal with a lot of complexity if we start allowing SDF to be multi-conformer. Multi molecule is OK, but multi conformer is not part of the SDF spec, and leads to ambiguity about which properties belong to which molecule/conformer.

Per some insightful OETK docs

An ambiguity occurs when adding SD tag data to an OEMCMolBase and then writing it to SDF. SDF files do not support multiple conformers. However, OEChem TK can automatically read consecutive conformers out of a SDF file into a OEMCMolBase. To preserve the SD data OEChem TK has no choice but to push the data onto the conformers.

OETK does this weird thing where they allow each conformer to have different properties, likely as a result of this exact ambiguity. To implement the same behavior in RDKitToolkitWrapper would be a lot of work on our end. Instead, lots of complexity goes away if we simply say "SDF is NOT a multi-conformer format".

j-wags commented 4 years ago

To be clear, #281 in its current form will implement behavior in OFFTK that SDF can be multi-molecule, but we will never assume it is multi-conformer (neither on read nor write)

jchodera commented 4 years ago

@j-wags : If we don't allow SDF to be a multi-conformer format, we have to answer a few questions:

What do we do if we try to write a multi-conformer molecule as SDF? This should raise an Exception if the format does not support it.
What is our supported portable multi-conformer format if not SDF? Sounds like we don't support any true multi-conformer formats.
Even if we don't support multi-conformer, writing and reading should produce identical gross behavior (in terms of returning lists or single objects) for all backends.

j-wags commented 4 years ago

What do we do if we try to write a multi-conformer molecule as SDF? This should raise an Exception if the format does not support it.

In this case, we only write the first conformer. Every format conversion we do is capable of losing information -- writing to SMILES will lose partial charge+geometry, writing to mol2 will lose partial bond orders and map indices, etc. We can think hard about when to raise an exception vs. a warning vs. something else, but trying to hand-hold users too much has led to our current forest of (ignored) warnings, so I'm going to be pushing for the "less exceptions and warnings" route in this discussion.

What is our supported portable multi-conformer format if not SDF? Sounds like we don't support any true multi-conformer formats.

Correct. I don't know of a specification for a molecule that mirrors the fields in an OFFMol. In the long run, this will be handled our "interoperable molecule specification", but that doesn't exist yet (though @dgasmith brought it up again when we spoke this morning, so MolSSI would like to get moving on this). We're in this hard place where we can bend the SDF spec for a short-term gain, but in doing so we would start it down the same road as PDB format, and I think that's bad for the community.

Even if we don't support multi-conformer, writing and reading should produce identical gross behavior (in terms of returning lists or single objects) for all backends.

I agree. #281 implements strict behavior (and tests) about how to treat conformers during SDF I/O. The OETKW tests for "strictly single-conformer" behavior are here, and RDKTKW tests are further down in the same file. TL;DR -- #281 does ensure that both will behave the same re: multi-molecule SDFs moving forward.

jchodera commented 4 years ago

Correct. I don't know of a specification for a molecule that mirrors the fields in an OFFMol. In the long run, this will be handled our "interoperable molecule specification", but that doesn't exist yet (though @dgasmith brought it up again when we spoke this morning, so MolSSI would like to get moving on this). We're in this hard place where we can bend the SDF spec for a short-term gain, but in doing so we would start it down the same road as PDB format, and I think that's bad for the community.

This is all fine and good, but that doesn't answer the question of "what format do we currently support to bring in multi-conformer molecules"? If the answer is "we don't", then we have a problem, since the object model supports this but we can't read/write/serialize it.

openforcefield / openff-toolkit

Writing/reading an SDF file containing multiple conformers gives different behavior on OpenEye and RDKit backends #542