Open jchodera opened 4 years ago
Tagging #281 -- I've made some major updates to SDF I/O there, and will see if that resolves this issue
Just want to check what we expect the behaviour to be: should we write all conformers to file, or just one? I think #281 will make the behaviour the same when we have properties in the SDF file, since both backends will then write only the first conformer, but I am not sure what will happen with multiple conformers and no properties.
I think we need to write all conformers to the file if there are multiple conformers.
But the important thing is that read/write behavior should match between toolkit backends so that code doesn't have to be tailored per toolkit.
We're going to have to deal with a lot of complexity if we start allowing SDF to be multi-conformer. Multi molecule is OK, but multi conformer is not part of the SDF spec, and leads to ambiguity about which properties belong to which molecule/conformer.
An ambiguity occurs when adding SD tag data to an OEMCMolBase and then writing it to SDF. SDF files do not support multiple conformers. However, OEChem TK can automatically read consecutive conformers out of a SDF file into a OEMCMolBase. To preserve the SD data OEChem TK has no choice but to push the data onto the conformers.
OETK does this weird thing where they allow each conformer to have different properties, likely as a result of this exact ambiguity. To implement the same behavior in RDKitToolkitWrapper would be a lot of work on our end. Instead, lots of complexity goes away if we simply say "SDF is NOT a multi-conformer format".
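The ambiguity can be shown with a small, toolkit-free sketch. The parser below is a minimal illustration, not OEChem's or RDKit's reader: the sample SDF text, the `parse_records` helper, and the "same title means same molecule" grouping rule are all assumptions made for this example, and the atom blocks are left empty to keep it short.

```python
from itertools import groupby

# Two consecutive records with the same title but different SD data.
SDF_TEXT = """\
ethanol
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <energy>
-1.0

$$$$
ethanol
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <energy>
-2.5

$$$$
"""


def parse_records(text):
    """Split SDF text on the '$$$$' delimiter; keep each record's title and data items."""
    records = []
    for chunk in text.split("$$$$"):
        lines = chunk.strip("\n").splitlines()
        if not lines:
            continue
        data = {}
        i = 0
        while i < len(lines):
            line = lines[i]
            if line.startswith(">") and "<" in line:
                start = line.index("<")
                tag = line[start + 1 : line.index(">", start)]
                i += 1
                value_lines = []
                # A data item's value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    value_lines.append(lines[i])
                    i += 1
                data[tag] = "\n".join(value_lines)
            i += 1
        records.append({"title": lines[0].strip(), "data": data})
    return records


records = parse_records(SDF_TEXT)

# Reading 1: every record is its own molecule (multi-molecule, single-conformer).
as_molecules = [(r["title"], r["data"]) for r in records]

# Reading 2 (assumed grouping rule): consecutive records with the same title
# are conformers of one molecule, so the data must live on each conformer.
as_conformers = [
    (title, [r["data"] for r in group])
    for title, group in groupby(records, key=lambda r: r["title"])
]

print(len(as_molecules), "molecules vs", len(as_conformers), "molecule with conformers")
```

The same file yields two molecules under the first reading and one molecule with two per-conformer `energy` values under the second, which is why a single molecule-level property dict cannot represent it without a policy decision.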
To be clear, #281 in its current form will make OFFTK treat SDF as a multi-molecule format, but we will never assume it is multi-conformer (neither on read nor on write).
@j-wags : If we don't allow SDF to be a multi-conformer format, we have to answer a few questions:
What do we do if we try to write a multi-conformer molecule as SDF? This should raise an Exception if the format does not support it.
In this case, we only write the first conformer. Every format conversion we do is capable of losing information -- writing to SMILES will lose partial charge+geometry, writing to mol2 will lose partial bond orders and map indices, etc. We can think hard about when to raise an exception vs. a warning vs. something else, but trying to hand-hold users too much has led to our current forest of (ignored) warnings, so I'm going to be pushing for the "fewer exceptions and warnings" route in this discussion.
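A rough sketch of the "write only the first conformer" policy, independent of any toolkit. `SimpleMolecule` and `to_sdf_record` are hypothetical names invented for this illustration and are not the OFFTK API; the record layout is a simplified V2000-style mol block.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleMolecule:
    """Hypothetical stand-in for a toolkit molecule; not the OFFTK Molecule class."""
    name: str
    elements: list                                   # e.g. ["O", "H", "H"]
    bonds: list                                      # (i, j, order), 1-based atom indices
    conformers: list = field(default_factory=list)   # each conformer: list of (x, y, z)
    properties: dict = field(default_factory=dict)


def to_sdf_record(mol):
    """Write exactly one SDF record, silently dropping every conformer after the first."""
    if mol.conformers:
        coords = mol.conformers[0]
    else:
        # No geometry at all: fall back to all-zero coordinates.
        coords = [(0.0, 0.0, 0.0)] * len(mol.elements)

    lines = [mol.name, "  sketch", ""]
    lines.append(f"{len(mol.elements):3d}{len(mol.bonds):3d}" + "  0" * 8 + "999 V2000")
    for element, (x, y, z) in zip(mol.elements, coords):
        lines.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {element:<3s} 0" + "  0" * 11)
    for i, j, order in mol.bonds:
        lines.append(f"{i:3d}{j:3d}{order:3d}  0")
    lines.append("M  END")
    # Properties are written once, at the molecule level.
    for tag, value in mol.properties.items():
        lines += [f">  <{tag}>", str(value), ""]
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


water = SimpleMolecule(
    name="water",
    elements=["O", "H", "H"],
    bonds=[(1, 2, 1), (1, 3, 1)],
    conformers=[
        [(0.0, 0.0, 0.0), (0.9572, 0.0, 0.0), (-0.2400, 0.9266, 0.0)],
        [(0.0, 0.0, 0.0), (0.0, 0.9572, 0.0), (0.9266, -0.2400, 0.0)],
    ],
    properties={"source": "sketch"},
)
record = to_sdf_record(water)
print(record)
```

Because the writer emits one record per molecule, properties stay unambiguous, at the cost of losing the second conformer without any warning or exception.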
What is our supported portable multi-conformer format if not SDF? Sounds like we don't support any true multi-conformer formats.
Correct. I don't know of a specification for a molecule that mirrors the fields in an OFFMol. In the long run, this will be handled by our "interoperable molecule specification", but that doesn't exist yet (though @dgasmith brought it up again when we spoke this morning, so MolSSI would like to get moving on this). We're in this hard place where we can bend the SDF spec for a short-term gain, but in doing so we would start it down the same road as PDB format, and I think that's bad for the community.
Even if we don't support multi-conformer, writing and reading should produce identical gross behavior (in terms of returning lists or single objects) for all backends.
I agree. #281 implements strict behavior (and tests) about how to treat conformers during SDF I/O. The OETKW tests for "strictly single-conformer" behavior are here, and RDKTKW tests are further down in the same file. TL;DR -- #281 does ensure that both will behave the same re: multi-molecule SDFs moving forward.
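One way to picture a "same gross behavior" check is sketched below. This is not the test code from #281; `gross_shape`, `find_parity_mismatches`, the backend names, and the example results are all made up for illustration.

```python
def gross_shape(result):
    """Reduce a read result to its gross shape: a sequence of N items, or a single object."""
    if isinstance(result, (list, tuple)):
        return ("sequence", len(result))
    return ("single", 1)


def find_parity_mismatches(results_by_backend):
    """Return the backends whose gross shape differs from the first backend's."""
    shapes = {name: gross_shape(r) for name, r in results_by_backend.items()}
    if not shapes:
        return {}
    reference = next(iter(shapes.values()))
    return {name: shape for name, shape in shapes.items() if shape != reference}


# Hypothetical results of reading the same SDF file through two backends.
consistent = {"backend_a": ["mol"], "backend_b": ["mol"]}
inconsistent = {"backend_a": ["mol"], "backend_b": ["mol", "mol"]}

print(find_parity_mismatches(consistent))
print(find_parity_mismatches(inconsistent))
```

A check of this kind only compares list-versus-single and item counts; it says nothing about whether coordinates or properties agree between backends.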
This is all fine and good, but that doesn't answer the question of "what format do we currently support to bring in multi-conformer molecules"? If the answer is "we don't", then we have a problem, since the object model supports this but we can't read/write/serialize it.
Describe the bug
Writing and then reading an SDF file containing multiple conformers gives different behavior on the OpenEye and RDKit backends:
To Reproduce
With the OpenEye toolkit, the first one succeeds, while the second one fails.
Output
Computing environment (please complete the following information):
osx
conda list
Additional context
cc https://github.com/openforcefield/openforcefield/pull/533