openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
309 stars 90 forks source link

Molecule.from_x methods should treat disconnected structures as separate molecules #1009

Open mattwthompson opened 3 years ago

mattwthompson commented 3 years ago

SMILES strings can store "disconnected molecules" like a ligand with a counter-ion. In OpenFF world, these should be treated as separate molecules - they should be parameterized separately, so they should be stored separately.

Probably the simplest example is NaCl:

>>> from openff.toolkit.topology.molecule import Molecule
>>> nacl = Molecule.from_smiles("[Na+].[Cl-]")
>>> assert type(nacl) is list, "nope, looks like it's read in as a single molecule"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: nope, looks like it's read in as a single molecule

This should probably return a list of Molecules, one for each ion. Presumably there's a way for this case to be processed by the wrapped toolkits.

There are probably other points in the infrastructure this behavior should be considered; Molecule.from_smilies is just where I happened to come across it.

mattwthompson commented 3 years ago

@j-wags suggested the following solution on Slack, which @davidlmobley agreed with:

In the short term, we should have that from_smiles call raise an error. In the long term, we can discuss whether to have that return two molecules (though that would change the return type of from_smiles to sometimes be multiple molecules, which is a bad practice)

adalke commented 3 years ago

Structure file records may have multiple components as well. Consider from_file("salts.smi") containing:

[Na+].[Cl-] table
[Cl-].[NH4+] liquorice

or the corresponding SDF.

I agree it would be error-prone to have a function sometimes return a molecule and other times return a list of molecules. Perhaps from_multicomponent_smiles() and from_multicomponent_file()?

j-wags commented 3 years ago

Perhaps from_multicomponent_smiles() and from_multicomponent_file()?

That's an interesting idea. I'd been thinking that it'd be safest/most generic to make the return type of from_file (and possibly now from_smiles) always be an iterable of Molecules, even if there's just one Molecule in it most of the time. Though the ...multicomponent... calls could actually offer us a reverse-compatible way forward, where we keep the old from_file, but then the from_multicomponent_file would be the thing that always returns a list!

adalke commented 3 years ago

This sort of basic validation should also check for atoms with atomic number 0 ("*" atoms in SMILES, R-groups in molfiles). Plus probably also unsupported bond types.