openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
312 stars 92 forks source link

Enforce that `Molecule` cannot represent multiple molecules #1587

Closed mattwthompson closed 1 week ago

mattwthompson commented 1 year ago

Is your feature request related to a problem? Please describe.

The Molecule class is intended to represent a molecule - defined as a single connected graph of atoms - but has little or no handling for inputs that contain multiple molecules.

>>> from openff.toolkit.topology.molecule import Molecule, FrozenMolecule, nx
>>> FrozenMolecule.__doc__.split("\n")[1]
'    Immutable chemical representation of a molecule, such as a small molecule or biopolymer.'
>>> Molecule.__doc__.split("\n")[1]
'    Mutable chemical representation of a molecule, such as a small molecule or biopolymer.'
>>> Molecule.from_smiles("O.O")
Molecule with name '' and SMILES '[H]O[H].[H]O[H]'
>>> nx.number_connected_components(Molecule.from_file("complex.pdb").to_networkx())
2165

This gives me the impression that I can safely feed the toolkit an arbitrary PDB file, which I didn't think was the case?

This also leads to dangerous behavior when feeding things through to other parts of the API that may or may not handle the case of multi-molecule Molecules, i.e.

In [8]: Molecule.from_smiles("O.O").assign_partial_charges(partial_charge_method
   ...: ="am1bcc")
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
/Users/mattthompson/mambaforge/envs/openff-interchange-env/bin/wrapped_progs/antechamber: Fatal Error!
Unable to find sqm charges in file (sqm.out).
Verify the filename and the file contents.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 Molecule.from_smiles("O.O").assign_partial_charges(partial_charge_method="am1bcc")

File ~/mambaforge/envs/openff-interchange-env/lib/python3.10/site-packages/openff/toolkit/topology/molecule.py:2503, in FrozenMolecule.assign_partial_charges(self, partial_charge_method, strict_n_conformers, use_conformers, toolkit_registry, normalize_partial_charges)
   2490         warnings.warn(
   2491             f"Warning! Partial charge method '{partial_charge_method}' is not designed "
   2492             "for use on large (i.e. > 150 atoms) molecules and may crash or take hours to "
   (...)
   2495             "#parameterizing-my-system-which-contains-a-large-molecule-is-taking-forever-whats-wrong",
   2496         )
   2498 if isinstance(toolkit_registry, ToolkitRegistry):
   2499     # We may need to try several toolkitwrappers to find one
   2500     # that supports the desired partial charge method, so we
   2501     # tell the ToolkitRegistry to continue trying ToolkitWrappers
   2502     # if one raises an error (raise_exception_types=[])
-> 2503     toolkit_registry.call(
   2504         "assign_partial_charges",
   2505         molecule=self,
   2506         partial_charge_method=partial_charge_method,
   2507         use_conformers=use_conformers,
   2508         strict_n_conformers=strict_n_conformers,
   2509         normalize_partial_charges=normalize_partial_charges,
   2510         raise_exception_types=[],
   2511         _cls=self.__class__,
   2512     )
   2513 elif isinstance(toolkit_registry, ToolkitWrapper):
   2514     toolkit_wrapper: ToolkitWrapper = toolkit_registry

File ~/mambaforge/envs/openff-interchange-env/lib/python3.10/site-packages/openff/toolkit/utils/toolkit_registry.py:370, in ToolkitRegistry.call(self, method_name, raise_exception_types, *args, **kwargs)
    368 for toolkit, error in errors:
    369     msg += " {} {} : {}\n".format(toolkit, type(error), error)
--> 370 raise ValueError(msg)

ValueError: No registered toolkits can provide the capability "assign_partial_charges" for args "()" and kwargs "{'molecule': Molecule with name '' and SMILES '[H]O[H].[H]O[H]', 'partial_charge_method': 'am1bcc', 'use_conformers': None, 'strict_n_conformers': False, 'normalize_partial_charges': True, '_cls': <class 'openff.toolkit.topology.molecule.Molecule'>}"
Available toolkits are: [ToolkitWrapper around OpenEye Toolkit version 2022.2.2, ToolkitWrapper around The RDKit version 2022.09.5, ToolkitWrapper around AmberTools version 22.0, ToolkitWrapper around Built-in Toolkit version None]
 ToolkitWrapper around OpenEye Toolkit version 2022.2.2 <class 'openff.toolkit.utils.exceptions.InvalidConformerError'> : molecule.add_conformer given input of the wrong shape: Given (3, 3), expected (6, 3)
 ToolkitWrapper around The RDKit version 2022.09.5 <class 'openff.toolkit.utils.exceptions.ChargeMethodUnavailableError'> : partial_charge_method 'am1bcc' is not available from RDKitToolkitWrapper. Available charge methods are ['mmff94', 'gasteiger']
 ToolkitWrapper around AmberTools version 22.0 <class 'subprocess.CalledProcessError'> : Command '['antechamber', '-i', 'molecule.sdf', '-fi', 'sdf', '-o', 'charged.mol2', '-fo', 'mol2', '-pf', 'yes', '-dr', 'n', '-c', 'bcc', '-nc', '0.0']' returned non-zero exit status 1.
 ToolkitWrapper around Built-in Toolkit version None <class 'openff.toolkit.utils.exceptions.ChargeMethodUnavailableError'> : Partial charge method "am1bcc"" is not supported by the Built-in toolkit. Available charge methods are ['zeros', 'formal_charge']

Describe the solution you'd like

Raise an exception when parsing inputs containing multiple disconnected molecular graphs.

Describe alternatives you've considered

An alternative involves adding a vast number of tests involving passing a multi-molecule Molecule. I don't really want to do that.

Additional context

I raised a similar suggestion years ago, but didn't get much traction.

Yoshanuikabundi commented 1 year ago

+1 but I think we should make a clear decision about how to recommend users handle organic salts - is the metal ion its own molecule or should it be bonded to the anion?

j-wags commented 3 weeks ago

This is a good idea but it will

I'm leaning toward "do it anyway", but maybe we start with emitting a warning for a while so that folks who will be affected can know to downpin/work with us on migrating their use cases.

mattwthompson commented 3 weeks ago

break user workflows that rely on multi-molecule Molecules

I'd argue these should be broken, since the class does not appear to be designed to handle multi-molecule structures, isn't tested for this use case and the various permutations (here or downstream!), and this causes many sharp edges that are easy to demonstrate, i.e.


In [1]: from openff.toolkit import Molecule, ForceField

In [2]: Molecule.from_smiles("C.C").to_topology().n_molecules
Out[2]: 1

In [3]: ForceField("openff-2.2.0.offxml").create_interchange(
   ...:     Molecule.from_smiles("C.C").to_topology()
   ...: )
/Users/mattthompson/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/toolkit/typing/engines/smirnoff/forcefield.py:101: SyntaxWarning: invalid escape sequence '\*'
  """
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 ForceField("openff-2.2.0.offxml").create_interchange(
      2     Molecule.from_smiles("C.C").to_topology()
      3 )

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/utilities/utilities.py:80, in requires_package.<locals>.inner_decorator.<locals>.wrapper(*args, **kwargs)
     77 except Exception as e:
     78     raise e
---> 80 return function(*args, **kwargs)

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/toolkit/typing/engines/smirnoff/forcefield.py:1252, in ForceField.create_interchange(self, topology, toolkit_registry, charge_from_molecules, partial_bond_orders_from_molecules, allow_nonintegral_charges)
   1249     used_registry = GLOBAL_TOOLKIT_REGISTRY
   1251 with toolkit_registry_manager(used_registry):
-> 1252     return Interchange.from_smirnoff(
   1253         force_field=self,
   1254         topology=topology,
   1255         charge_from_molecules=charge_from_molecules,
   1256         partial_bond_orders_from_molecules=partial_bond_orders_from_molecules,
   1257         allow_nonintegral_charges=allow_nonintegral_charges,
   1258     )

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/interchange/components/interchange.py:268, in Interchange.from_smirnoff(cls, force_field, topology, box, positions, charge_from_molecules, partial_bond_orders_from_molecules, allow_nonintegral_charges)
    219 """
    220 Create a new object by parameterizing a topology with a SMIRNOFF force field.
    221
   (...)
    264
    265 """
    266 from openff.interchange.smirnoff._create import _create_interchange
--> 268 return _create_interchange(
    269     force_field=force_field,
    270     topology=topology,
    271     box=box,
    272     positions=positions,
    273     charge_from_molecules=charge_from_molecules,
    274     partial_bond_orders_from_molecules=partial_bond_orders_from_molecules,
    275     allow_nonintegral_charges=allow_nonintegral_charges,
    276 )

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/interchange/smirnoff/_create.py:112, in _create_interchange(force_field, topology, box, positions, charge_from_molecules, partial_bond_orders_from_molecules, allow_nonintegral_charges)
    109 _impropers(interchange, force_field, _topology)
    111 _vdw(interchange, force_field, _topology)
--> 112 _electrostatics(
    113     interchange,
    114     force_field,
    115     _topology,
    116     charge_from_molecules,
    117     allow_nonintegral_charges,
    118 )
    119 _plugins(interchange, force_field, _topology)
    121 _virtual_sites(interchange, force_field, _topology)

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/interchange/smirnoff/_create.py:267, in _electrostatics(interchange, force_field, topology, charge_from_molecules, allow_nonintegral_charges)
    262     else:
    263         return
    265 interchange.collections.update(
    266     {
--> 267         "Electrostatics": SMIRNOFFElectrostaticsCollection.create(
    268             parameter_handler=[
    269                 handler
    270                 for handler in [
    271                     force_field._parameter_handlers.get(name, None)
    272                     for name in [
    273                         "Electrostatics",
    274                         "ChargeIncrementModel",
    275                         "ToolkitAM1BCC",
    276                         "LibraryCharges",
    277                     ]
    278                 ]
    279                 if handler is not None
    280             ],
    281             topology=topology,
    282             charge_from_molecules=charge_from_molecules,
    283             allow_nonintegral_charges=allow_nonintegral_charges,
    284         ),
    285     },
    286 )

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/interchange/smirnoff/_nonbonded.py:452, in SMIRNOFFElectrostaticsCollection.create(cls, parameter_handler, topology, charge_from_molecules, allow_nonintegral_charges)
    436 toolkit_handler_with_metadata = [
    437     p for p in parameter_handlers if type(p) is ElectrostaticsHandler
    438 ][0]
    440 handler = cls(
    441     type=toolkit_handler_with_metadata.TAGNAME,
    442     scale_12=toolkit_handler_with_metadata.scale12,
   (...)
    449     exception_potential=toolkit_handler_with_metadata.exception_potential,
    450 )
--> 452 handler.store_matches(
    453     parameter_handlers,
    454     topology,
    455     charge_from_molecules=charge_from_molecules,
    456     allow_nonintegral_charges=allow_nonintegral_charges,
    457 )
    458 handler._charges = dict()
    460 return handler

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/interchange/smirnoff/_nonbonded.py:863, in SMIRNOFFElectrostaticsCollection.store_matches(self, parameter_handler, topology, charge_from_molecules, allow_nonintegral_charges)
    858 # TODO: Here is where the toolkit calls self.check_charges_assigned(). Do we skip this
    859 #       entirely given that we are not accepting `charge_from_molecules`?
    861 if not flag:
    862     # TODO: Rename this method to something like `_find_matches`
--> 863     matches, potentials = self._find_reference_matches(
    864         parameter_handlers,
    865         unique_molecule,
    866     )
    868 self.potentials.update(potentials)
    870 for unique_molecule_atom in unique_molecule.atoms:

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/interchange/smirnoff/_nonbonded.py:721, in SMIRNOFFElectrostaticsCollection._find_reference_matches(cls, parameter_handlers, unique_molecule)
    711     slot_matches, slot_potentials = cls._find_slot_matches(
    712         parameter_handler,
    713         unique_molecule,
    714     )
    716 if handler_type in ["ToolkitAM1BCC", "ChargeIncrementModel"]:
    717     (
    718         partial_charge_method,
    719         am1_matches,
    720         am1_potentials,
--> 721     ) = cls._find_charge_model_matches(
    722         parameter_handler,
    723         unique_molecule,
    724     )
    726 if slot_matches is None and am1_matches is None:
    727     raise NotImplementedError()

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/interchange/smirnoff/_nonbonded.py:657, in SMIRNOFFElectrostaticsCollection._find_charge_model_matches(cls, parameter_handler, unique_molecule)
    651 else:
    652     raise InvalidParameterHandlerError(
    653         f"Encountered unknown handler of type {type(parameter_handler)} where only "
    654         "ToolkitAM1BCCHandler or ChargeIncrementModelHandler are expected.",
    655     )
--> 657 partial_charges = cls._compute_partial_charges(
    658     unique_molecule,
    659     unique_molecule.to_smiles(
    660         isomeric=True,
    661         explicit_hydrogens=True,
    662         mapped=True,
    663     ),
    664     method=partial_charge_method,
    665 )
    667 matches = {}
    668 potentials = {}

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/interchange/smirnoff/_nonbonded.py:472, in SMIRNOFFElectrostaticsCollection._compute_partial_charges(cls, molecule, mapped_smiles, method)
    470 """Call out to the toolkit's toolkit wrappers to generate partial charges."""
    471 molecule = copy.deepcopy(molecule)
--> 472 molecule.assign_partial_charges(method)
    474 return molecule.partial_charges

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/toolkit/topology/molecule.py:2671, in FrozenMolecule.assign_partial_charges(self, partial_charge_method, strict_n_conformers, use_conformers, toolkit_registry, normalize_partial_charges)
   2657         warnings.warn(
   2658             f"Warning! Partial charge method '{partial_charge_method}' is not designed "
   2659             "for use on large (i.e. > 150 atoms) molecules and may crash or take hours to "
   (...)
   2663             stacklevel=2,
   2664         )
   2666 if isinstance(toolkit_registry, ToolkitRegistry):
   2667     # We may need to try several toolkitwrappers to find one
   2668     # that supports the desired partial charge method, so we
   2669     # tell the ToolkitRegistry to continue trying ToolkitWrappers
   2670     # if one raises an error (raise_exception_types=[])
-> 2671     toolkit_registry.call(
   2672         "assign_partial_charges",
   2673         molecule=self,
   2674         partial_charge_method=partial_charge_method,
   2675         use_conformers=use_conformers,
   2676         strict_n_conformers=strict_n_conformers,
   2677         normalize_partial_charges=normalize_partial_charges,
   2678         raise_exception_types=[],
   2679         _cls=self.__class__,
   2680     )
   2681 elif isinstance(toolkit_registry, ToolkitWrapper):
   2682     toolkit_wrapper: ToolkitWrapper = toolkit_registry

File ~/micromamba/envs/yammbs-dev/lib/python3.12/site-packages/openff/toolkit/utils/toolkit_registry.py:384, in ToolkitRegistry.call(self, method_name, raise_exception_types, *args, **kwargs)
    382 for toolkit, error in errors:
    383     msg += f" {toolkit} {type(error)} : {error}\n"
--> 384 raise ValueError(msg)

ValueError: No registered toolkits can provide the capability "assign_partial_charges" for args "()" and kwargs "{'molecule': Molecule with name '' and SMILES '[H]C([H])([H])[H].[H]C([H])([H])[H]', 'partial_charge_method': 'am1bccelf10', 'use_conformers': None, 'strict_n_conformers': False, 'normalize_partial_charges': True, '_cls': <class 'openff.toolkit.topology.molecule.Molecule'>}"
Available toolkits are: [ToolkitWrapper around OpenEye Toolkit version 2024.1.1, ToolkitWrapper around The RDKit version 2024.03.5, ToolkitWrapper around AmberTools version 22.0, ToolkitWrapper around Built-in Toolkit version None]
 ToolkitWrapper around OpenEye Toolkit version 2024.1.1 <class 'openff.toolkit.utils.exceptions.InvalidConformerError'> : molecule.add_conformer given input of the wrong shape: Given (5, 3), expected (10, 3)
 ToolkitWrapper around The RDKit version 2024.03.5 <class 'openff.toolkit.utils.exceptions.ChargeMethodUnavailableError'> : partial_charge_method 'am1bccelf10' is not available from RDKitToolkitWrapper. Available charge methods are {'mmff94': {}, 'gasteiger': {}}
 ToolkitWrapper around AmberTools version 22.0 <class 'openff.toolkit.utils.exceptions.ChargeMethodUnavailableError'> : partial_charge_method 'am1bccelf10' is not available from AmberToolsToolkitWrapper. Available charge methods are {'am1bcc': {'antechamber_keyword': 'bcc', 'min_confs': 1, 'max_confs': 1, 'rec_confs': 1}, 'am1-mulliken': {'antechamber_keyword': 'mul', 'min_confs': 1, 'max_confs': 1, 'rec_confs': 1}, 'gasteiger': {'antechamber_keyword': 'gas', 'min_confs': 0, 'max_confs': 0, 'rec_confs': 0}}
 ToolkitWrapper around Built-in Toolkit version None <class 'openff.toolkit.utils.exceptions.ChargeMethodUnavailableError'> : Partial charge method "am1bccelf10"" is not supported by the Built-in toolkit. Available charge methods are {'zeros': {'rec_confs': 0, 'min_confs': 0, 'max_confs': 0}, 'formal_charge': {'rec_confs': 0, 'min_confs': 0, 'max_confs': 0}}
mrshirts commented 3 weeks ago

Is the right/simplest thing when passed something that is multimolecule to recognize that on read in, and split it into two molecules, so that it returns a list of molecules, rather a single molecule?

mrshirts commented 3 weeks ago

+1 but I think we should make a clear decision about how to recommend users handle organic salts - is the metal ion its own molecule or should it be bonded to the anion?

I'm pretty sure it should be it's own molecule? That makes things much simpler overall.

lilyminium commented 3 weeks ago

Is the right/simplest thing when passed something that is multimolecule to recognize that on read in, and split it into two molecules, so that it returns a list of molecules, rather a single molecule?

This is what happens with Molecule.from_file when it's a list of SMILES, so there's precedent for Molecule.from_x returning a list of molecules rather than a single one.

I'm leaning toward "do it anyway", but maybe we start with emitting a warning for a while so that folks who will be affected can know to downpin/work with us on migrating their use cases.

Warnings either way would be handy -- if the ultimate decision winds up being "do nothing" because it would break too many workflows, a warning that multimolecule Molecules are not fully supported would still be helpful.

j-wags commented 2 weeks ago

I think warnings are the right way to go here. Continuing to do nothing gives us a weak object model where developers who try to build off of our core objects get burned by unforeseen possibilities (like what happened here). But stopping support for this altogether will blow up SPICE and probably trip up people trying to do any early work with metals.

(Molecule.from_file DOES have the behavior of sometimes returning a Molecule and other times returning a list of Molecules and I consider that to be a big mistake - From what I've heard lots of people have been burned by that and nobody has told me they appreciate the behavior or thought it was a good design decision)

Metal representations will be a whole additional question to solve (SMARTS-based parameter assignment will depend heavily on how that's handled), so let's cross that bridge once we get to it.