Molecule.to/from_qcschema does not round trip

Describe the bug The toolkit molecule to_qcschema exports to a schema that represents a QCArchive molecule, however the corresponding from_qcschema actually wants an Entry object, which has the CMILES identifiers. This prevents a round trip to/from qcschema.

In [1]: from openforcefield.topology.molecule import Molecule
In [2]: mol = Molecule.from_smiles("CC")
In [3]: mol.generate_conformers()
In [4]: qcmol = mol.to_qcschema()
In [5]: Molecule.from_qcschema(qcmol)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/projects/openforcefield/openforcefield/topology/molecule.py in from_qcschema(cls, qca_record, client, toolkit_registry, allow_undefined_stereo)
   4458         try:
-> 4459             mapped_smiles = qca_record["attributes"][
   4460                 "canonical_isomeric_explicit_hydrogen_mapped_smiles"

KeyError: 'attributes'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-d24f74fea5e4> in <module>
----> 1 Molecule.from_qcschema(qcmol)

~/projects/openforcefield/openforcefield/topology/molecule.py in from_qcschema(cls, qca_record, client, toolkit_registry, allow_undefined_stereo)
   4462         except KeyError:
   4463             raise KeyError(
-> 4464                 "The record must contain the hydrogen mapped smiles to be safely made from the archive."
   4465             )
   4466 

KeyError: 'The record must contain the hydrogen mapped smiles to be safely made from the archive.'

I think we will need to make some decisions here since this really defines how we interact with QCArchive.

I see two options:

Trust the QCArchive molecule's geometry, symbols, and connectivity, and build the molecule from that. This will not work if there is not a linear mapping of symbols to geometry.
Put the CMILES information into the extras field in the QCArchive molecule. This is what has recently been done to e.g. running MM jobs in QCEngine.

The first is probably more "appropriate" but this is the not the lowest hanging fruit. The lowest hanging fruit is 2, but then it would be the last nail in the coffin of making an extra attribute absolutely essential in our "data standards".

Another option that is slightly half-way is using the Indentifiers object (https://github.com/MolSSI/QCElemental/blob/master/qcelemental/models/molecule.py#L62) in the molecules, which is designed to specifically hold such things. Right now, the default in the new data submissions is just to hold the hash and the Hill formula (which makes sense for a quantum molecule), so we can make an effort to get our CMILES included there as standard.

This would be a good one to talk about in-person, but my thoughts briefly are:

I think that doing 2) by default is a good idea (though we could add a kwarg to NOT include the CMILES if that becomes unwanted)
I'd be happy if we support a pathway for guessing bond orders and such from geometry, but using it should make the user indicate an understanding that something is being guessed. Like, the method name or a required argument should include the phrase guess_bond_orders.

The lowest hanging fruit is 2, but then it would be the last nail in the coffin of making an extra attribute absolutely essential in our "data standards".

Could you clarify whether you think the "nail in the coffin" is a good or bad thing? I (naively) think this is a good idea, but may not be remembering all the context around this.

Right now, the default in the new data submissions is just to hold the hash and the Hill formula (which makes since for a quantum molecule), so we can make an effort to get our CMILES included there as standard.

We could bring this up to MolSSI, but I could see them having good cause to reject adding a widespread requirement for this -- Many of the atomic configurations that QM people study will defy connectivity tables and RDKit's molecule model.

I am always up for discussing this area of the project :)

I'd be happy if we support a pathway for guessing bond orders and such from geometry, but using it should make the user indicate an understanding that something is being guessed. Like, the method name or a required argument should include the phrase guess_bond_orders.

The common way we did this was (taken from the phenyl set)

conn = qcel.molutil.guess_connectivity(molecule.symbols, molecule.geometry)

I think that doing 2) by default is a good idea (though we could add a kwarg to NOT include the CMILES if that becomes unwanted)

That sounds good; I might issue a PR quite soon to get this settled. I think the best way forward in the short-term is always including it, since it is a required piece for from_qcschema to function. I'd mainly like to get the round-tripping solved. Right now I am unable to use QCSubmit because I am starting with a JSON file full of QCSchema molecules (which QCA will readily accept), but QCSubmit only takes OFFTK molecules. How did I get this JSON? I called to_qcschema from the toolkit after I generated conformers. For various reasons, I'd like to separate the conformer generation step from the QCA dataset submission step, so round tripping is quite valuable to me.

Could you clarify whether you think the "nail in the coffin" is a good or bad thing? I (naively) think this is a good idea, but may not be remembering all the context around this.

The extras field is just a dictionary that stores arbitrary data. This means there is no validation, type-checking, etc, and superfluous mistakes like a spelling error could result a completely unusable dataset. The Identifiers object, however, is a Pydantic model, and has the appropriate fields already designed. I guess what I mean by nail in the coffin is that we would be requiring data in our standards that has much lower quality assurance than the rest of the molecule, which does have a stable data model.

Another aspect is that there is currently some initial discussion to make molecules mutable in the database (https://github.com/MolSSI/QCFractal/issues/602), based on the argument that, since there is no validation done on anything in the extras, then we should be free to change it however we see fit.

We could bring this up to MolSSI, but I could see them having good cause to reject adding a widespread requirement for this -- Many of the atomic configurations that QM people study will defy connectivity tables and RDKit's molecule model.

As for including CMILES into the standard data we need: MolSSI has already done their part in supplying the proper model to include CMILES data (unless we also require e.g. protomer fields). The effort I am referring to is that the Identifiers object is constructed and added during QCArchive dataset submission time, which is our part.

This issue can be closed when the following is implemented in a PR:

OpenFF Molecule objects can be created from QCEl Molecule objects that have CMILES in their extras
OpenFF Molecule objects can be roundtripped using Molecule.from_qcschema(offmol.to_qcschema)
- offmol.to_qcschema MAY require a kwarg with extras={'canonical_isomeric_explicit_hydrogen_mapped_smiles':offmol.to_smiles(mapped=True)}
- add a kwarg to offmol.to_qcschema called generate_cmiles that does the above logic automatically, noting that it will overwrite CMILES that are already there.
There's a custom exception raised in the code at the point where a CMILES is NOT present (this will mark where we could implement guessing graph from geometry)
Add tests for all new code pathways above
Test on broad swaths of molecules in QCArchive (so we can see if there is an alternate location where CMILES were stored, or if there are some datasets that we can't load reliably)
- Sometimes the CMILES is stored in a dict under attributes instead of extras -- We will want to see how frequent this is.

Note: Implementation could be as simply as trying to load canonical_isomeric_explicit_hydrogen_mapped_smiles from BOTH the extras and attributes dict of the object, and then ensuring that either

only one of them is set, OR
if both are defined, then they're identical

I think this has been fixed?


In [1]: from openff.toolkit.topology import Molecule
Warning: Unable to load toolkit 'OpenEye Toolkit'. The Open Force Field Toolkit does not require the OpenEye Toolkits, and can use RDKit/AmberTools instead. However, if you have a valid license for the OpenEye Toolkits, consider installing them for faster performance and additional file format support: https://docs.eyesopen.com/toolkits/python/quickstart-python/linuxosx.html OpenEye offers free Toolkit licenses for academics: https://www.eyesopen.com/academic-licensing

In [2]: mol = Molecule.from_smiles("CC")

In [3]: mol.generate_conformers()

In [4]: qcmol = mol.to_qcschema()

In [5]: Molecule.from_qcschema(qcmol)
Out[5]: Molecule with name '' and SMILES '[H][C]([H])([H])[C]([H])([H])[H]'

In [6]: Molecule.from_qcschema(qcmol) == mol
Out[6]: True

In [7]: qcmol
Out[7]: Molecule(name='C2H6', formula='C2H6', hash='ea1e725')

In [8]: qcmol.extras
Out[8]: {'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[C:1]([C:2]([H:6])([H:7])[H:8])([H:3])([H:4])[H:5]'}

Trust the QCArchive molecule's geometry, symbols, and connectivity, and build the molecule from that.

IMO this is the only reasonable way to handle this if the QC stack is going to be supported as an input method, much like the toolkit trusts the geometry, symbols, and connectivity of an RDKit molecule.

This will not work if there is not a linear mapping of symbols to geometry.

I'm not sure the concern here; afaik QCElemental validates this on construction, and the elements must map to the geometry for any valid use of the molecule with qcengine, etc.

>>> mol = qcel.models.Molecule(**{"symbols": ["He", "He"], "geometry": [0, 0, -3]})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/models/molecule.py", line 343, in __init__
    from_schema(kwargs, nonphysical=nonphysical), dtype=kwargs["schema_version"], copy=False, np_out=True
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 48, in from_schema
    dcontig = contiguize_from_fragment_pattern(
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 159, in contiguize_from_fragment_pattern
    raise ValidationError("""dropped atoms! nat = {} != {}""".format(nat, ncgeom.shape[0]))
qcelemental.exceptions.ValidationError: dropped atoms! nat = 2 != 1

>>> mol = qcel.models.Molecule(**{"symbols": ["Hello", "World"], "geometry": [0, 0, -3, 0, 0, 3]})
Traceback (most recent call last):
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 77, in resolve_eliso
    self._eliso2mass[atom.capitalize()]  # type: ignore
KeyError: 'Hello'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 80, in resolve_eliso
    E = self._z2el[int(atom)]
ValueError: invalid literal for int() with base 10: 'Hello'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 83, in resolve_eliso
    E = self._element2el[atom.capitalize()]  # type: ignore
KeyError: 'Hello'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/models/molecule.py", line 343, in __init__
    from_schema(kwargs, nonphysical=nonphysical), dtype=kwargs["schema_version"], copy=False, np_out=True
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 60, in from_schema
    molrec = from_arrays(
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 356, in from_arrays
    processed = validate_and_fill_nuclei(
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 680, in validate_and_fill_nuclei
    *[
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 681, in <listcomp>
    reconcile_nucleus(
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/nucleus.py", line 294, in reconcile_nucleus
    offer_element_symbol(E)
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/nucleus.py", line 168, in offer_element_symbol
    _Z = periodictable.to_Z(e, strict=True)
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 186, in to_Z
    identifier = self._resolve_atom_to_key(atom, strict=strict)
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 94, in _resolve_atom_to_key
    eliso = resolve_eliso(atom)
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 85, in resolve_eliso
    raise NotAnElementError(atom)
qcelemental.exceptions.NotAnElementError: Hello

Right now from the name of the function a new user would be very confused that they can't do this:

>>> mol = qcel.models.Molecule(**{"symbols": ["He", "He"], "geometry": [0, 0, -3, 0, 0, 3]})
>>> Molecule.from_qcschema(mol)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/openff/toolkit/utils/utils.py", line 86, in wrapper
    return function(*args, **kwargs)
  File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/openff/toolkit/topology/molecule.py", line 5443, in from_qcschema
    raise KeyError(
KeyError: "The record must contain the hydrogen mapped smiles to be safely made from the archive. It is not present in either 'attributes' or 'extras' on the provided `qca_record`"

And yet from RDKit:

>>> rdmol = Chem.RWMol()
>>> rdmol.AddAtom(Chem.Atom("He"))
>>> rdmol.AddAtom(Chem.Atom("He"))
>>> Chem.SanitizeMol(rdmol)
>>> Molecule.from_rdkit(rdmol)
Molecule with name '' and SMILES '[He].[He]'

openforcefield / openff-toolkit

Molecule.to/from_qcschema does not round trip #720