Open trevorgokey opened 4 years ago
This would be a good one to talk about in-person, but my thoughts briefly are:
guess_bond_orders
. The lowest hanging fruit is 2, but then it would be the last nail in the coffin of making an extra attribute absolutely essential in our "data standards".
Could you clarify whether you think the "nail in the coffin" is a good or bad thing? I (naively) think this is a good idea, but may not be remembering all the context around this.
Right now, the default in the new data submissions is just to hold the hash and the Hill formula (which makes since for a quantum molecule), so we can make an effort to get our CMILES included there as standard.
We could bring this up to MolSSI, but I could see them having good cause to reject adding a widespread requirement for this -- Many of the atomic configurations that QM people study will defy connectivity tables and RDKit's molecule model.
I am always up for discussing this area of the project :)
I'd be happy if we support a pathway for guessing bond orders and such from geometry, but using it should make the user indicate an understanding that something is being guessed. Like, the method name or a required argument should include the phrase guess_bond_orders.
The common way we did this was (taken from the phenyl set)
conn = qcel.molutil.guess_connectivity(molecule.symbols, molecule.geometry)
I think that doing 2) by default is a good idea (though we could add a kwarg to NOT include the CMILES if that becomes unwanted)
That sounds good; I might issue a PR quite soon to get this settled. I think the best way forward in the short-term is always including it, since it is a required piece for from_qcschema
to function. I'd mainly like to get the round-tripping solved. Right now I am unable to use QCSubmit because I am starting with a JSON file full of QCSchema molecules (which QCA will readily accept), but QCSubmit only takes OFFTK molecules. How did I get this JSON? I called to_qcschema
from the toolkit after I generated conformers. For various reasons, I'd like to separate the conformer generation step from the QCA dataset submission step, so round tripping is quite valuable to me.
Could you clarify whether you think the "nail in the coffin" is a good or bad thing? I (naively) think this is a good idea, but may not be remembering all the context around this.
The extras
field is just a dictionary that stores arbitrary data. This means there is no validation, type-checking, etc, and superfluous mistakes like a spelling error could result a completely unusable dataset. The Identifiers
object, however, is a Pydantic model, and has the appropriate fields already designed. I guess what I mean by nail in the coffin is that we would be requiring data in our standards that has much lower quality assurance than the rest of the molecule, which does have a stable data model.
Another aspect is that there is currently some initial discussion to make molecules mutable in the database (https://github.com/MolSSI/QCFractal/issues/602), based on the argument that, since there is no validation done on anything in the extras
, then we should be free to change it however we see fit.
We could bring this up to MolSSI, but I could see them having good cause to reject adding a widespread requirement for this -- Many of the atomic configurations that QM people study will defy connectivity tables and RDKit's molecule model.
As for including CMILES into the standard data we need: MolSSI has already done their part in supplying the proper model to include CMILES data (unless we also require e.g. protomer fields). The effort I am referring to is that the Identifiers
object is constructed and added during QCArchive dataset submission time, which is our part.
This issue can be closed when the following is implemented in a PR:
Molecule
objects can be created from QCEl Molecule
objects that have CMILES in their extras
Molecule
objects can be roundtripped using Molecule.from_qcschema(offmol.to_qcschema)
offmol.to_qcschema
MAY require a kwarg with extras={'canonical_isomeric_explicit_hydrogen_mapped_smiles':offmol.to_smiles(mapped=True)}
offmol.to_qcschema
called generate_cmiles
that does the above logic automatically, noting that it will overwrite CMILES that are already there.attributes
instead of extras
-- We will want to see how frequent this is. Note: Implementation could be as simply as trying to load canonical_isomeric_explicit_hydrogen_mapped_smiles
from BOTH the extras
and attributes
dict of the object, and then ensuring that either
I think this has been fixed?
In [1]: from openff.toolkit.topology import Molecule
Warning: Unable to load toolkit 'OpenEye Toolkit'. The Open Force Field Toolkit does not require the OpenEye Toolkits, and can use RDKit/AmberTools instead. However, if you have a valid license for the OpenEye Toolkits, consider installing them for faster performance and additional file format support: https://docs.eyesopen.com/toolkits/python/quickstart-python/linuxosx.html OpenEye offers free Toolkit licenses for academics: https://www.eyesopen.com/academic-licensing
In [2]: mol = Molecule.from_smiles("CC")
In [3]: mol.generate_conformers()
In [4]: qcmol = mol.to_qcschema()
In [5]: Molecule.from_qcschema(qcmol)
Out[5]: Molecule with name '' and SMILES '[H][C]([H])([H])[C]([H])([H])[H]'
In [6]: Molecule.from_qcschema(qcmol) == mol
Out[6]: True
In [7]: qcmol
Out[7]: Molecule(name='C2H6', formula='C2H6', hash='ea1e725')
In [8]: qcmol.extras
Out[8]: {'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[C:1]([C:2]([H:6])([H:7])[H:8])([H:3])([H:4])[H:5]'}
Trust the QCArchive molecule's geometry, symbols, and connectivity, and build the molecule from that.
IMO this is the only reasonable way to handle this if the QC stack is going to be supported as an input method, much like the toolkit trusts the geometry, symbols, and connectivity of an RDKit molecule.
This will not work if there is not a linear mapping of symbols to geometry.
I'm not sure the concern here; afaik QCElemental validates this on construction, and the elements must map to the geometry for any valid use of the molecule with qcengine, etc.
>>> mol = qcel.models.Molecule(**{"symbols": ["He", "He"], "geometry": [0, 0, -3]})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/models/molecule.py", line 343, in __init__
from_schema(kwargs, nonphysical=nonphysical), dtype=kwargs["schema_version"], copy=False, np_out=True
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 48, in from_schema
dcontig = contiguize_from_fragment_pattern(
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 159, in contiguize_from_fragment_pattern
raise ValidationError("""dropped atoms! nat = {} != {}""".format(nat, ncgeom.shape[0]))
qcelemental.exceptions.ValidationError: dropped atoms! nat = 2 != 1
>>> mol = qcel.models.Molecule(**{"symbols": ["Hello", "World"], "geometry": [0, 0, -3, 0, 0, 3]})
Traceback (most recent call last):
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 77, in resolve_eliso
self._eliso2mass[atom.capitalize()] # type: ignore
KeyError: 'Hello'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 80, in resolve_eliso
E = self._z2el[int(atom)]
ValueError: invalid literal for int() with base 10: 'Hello'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 83, in resolve_eliso
E = self._element2el[atom.capitalize()] # type: ignore
KeyError: 'Hello'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/models/molecule.py", line 343, in __init__
from_schema(kwargs, nonphysical=nonphysical), dtype=kwargs["schema_version"], copy=False, np_out=True
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 60, in from_schema
molrec = from_arrays(
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 356, in from_arrays
processed = validate_and_fill_nuclei(
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 680, in validate_and_fill_nuclei
*[
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 681, in <listcomp>
reconcile_nucleus(
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/nucleus.py", line 294, in reconcile_nucleus
offer_element_symbol(E)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/nucleus.py", line 168, in offer_element_symbol
_Z = periodictable.to_Z(e, strict=True)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 186, in to_Z
identifier = self._resolve_atom_to_key(atom, strict=strict)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 94, in _resolve_atom_to_key
eliso = resolve_eliso(atom)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 85, in resolve_eliso
raise NotAnElementError(atom)
qcelemental.exceptions.NotAnElementError: Hello
Right now from the name of the function a new user would be very confused that they can't do this:
>>> mol = qcel.models.Molecule(**{"symbols": ["He", "He"], "geometry": [0, 0, -3, 0, 0, 3]})
>>> Molecule.from_qcschema(mol)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/openff/toolkit/utils/utils.py", line 86, in wrapper
return function(*args, **kwargs)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/openff/toolkit/topology/molecule.py", line 5443, in from_qcschema
raise KeyError(
KeyError: "The record must contain the hydrogen mapped smiles to be safely made from the archive. It is not present in either 'attributes' or 'extras' on the provided `qca_record`"
And yet from RDKit:
>>> rdmol = Chem.RWMol()
>>> rdmol.AddAtom(Chem.Atom("He"))
>>> rdmol.AddAtom(Chem.Atom("He"))
>>> Chem.SanitizeMol(rdmol)
>>> Molecule.from_rdkit(rdmol)
Molecule with name '' and SMILES '[He].[He]'
Describe the bug The toolkit molecule
to_qcschema
exports to a schema that represents a QCArchive molecule, however the correspondingfrom_qcschema
actually wants an Entry object, which has the CMILES identifiers. This prevents a round trip to/from qcschema.I think we will need to make some decisions here since this really defines how we interact with QCArchive.
I see two options:
Trust the QCArchive molecule's geometry, symbols, and connectivity, and build the molecule from that. This will not work if there is not a linear mapping of symbols to geometry.
Put the CMILES information into the
extras
field in the QCArchive molecule. This is what has recently been done to e.g. running MM jobs in QCEngine.The first is probably more "appropriate" but this is the not the lowest hanging fruit. The lowest hanging fruit is 2, but then it would be the last nail in the coffin of making an
extra
attribute absolutely essential in our "data standards".Another option that is slightly half-way is using the
Indentifiers
object (https://github.com/MolSSI/QCElemental/blob/master/qcelemental/models/molecule.py#L62) in the molecules, which is designed to specifically hold such things. Right now, the default in the new data submissions is just to hold the hash and the Hill formula (which makes sense for a quantum molecule), so we can make an effort to get our CMILES included there as standard.