openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License

Flaky `from_qcschema` tests since QCArchive migration #1650

Open j-wags opened 1 year ago

j-wags commented 1 year ago

I've tried - again - to figure out what's going on with the flaky from_qcschema examples, and have - again - failed.

The proximal error is:

...
openff/toolkit/utils/toolkit_registry.py::toolkit.utils.toolkit_registry.ToolkitRegistry.resolve PASSED [100%]

=================================== FAILURES ===================================
_______ [doctest] toolkit.topology.molecule.FrozenMolecule.from_qcschema _______
4629         molecule : openff.toolkit.topology.Molecule
4630             An OpenFF molecule instance.
4631 
4632         Examples
4633         --------
4634         Get Molecule from a QCArchive molecule record:
4635 
4636         >>> from qcportal import FractalClient
4637         >>> client = FractalClient()
4638         >>> offmol = Molecule.from_qcschema(
UNEXPECTED EXCEPTION: KeyError("The record must contain the hydrogen mapped smiles to be safely made from the archive. It is not present in either 'attributes' or 'extras' on the provided `qca_record`")
Traceback (most recent call last):
  File "/home/runner/micromamba/envs/openff-toolkit-test/lib/python3.9/doctest.py", line 1334, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest toolkit.topology.molecule.FrozenMolecule.from_qcschema[2]>", line 1, in <module>
  File "/home/runner/micromamba/envs/openff-toolkit-test/lib/python3.9/site-packages/openff/utilities/utilities.py", line 80, in wrapper
    return function(*args, **kwargs)
  File "/home/runner/work/openff-toolkit/openff-toolkit/openff/toolkit/topology/molecule.py", line 4714, in from_qcschema
    raise KeyError(
KeyError: "The record must contain the hydrogen mapped smiles to be safely made from the archive. It is not present in either 'attributes' or 'extras' on the provided `qca_record`"
/home/runner/work/openff-toolkit/openff-toolkit/openff/toolkit/topology/molecule.py:4638: UnexpectedException
=============================== warnings summary ===============================
openff/toolkit/topology/molecule.py::toolkit.topology.molecule.FrozenMolecule.from_qcschema
...

I'm unable to reproduce this locally. To try to reproduce it, I experimented with the limit keyword (which seems to default to its maximum value of 2000) and the skip keyword (which keeps returning 2000 molecules even when raised above 10,000, but returns nothing above 1,000,000, so it's clearly doing something).

For future work, my code is:

import numpy
from matplotlib import pyplot
from openff.toolkit import Molecule
from qcportal import FractalClient

client = FractalClient()
# Formula used by the original doctest, kept for reference:
# dataset = client.query_molecules(molecular_formula="C16H20N3O5")
dataset = client.query_molecules(
    molecular_formula="C7H12N2O4",
    # skip=17500,
    limit=1,
)
print(len(dataset))

# loadable[idx] is 1 if the record converts to a Molecule, 0 otherwise
loadable = numpy.zeros((len(dataset), 1))
for idx, entry in enumerate(dataset):
    try:
        Molecule.from_qcschema(entry)
        loadable[idx] = 1
    except Exception:
        # Any failure (e.g. the KeyError for missing CMILES) counts as not loadable
        pass
pyplot.plot(loadable)

My current hypotheses are either:

Originally posted by @j-wags in https://github.com/openforcefield/openff-toolkit/issues/1646#issuecomment-1597540167

mattwthompson commented 1 year ago

Is there something special about C7H12N2O4 or could a different formula work? If records are goofy they're probably not uniformly goofy across all empirical formulae. Unless they are, which would be a problem.

j-wags commented 1 year ago

C7H12N2O4 is a capped alanine 1-mer, which has all "good" entries as far as I can tell (CMILES present, with fully defined stereo). The old molecule, C16H20N3O5, has some fancy ring stuff that could plausibly confuse a cheminformatics toolkit. The weird thing with the capped alanine 1-mer is that, when I load it on my computer, thousands of records come down, ALL of which are "good". But when CI does it, the first record it gets in the iterator is frequently missing CMILES.
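One way to separate "good" records from those missing CMILES without calling `from_qcschema` is to check for the mapped SMILES directly. The sketch below works on plain dictionaries and assumes the key is named `canonical_isomeric_explicit_hydrogen_mapped_smiles` and lives under `attributes` or `extras`, as the error message suggests; `has_mapped_smiles` and `summarize` are hypothetical helpers, and real QCArchive records may need converting to dictionaries first.

```python
from typing import List, Tuple

# Assumed key name for the hydrogen-mapped SMILES (not confirmed in this thread).
CMILES_KEY = "canonical_isomeric_explicit_hydrogen_mapped_smiles"


def has_mapped_smiles(record: dict) -> bool:
    """Return True if the mapped SMILES is present in 'attributes' or 'extras'."""
    for section in ("attributes", "extras"):
        values = record.get(section) or {}  # tolerate a missing or None section
        if values.get(CMILES_KEY):
            return True
    return False


def summarize(records: List[dict]) -> Tuple[int, List[int]]:
    """Return (number of good records, indices of records missing CMILES)."""
    missing = [i for i, rec in enumerate(records) if not has_mapped_smiles(rec)]
    return len(records) - len(missing), missing


# Made-up records for illustration only.
methane = "[C:1]([H:2])([H:3])([H:4])[H:5]"
sample = [
    {"extras": {CMILES_KEY: methane}},
    {"attributes": {CMILES_KEY: methane}},
    {"extras": {}},
    {"extras": None},
    {},
]

good, missing = summarize(sample)
print(good, missing)
```

Logging the indices (or record IDs) of the missing entries in CI would show whether the same records fail every time or whether the set changes between runs.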

mattwthompson commented 1 year ago

For what it's worth, if I drop the limit argument and run your code, fewer than 100% of the records have CMILES.

Not having investigated either, I noticed a couple of things that could be worth exploring more:

j-wags commented 1 year ago

Working on something else but just stumbled on a possible explanation+fix - https://molssi.github.io/QCFractal/user_guide/molecule.html

[Image: Screen Shot 2023-07-06 at 10 05 19 AM]

trevorgokey commented 1 year ago

I am running into this as well. I have code that queries the legacy server in batches, but some results are now missing from the responses, which is new behavior. If I immediately query a missing ID by itself, I get a result.
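A possible workaround for the behaviour described above is to re-query, one at a time, any ID that the batch response dropped. The sketch below is generic: `fetch_with_fallback`, `flaky_batch`, and `single` are hypothetical names, and the fake store only imitates a server that silently omits some records from batch results.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def fetch_with_fallback(
    ids: Iterable[int],
    batch_query: Callable[[List[int]], List[dict]],
    single_query: Callable[[int], Optional[dict]],
) -> Tuple[Dict[int, dict], List[int]]:
    """Query all IDs in one batch, then retry each dropped ID individually.

    Returns the records found (keyed by ID) and the IDs still missing.
    """
    ids = list(ids)
    found = {rec["id"]: rec for rec in batch_query(ids)}
    still_missing = []
    for record_id in ids:
        if record_id in found:
            continue
        rec = single_query(record_id)  # one-at-a-time retry
        if rec is None:
            still_missing.append(record_id)
        else:
            found[record_id] = rec
    return found, still_missing


# Hypothetical server stand-in: batch queries drop every third record.
_STORE = {i: {"id": i} for i in range(1, 7)}


def flaky_batch(ids: List[int]) -> List[dict]:
    return [_STORE[i] for i in ids if i in _STORE and i % 3 != 0]


def single(record_id: int) -> Optional[dict]:
    return _STORE.get(record_id)


found, still_missing = fetch_with_fallback([1, 2, 3, 4, 5, 6, 99], flaky_batch, single)
print(sorted(found), still_missing)
```

This only papers over the symptom; it does not explain why the batch response is incomplete in the first place.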