openforcefield / openff-qcsubmit

Automated tools for submitting molecules to QCFractal
https://openff-qcsubmit.readthedocs.io/en/latest/index.html
MIT License
26 stars 4 forks source link

QCSubmit cannot process some existing datasets #216

Closed mattwthompson closed 1 year ago

mattwthompson commented 1 year ago
from openff.qcsubmit.results import OptimizationResultCollection
from qcportal import FractalClient

client = FractalClient()

OptimizationResultCollection.from_server(
    client=client,
    datasets="OpenFF Industry Benchmark Season 1 v1.0",
    spec_name="default",
)

Out

<<<Several parchment scrolls of stereochemistry warnings>>>

ValueError: 'hydrogens_are_explicit' was specified as True, but OpenEye Toolkit interpreted SMILES '[NH+:1]([C@:2]([C:3](=[O:4])[O-:10])([C:5]([C:6]([C:7]([C:8]([NH+:9]([H:22])[H:23])([H:20])[H:21])([H:18])[H:19])([H:16])[H:17])([H:14])[H:15])[H:13])([H:11])[H:12]' as having implicit hydrogen. If this SMILES is intended to express all explicit hydrogens in the molecule, then you should construct the desired molecule as an OEMol (where oechem.OEHasImplicitHydrogens(oemol) returns False), and then use Molecule.from_openeye() to create the desired OFFMol.

This appears to result from newer (0.11+?) versions of the toolkit having different cheminformatics assumptions. There doesn't appear to be a documented way to filter records while grabbing a collection from the server:

In [8]: OptimizationResultCollection.from_server?
Signature:
OptimizationResultCollection.from_server(
    client: qcportal.client.FractalClient,
    datasets: Union[str, Iterable[str]],
    spec_name: str = 'default',
) -> 'OptimizationResultCollection'
Docstring:
Retrieve (and deduplicate) the COMPLETE record ids referenced by the
specified datasets.

Args:
    client: The fractal client that should be used to interface with the running
        QCFractal instance which stores the datasets and their associated
        results records.
    datasets: The names of the datasets
    spec_name: The name of the QC specification that the records to retrieve
        should have been computed using.

Returns:
    A results collection object containing the record ids and a minimal amount
    of associated metadata such as the CMILES of the associated molecule.
File:      ~/mambaforge/envs/openff-interchange-env/lib/python3.10/site-packages/openff/qcsubmit/results/results.py
Type:      method

Creating a separate environment with an older version of the toolkit, grabbing the collection, and serializing out to disk is a workaround. I argue this is not sustainable nor consistent with the ethos that our infrastructure should facilitate scientists accessing existing data with minimal overhead.

In [9]: from openff.toolkit import __version__
   ...:
   ...: print(__version__)
0.13.1

In [10]: from openff.qcsubmit import __version__
    ...: print(__version__)
0.4.1
mattwthompson commented 1 year ago

I have not thought about this too deeply but a possible solution might, in English, to filter things out before the conversion to Molecule.

jthorton commented 1 year ago

Thanks for writing this up @mattwthompson and bringing it back to my attention, I am not quite sure how we managed to get mapped smiles strings with implicit hydrogens from the toolkit but we had a lot of funky inputs from the industry benchmark. I think the filtering makes sense and we should catch the issue around here when creating the datasets. It would also be great if we had some safeguards in place to detect molecules in the dataset generation pipeline with this issue and make all hydrogens explicit if this is something the toolkit does not automatically do now.

related to #166

j-wags commented 1 year ago

Thanks for the breadcrumbs on this. I'm trying to refamiliarize myself with this stack and am testing a few things out. It does seem like the root issue is with the entry as stored on QCA (its CMILES has implicit Hs). I looks like OFFTK 0.13 with the RDKit backend can read the entry, but not with OpenEye. I'll verify whether OFFTK 0.10 can read with both backends tomorrow. But generally it seems like there are two things to follow up on:

  1. Determining whether our stack is still capable of making implicit H CMILES and preventing it from doing that
  2. Trying to make the loader more permissive so a single entry doesn't tank the loading of a whole dataset.
jthorton commented 1 year ago

It would make sense if only the rdkit backend has this issue as we used that backend on this set to match the internal workflows of the partners running the benchmarking to make the molecule preparation consistent.

pavankum commented 1 year ago

I think we were using v1.1 instead of v1.0 because of this issue.

j-wags commented 1 year ago

Thanks Pavan. Unfortunately I'm still seeing the same CMILES with two NH+ in the 1.1 dataset as well. Previous mention of implicit Hs making trouble is in the PR for the 1.1 dataset though. It may be that the actual structure was fixed in those calcs but the metadata (such as CMILES) remained troublesome.