openforcefield / openff-sage

Scripts, inputs, and results generated as part of training the Sage line of OpenFF force fields.
MIT License

failing molecule from industry benchmark set with rdkit #2

Closed pavankum closed 3 years ago

pavankum commented 3 years ago

@SimonBoothroyd Here is a traceback for one molecule that fails in the qcsubmit call that downloads results from the server. The same molecule builds fine when it is created directly from this SMILES with the OpenFF Toolkit, so I'm not sure what's happening:

1a) Parsing collection
Traceback (most recent call last):
  File "/home/maverick/Desktop/OpenFF/dev-dir/openff-sage/inputs-and-results/benchmarks/qc-opt-geo/01-setup.py", line 211, in <module>
    main()
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/maverick/Desktop/OpenFF/dev-dir/openff-sage/inputs-and-results/benchmarks/qc-opt-geo/01-setup.py", line 118, in main
    result_collection = OptimizationResultCollection.from_server(
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/openff/qcsubmit/results/results.py", line 482, in from_server
    return cls.from_datasets(
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/openff/qcsubmit/results/results.py", line 440, in from_datasets
    {
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/openff/qcsubmit/results/results.py", line 449, in <dictcomp>
    Molecule.from_mapped_smiles(
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/openff/toolkit/topology/molecule.py", line 5299, in from_mapped_smiles
    offmol = cls.from_smiles(
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/openff/toolkit/topology/molecule.py", line 2923, in from_smiles
    molecule = toolkit_registry.call(
  File "/home/maverick/anaconda3/envs/openff-force-fields/lib/python3.9/site-packages/openff/toolkit/utils/toolkit_registry.py", line 381, in call
    raise ValueError(msg)
ValueError: No registered toolkits can provide the capability "from_smiles" for args "('[N:1]([C:2](=[O:3])[c:4]1[c:5]([H:26])[c:6]([H:27])[c:7]([N:8]([C:9](=[O:10])[C:11]([N:12]2[C:13]([H:31])([H:32])[C:14]([H:33])([H:34])[C:15]([N:18]([C:19]([O:20][H:41])=[O:21])[H:40])([H:35])[C:16]([H:36])([H:37])[C:17]2([H:38])[H:39])([H:29])[H:30])[H:28])[c:22]([H:42])[c:23]1[H:43])([H:24])[H:25]',)" and kwargs "{'hydrogens_are_explicit': True, 'allow_undefined_stereo': True, '_cls': <class 'openff.toolkit.topology.molecule.Molecule'>}"
Available toolkits are: []

This was run without OpenEye, using RDKit. The package versions are:

openff-qcsubmit           0.2.4                    pypi_0    pypi
openff-toolkit            0.10.0             pyhd8ed1ab_0    conda-forge
openff-toolkit-base       0.10.0             pyhd8ed1ab_0    conda-forge
rdkit                     2021.03.5        py39hccf6a74_0    conda-forge

SimonBoothroyd commented 3 years ago

Thanks for posting this. I believe this was intentional, as I wanted to be consistent in using OE rather than having some things handled by AT + RDKit and others by OE:

https://github.com/openforcefield/openff-sage/blob/325460b254c3532b96ae43deb5a3a963605609b6/inputs-and-results/benchmarks/qc-opt-geo/01-setup.py#L108-L112

If you use OE then things should work fine.
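For readers puzzling over the "Available toolkits are: []" line in the traceback, the following is a toy model of a registry that dispatches a call to the first registered toolkit providing it. It is only a sketch of the general mechanism: `ToyRegistry` and `ToyRDKit` are hypothetical names, and this is not the OpenFF Toolkit's code nor what 01-setup.py does line by line.

```python
class ToyRegistry:
    """Toy stand-in for a toolkit registry (hypothetical, not the OpenFF class)."""

    def __init__(self):
        self.toolkits = []

    def register(self, toolkit):
        self.toolkits.append(toolkit)

    def deregister(self, toolkit_type):
        # Drop every registered toolkit of the given type.
        self.toolkits = [t for t in self.toolkits if not isinstance(t, toolkit_type)]

    def call(self, method_name, *args, **kwargs):
        # Dispatch to the first toolkit that provides the requested method.
        for toolkit in self.toolkits:
            method = getattr(toolkit, method_name, None)
            if method is not None:
                return method(*args, **kwargs)
        names = [type(t).__name__ for t in self.toolkits]
        raise ValueError(
            f'No registered toolkits can provide the capability "{method_name}". '
            f"Available toolkits are: {names}"
        )


class ToyRDKit:
    """Hypothetical toolkit wrapper that can parse SMILES."""

    def from_smiles(self, smiles):
        return f"molecule({smiles})"


registry = ToyRegistry()
registry.register(ToyRDKit())
ok = registry.call("from_smiles", "CCO")  # works while a toolkit is registered

registry.deregister(ToyRDKit)  # e.g. a script that keeps only one backend
try:
    registry.call("from_smiles", "CCO")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

In this toy model, an empty list in the message means nothing was registered at call time, which is consistent with a script that removes the non-OE toolkits in an environment where OE is not installed.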

pavankum commented 3 years ago

Thank you @SimonBoothroyd, sorry, I forgot about that. OE fails for a few records that have implicit hydrogens, such as '[NH+:1]([C@:2]([C:3](=[O:4])[O-:10])([C:5]([C:6]([C:7]([C:8]([NH+:9]([H:22])[H:23])([H:20])[H:21])([H:18])[H:19])([H:16])[H:17])([H:14])[H:15])[H:13])([H:11])[H:12]'. Maybe I will download the data using RDKit, save it to a file, parse the collection from that file, and continue the rest of the workflow with OE. This was the issue.
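One way to spot such records before handing them to a toolkit is a plain string check for bracket atoms that carry a hydrogen count, such as `[NH+:1]`. The sketch below is an assumption-laden heuristic, not part of qcsubmit or the toolkit: `has_implicit_hydrogens` is a hypothetical helper, and the regular expression covers only simple `@`/`@@` chirality tags.

```python
import re

# Matches a bracket atom whose element symbol (optionally followed by @ or @@)
# is followed by an H count, e.g. [NH+:1], [C@H:2], [nH], [CH3].
# A lone hydrogen atom such as [H:22] does not match, because its "H" is the
# element symbol rather than a hydrogen count.
_HCOUNT_IN_BRACKET = re.compile(r"\[\d*(?:[A-Z][a-z]?|[a-z]{1,2})(?:@{1,2})?H\d*")


def has_implicit_hydrogens(mapped_smiles: str) -> bool:
    """Return True if any bracket atom declares an H count (heuristic)."""
    return _HCOUNT_IN_BRACKET.search(mapped_smiles) is not None


# Fully explicit methane: every hydrogen is its own mapped atom.
explicit = "[C:1]([H:2])([H:3])([H:4])[H:5]"
# Methylammonium written with an H count on the nitrogen, as in the failing records.
implicit = "[NH+:1]([H:2])([H:3])[C:4]([H:5])([H:6])[H:7]"

explicit_flag = has_implicit_hydrogens(explicit)
implicit_flag = has_implicit_hydrogens(implicit)
```

A check like this could be used to list or exclude the affected records, though whether exclusion is appropriate for the benchmark is a separate decision.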

SimonBoothroyd commented 3 years ago

Hmm can we not just exclude such molecules?

"Maybe I will download the data using RDKit, save it to a file, parse the collection from that file, and continue the rest of the workflow with OE"

I'm not 100% sure I follow this - could we not just use OE for everything?

pavankum commented 3 years ago

The results collection builds molecules, and OE fails when it encounters those implicit-hydrogen SMILES. Can I exclude them while downloading the dataset?

pavankum commented 3 years ago

I guess it would be here, where the InChIKey is generated: https://github.com/openforcefield/openff-qcsubmit/blob/4a239bfe606b541b4088a0f95da252ed21526197/openff/qcsubmit/results/results.py#L446-L454
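As a rough illustration of what excluding entries at that point could look like, the sketch below replaces a dictionary comprehension with a loop that records failures instead of aborting. It is not a patch to openff-qcsubmit: `build_lookup` and `fake_inchi_key` are hypothetical, and the real key generation is stood in for by a function that simply raises on SMILES containing an H count on nitrogen.

```python
def build_lookup(entries, make_key):
    """Build {name: key}, skipping entries whose key cannot be generated.

    `entries` maps an entry name to a SMILES string; `make_key` is any
    callable that may raise for inputs it cannot handle.
    """
    lookup, skipped = {}, []
    for name, smiles in entries.items():
        try:
            lookup[name] = make_key(smiles)
        except Exception as exc:  # broad on purpose in this sketch; narrow in real code
            skipped.append((name, str(exc)))
    return lookup, skipped


def fake_inchi_key(smiles):
    # Hypothetical stand-in for toolkit-based key generation.
    if "[NH+" in smiles:
        raise ValueError(f"cannot parse {smiles}")
    return f"KEY-{len(smiles)}"


entries = {
    "entry-a": "[C:1]([H:2])([H:3])([H:4])[H:5]",
    "entry-b": "[NH+:1]([H:2])([H:3])[C:4]([H:5])([H:6])[H:7]",
}
lookup, skipped = build_lookup(entries, fake_inchi_key)
```

Keeping the `skipped` list makes it possible to report which records were dropped rather than losing them silently.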