openforcefield / openff-benchmark

Comparison benchmarks between public force fields and Open Force Field Initiative force fields
MIT License
11 stars 2 forks source link

failing rdMolAlign.GetBestRMS in analysis #103

Open ldamore opened 2 years ago

ldamore commented 2 years ago

rdMolAlign.GetBestRMS fails in some cases when computing the RMSD between a QM optimized and an OPLS optimized molecule. This happens for all the analysis computing the rmsd, see e.g. for swope analysis:

Traceback (most recent call last):,  5.83s/it]
  File "/users/home/ldamored/miniconda/envs/off-analysfix/bin/openff-benchmark", line 8, in <module>
    sys.exit(cli())449it [2:30:31,  4.16s/it]
  File "/users/home/ldamored/miniconda/envs/off-analysfix/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/users/home/ldamored/miniconda/envs/off-analysfix/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/users/home/ldamored/miniconda/envs/off-analysfix/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/users/home/ldamored/miniconda/envs/off-analysfix/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/users/home/ldamored/miniconda/envs/off-analysfix/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/users/home/ldamored/miniconda/envs/off-analysfix/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/users/home/ldamored/miniconda/envs/off-analysfix/lib/python3.7/site-packages/openff/benchmark/cli.py", line 887, in swope
    analysis.swope(input_path, ref_method, output_directory)
  File "/users/home/ldamored/miniconda/envs/off-analysfix/lib/python3.7/site-packages/openff/benchmark/analysis/analysis.py", line 284, in swope
    row['mol'].to_rdkit(), qm_df.loc[qm_min, 'mol'].to_rdkit())
RuntimeError: No sub-structure match found between the reference and probe mol

A try...except was added to the code in order to keep track of the offending molecules: e.g.

for i, row in tqdm(reference.iterrows(), desc='Calculating RMSD'):
    try:
        result.loc[i, 'rmsd'] = rdMolAlign.GetBestRMS(row['mol'].to_rdkit(), result.loc[i, 'mol'].to_rdkit())
    except RuntimeError:
        result.loc[i, 'rmsd'] = np.NaN
        print(f"Unable to calculate best RMSD between {ref_name} and {result_name}; conformer `{i}`")

In the OpenFF Public dataset this led to the identification of the following offending molecules:

Unable to calculate best RMSD between b3lyp-d3bj and opls4_default; conformer GNT-00927-00, GNT-00927-03, GNT-00927-02, GNT-00927-01, GNT-00927-04, GNT-00927-05, GNT-00245-01, GNT-00245-03, GNT-00927-06, GNT-00245-00, GNT-00245-,2, GNT-00927-07, GNT-00245-04, GNT-00245-05, GNT-00855-02, GNT-00855-03, GNT-00855-04, GNT-00855-05, GNT-00855-01, GNT-00855-06, GNT-00855-07, GNT-00855-00, GNT-00691-01, GNT-00691-02, GNT-00691-00, GNT-00691-,6, GNT-00691-05, GNT-00691-03, GNT-00691-04, GNT-00691-07, GNT-00691-08, RCH-00138-00, RCH-00138-01, RCH-00138-04, RCH-00138-07, RCH-00138-02, RCH-00138-03, RCH-00138-08, RCH-00138-05, RCH-00138-09, RCH-00138-,6, RCH-00935-00, RCH-00935-01, RCH-00935-04, RCH-00935-05, RCH-00935-02, RCH-00935-03, RCH-00935-06, RCH-00935-07, RCH-00294-00, RCH-00294-01, RCH-00294-03, RCH-00294-02, RCH-00294-04, RCH-00294-06, RCH-00294-,5, RCH-00643-00, RCH-00643-02, RCH-00643-04, RCH-00643-03, RCH-00643-01, RCH-00643-05, RCH-00643-06, RCH-00643-08, RCH-00643-07, RCH-00643-09, RCH-00324-02, RCH-00324-03, RCH-00324-04, RCH-00324-06, RCH-00324-,5, RCH-00324-07, RCH-00259-04, RCH-00324-05, RCH-00324-07, RCH-00259-04, RCH-01263-02, RCH-01263-01, RCH-01263-00, RCH-01263-04, RCH-01263-03, RCH-01263-05, RCH-01263-07, RCH-01263-06, RCH-00298-01, RCH-00298-,2, RCH-00298-05, RCH-00298-00, RCH-00298-04, RCH-00298-03, RCH-00298-06, RCH-00298-09, RCH-00298-07, RCH-00298-08

By visual inspection of these molecules it appears that there is a different connectivity arrangement between the QM and the OPLS molecule (and more in general between all the molecules optimized with openff-benchmark and the corresponding OPLS optimized molecules).

See e.g. GNT-00927-00 GNT-00927-00 left side: QM mol; right side: opls-optimized mol; it appears that the sulfonate group in the QM mol is [O-S-O], whereas for OPLS is [O=S=O].

Using rdMolAlign.GetBestRMS(MolStandardize.rdMolStandardize.Normalize(rdkitmol) solves the issue for the sulfonate group but the issue still persist in other offending molecules, so it doesn't seem to be a permanent solution.

try:
    result.loc[i, 'rmsd'] = rdMolAlign.GetBestRMS(MolStandardize.rdMolStandardize.Normalize(row['mol'].to_rdkit()),
                                     MolStandardize.rdMolStandardize.Normalize(result.loc[i, 'mol'].to_rdkit())
                                                                                             )
except RuntimeError:
    result.loc[i, 'rmsd'] = np.NaN
    print(f"Unable to calculate best RMSD between {ref_name} and {result_name}; conformer `{i}`")

See e.g. GNT-00245-00, where the connectivity issue persists:

GNT-00245-00

left side (QM mol) has an [O-C=N], whereas in the right side (opls) there is [O=C-N]

dotsdl commented 2 years ago

Excellent, thank you for writing this up @ldamore! We don't have a solution for this as of yet, but this issue will give us a space to address it in time.

ldamore commented 2 years ago

meanwhile I found this, worth to have it here:

https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb