SMILES shown in output is wrong

okbckim commented 9 months ago

Description

I ran openff-bespoke in command with the compound in the right panel. But openff-bespoke output file show wrong SMILES as shown in the left panel. The correct SMILES is O=C(Nc1c[nH]nc1-c1ccccn1)c1ccn(C2CCCC2)n1 and output SMILES is O=C(Nc1cnc1-c1ccccn1)c1ccn(C2CCCC2)n1. Note that [nH] is missing in the wrong SMILES.

Please describe what you were trying to do, what you expected to happen, and what happened instead. Use the correct SMILES

Reproduction

BEFLOW_OPTIMIZER_KEEP_FILES=True openff-bespoke executor run --n-fragmenter-workers 2 --n-optimizer-workers 2  --n-qc-compute-workers 2 --qc-compute-n-cores 1 --file Z1559977118.sdf --output Z1559977118.sdf.json --output-force-field Z1559977118.sdf.offxml --workflow "default" --default-qc-spec  xtb gfn2xtb none

Output

┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ ID ┃ SMILES                                ┃ NAME        ┃ FILE            ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ 1  │ O=C(Nc1cnc1-c1ccccn1)c1ccn(C2CCCC2)n1 │ Z1559977118 │ Z1559977118.sdf │

Software versions

openff-bespokefit 0.2.3

Which operating system and version are you using?

Ubuntu

How did you install BespokeFit?

as in document

Are you using Apple Silicon? If so, are you running BespokeFit in Rosetta (osx-64) or natively (osx-arm64)?

No
What is the output of running conda list?

Output of conda list


Please place the output of `conda list` here
openff-amber-ff-ports     0.0.4              pyhca7485f_0    conda-forge
openff-bespokefit         0.2.3              pyhd8ed1ab_0    conda-forge
openff-forcefields        2023.11.0          pyhca7485f_0    conda-forge
openff-fragmenter-base    0.2.0              pyhd8ed1ab_0    conda-forge
openff-interchange        0.3.10             pyhd8ed1ab_0    conda-forge
openff-interchange-base   0.3.10             pyhd8ed1ab_0    conda-forge
openff-models             0.1.1              pyhca7485f_0    conda-forge
openff-qcsubmit           0.5.0              pyhd8ed1ab_0    conda-forge
openff-toolkit            0.13.2             pyhd8ed1ab_0    conda-forge
openff-toolkit-base       0.13.2             pyhd8ed1ab_0    conda-forge
openff-units              0.2.1              pyh1a96a4e_0    conda-forge
openff-utilities          0.1.12             pyhd8ed1ab_0    conda-forge

j-wags commented 9 months ago

Thanks for the report. I'm working on reproducing this. The report didn't include the input file, could you add that if possible? I'm assuming it's https://zinc.docking.org/substances/ZINC000156786881/ but this page doesn't offer an SDF file download, so I may need to know how you created it to help you here.

j-wags commented 9 months ago

Something seems off here. Both SMILESes reuse the 1 ring-closing index several times, which I've never seen before. And assuming the input is the ZINC molecule from my last post, there shouldn't be any 4-membered rings. This may be related to the ring-closing index reuse. I don't think I can move this forward without more details about the input.

okbckim commented 9 months ago

Thank you so much for looking into it. Initially I couldn't upload sdf format in this forum. I compressed it in gzip complying to file format for uploading. This compound is from enamine, here is the link, https://enaminestore.com/catalog/Z1559982889 Z1559982889.sdf.gz

mattwthompson commented 9 months ago

Hard to isolate it to the toolkit's default behavior, but then again I can't get anything to pass isomorphism


In [1]: from openff.toolkit import Molecule

In [2]: smiles1 = "O=C(Nc1c[nH]nc1-c1ccccn1)c1ccn(C2CCCC2)n1"

In [3]: smiles2 = "O=C(Nc1cnc1-c1ccccn1)c1ccn(C2CCCC2)n1"

In [4]: Molecule.from_file("Z1559982889.sdf").is_isomorphic_with(Molecule.from_smiles(smiles1))
Out[4]: False

In [5]: Molecule.from_file("Z1559982889.sdf").is_isomorphic_with(Molecule.from_smiles(smiles2))
Out[5]: False

In [6]: Molecule.from_file("Z1559982889.sdf").to_smiles(explicit_hydrogens=False)
Out[6]: 'c1ccnc(c1)C2=NNC=C2NC(=O)C3=C4C=CC=CN4N=C3'  # or O=C(Nc1c[nH]nc1-c1ccccn1)c1cnn2ccccc12 with RDKit

j-wags commented 9 months ago

The molecule in @okbckim's second post (Z1559982889) is different than the one in the first (Z1559977118). I've run the second molecule and verified that it does have the same problem, where the molecule in the table output is clearly different than the input.

Output of running the second molecule


(bespokefit) jw@mba$ BEFLOW_OPTIMIZER_KEEP_FILES=True openff-bespoke executor run --n-fragmenter-workers 2 --n-optimizer-workers 2  --n-qc-compute-workers 2 --qc-compute-n-cores 1 --file Z1559982889.sdf --output 1.json --output-force-field 1.offxml --workflow "default" --default-qc-spec  xtb gfn2xtb none 

──────────────────────────────── OpenFF Bespoke ────────────────────────────────

[✓] bespoke executor launched

1. preparing the bespoke workflow                                               

[✓] 1 molecules found
[✓] fitting schemas generated

2. submitting the workflow                                                      

[✓] the following workflows were submitted
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ ID ┃ SMILES                             ┃ NAME        ┃ FILE            ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ 1  │ O=C(Nc1cnc1-c1ccccn1)c1cnn2ccccc12 │ Z1559982889 │ Z1559982889.sdf │
└────┴────────────────────────────────────┴─────────────┴─────────────────┘

3. running the fitting pipeline                                                 

[✓] fragmentation successful
[✓] qc-generation successful
[✓] optimization successful

outputs have been saved to 1.json                                               

the bespoke force field has been saved to 1.offxml                              

worker: Warm shutdown (MainProcess)

worker: Warm shutdown (MainProcess)

worker: Warm shutdown (MainProcess)

The SMILES in the output has a 4-membered ring, which isn't present in the input mol. And it also shows the re-use of the 1 ring-closing index, which I suspect is related.

My next two questions are:

Is the final output still correct (and this issue is just cosmetic)?
How do we fix the issue, even if it is cosmetic?

jthorton commented 9 months ago

This sure is a strange one the code to print the smiles is here, Ill dig back some more to see if we do anything to the molecule before printing the smiles.

j-wags commented 9 months ago

I've confirmed that the output FF does have all its bespokefit-created torsion parameters applied to the input molecule. So it's unlikely that the actual parameters are being fit to mangled chemistry. The new torsion parameters in the file are t169-t184, and I see them all appearing when I inspect parameter application using the output FF.

from openff.toolkit import Molecule, ForceField
ff = ForceField('1.offxml')
mol = Molecule.from_file("Z1559982889.sdf")
interchange = ff.create_interchange(mol.to_topology())
seen_params = set()
for key in interchange.collections["ProperTorsions"].get_mapping().keys():
    seen_params.add(ff["ProperTorsions"][key.id].id)
print(sorted(list(seen_params)))

['t138', 't169', 't170', 't171', 't172', 't173', 't174', 't175', 't176', 't177', 't178', 't179', 't180', 't181', 't182', 't183', 't184', 't43', 't44', 't45', 't47', 't73', 't80', 't84', 't85', 't86']

My files are here: bespokefit-319.tar.gz

So I'm going to move forward assuming that this closes the book on question 1. I'll look into how to fix 2 next.

okbckim commented 9 months ago

Thank you very much for thoroughly investigating this issue. Very good to hear it is just cosmetic issue.

j-wags commented 9 months ago

Ha, rich's formatting modifies the string before printing it out to table.

import rich
import rich.table
table = rich.table.Table()
table.add_column("SMILES")

table.add_row('O=C(Nc1c[nH]nc1-c1ccccn1)c1cnn2ccccc12') # Original
table.add_row('O=C(Nc1cnHnc1-c1ccccn1)c1cnn2ccccc12') #Without brackets around nH
table.add_row('O=C(Nc1c\[nH]nc1-c1ccccn1)c1cnn2ccccc12') # With escaped brackets around nH

console = rich.get_console()
console.print(table)

 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ SMILES                                 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ O=C(Nc1cnc1-c1ccccn1)c1cnn2ccccc12     │
│ O=C(Nc1cnHnc1-c1ccccn1)c1cnn2ccccc12   │
│ O=C(Nc1c[nH]nc1-c1ccccn1)c1cnn2ccccc12 │
└────────────────────────────────────────┘

So next I need to figure out how to get rich print the string exactly as it's given.

j-wags commented 9 months ago

rich.markup.escape(string) seems to do the trick. I'll open a PR.

Any tips for the easiest way to test for this change?

mattwthompson commented 9 months ago

Something in here could probably be frankensteined with mocking to capture the console after submitting a molecule: https://github.com/openforcefield/openff-bespokefit/blob/3f5e73edeaebd2361a76e8a0b4dcd0641e5a7342/openff/bespokefit/tests/cli/executor/test_submit.py#L324-L360

openforcefield / openff-bespokefit

SMILES shown in output is wrong #319