openmm / openmmforcefields

CHARMM and AMBER forcefields for OpenMM (with small molecule support)
http://openmm.org
Other
248 stars 80 forks source link

Trying to use GAFF. How do I generate a SMILES code so I can create the molecule? #257

Closed razvanmarinescu closed 1 year ago

razvanmarinescu commented 1 year ago

Hi all,

I've been using MD for some time now on proteins only, but now I'm trying to run it with a ligand. I'm trying to use the GAFF force field to simulate the binding of HAP-18, a virus capsid modulator. How do I generate a SMILES code (or find an sdf file / OpenEye / etc ...) for it so I can instantiate the Molecule object and follow the example in the README:

from openff.toolkit.topology import Molecule molecule = Molecule.from_smiles('c1ccccc1')

Create the GAFF template generator

from openmmforcefields.generators import GAFFTemplateGenerator gaff = GAFFTemplateGenerator(molecules=molecule)

This is the molecule (HAP-18 - heteroaryldihydropyrimidine-18) I'm interested in: image

It's described in this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4810570/?report=classic

Is there a database of such molecules where I can find the SMILE code, or an sdf file for it? Thanks!

mattwthompson commented 1 year ago

If the authors do not report SMILES or other substitute identifiers (InChi, IUPAC, etc.) for ligands or you cannot find one, you can fall back to a tool like PubChem's drawer to go from a 2D structure to SMILES etc. If you need to make other conversions, a tool like RDKit is probably the best starting point.

Note the stereochemistry of one of those biaryl bonds.

peastman commented 1 year ago

They could have made that easier for you! In the paper, they give the PDB ID for their crystal structure: 5D7Y. So I looked it up and found the ligand has residue ID 58H. I looked that ID up on Ligand Expo, and found it lists several different SMILES and InChI descriptions for it.

58H SMILES           ACDLabs              12.01 "C(C(C4CCN(CC1=C(C(OC)=O)C(N=C(N1)c2ccccn2)c3c(Cl)cc(F)cc3)CC4)(O)C#C)#C"                                                                                                                            
58H InChI            InChI                1.03  "InChI=1S/C28H26ClFN4O3/c1-4-28(36,5-2)18-11-14-34(15-12-18)17-23-24(27(35)37-3)25(20-10-9-19(30)16-21(20)29)33-26(32-23)22-8-6-7-13-31-22/h1-2,6-10,13,16,18,25,36H,11-12,14-15,17H2,3H3,(H,32,33)" 
58H InChIKey         InChI                1.03  GKZRJYVILONFJM-UHFFFAOYSA-N                                                                                                                                                                          
58H SMILES_CANONICAL CACTVS               3.385 "COC(=O)C1=C(CN2CC[C@@H](CC2)C(O)(C#C)C#C)NC(=N[C@H]1c3ccc(F)cc3Cl)c4ccccn4"                                                                                                                         
58H SMILES           CACTVS               3.385 "COC(=O)C1=C(CN2CC[CH](CC2)C(O)(C#C)C#C)NC(=N[CH]1c3ccc(F)cc3Cl)c4ccccn4"                                                                                                                            
58H SMILES_CANONICAL "OpenEye OEToolkits" 1.9.2 "COC(=O)C1=C(NC(=NC1c2ccc(cc2Cl)F)c3ccccn3)CN4CCC(CC4)C(C#C)(C#C)O"                                                                                                                                  
58H SMILES           "OpenEye OEToolkits" 1.9.2 "COC(=O)C1=C(NC(=NC1c2ccc(cc2Cl)F)c3ccccn3)CN4CCC(CC4)C(C#C)(C#C)O"                                                                                                                                  

Not being a chemist, I'm not sure what the differences between them are. They also have a SDF file for it, if that's easier.

razvanmarinescu commented 1 year ago

Thank you @mattwthompson. @peastman, that is exactly what I was looking for, thank you so much!

Do you reckon that those SMILES results from Ligand Expo are all equivalent? Or there would actually be important differences between them?

I'll likely try with the sdf file first in this case.

peastman commented 1 year ago

Using RDKit to canonicalize them gives identical results for three of them, but a different string for the fourth:

smiles = [
    'C(C(C4CCN(CC1=C(C(OC)=O)C(N=C(N1)c2ccccn2)c3c(Cl)cc(F)cc3)CC4)(O)C#C)#C',
    'COC(=O)C1=C(CN2CC[C@@H](CC2)C(O)(C#C)C#C)NC(=N[C@H]1c3ccc(F)cc3Cl)c4ccccn4',
    'COC(=O)C1=C(CN2CC[CH](CC2)C(O)(C#C)C#C)NC(=N[CH]1c3ccc(F)cc3Cl)c4ccccn4',
    'COC(=O)C1=C(NC(=NC1c2ccc(cc2Cl)F)c3ccccn3)CN4CCC(CC4)C(C#C)(C#C)O']
from rdkit import Chem
for s in smiles:
    print(Chem.MolToSmiles(Chem.MolFromSmiles(s)))

Here's the output:

C#CC(O)(C#C)C1CCN(CC2=C(C(=O)OC)C(c3ccc(F)cc3Cl)N=C(c3ccccn3)N2)CC1
C#CC(O)(C#C)C1CCN(CC2=C(C(=O)OC)[C@H](c3ccc(F)cc3Cl)N=C(c3ccccn3)N2)CC1
C#CC(O)(C#C)C1CCN(CC2=C(C(=O)OC)C(c3ccc(F)cc3Cl)N=C(c3ccccn3)N2)CC1
C#CC(O)(C#C)C1CCN(CC2=C(C(=O)OC)C(c3ccc(F)cc3Cl)N=C(c3ccccn3)N2)CC1

The difference is that C is replaced by [C@H]. I'm not sure whether that's a real difference, or whether it's just clarifying the stereochemistry.

razvanmarinescu commented 1 year ago

Oh I see, that is great! I'll then use those 3 equivalent ones.

jchodera commented 1 year ago

@razvanmarinescu : You have to be a bit careful here---the PDB file you feed to OpenMM must have the explicitly protonated ligand (while PDB files you obtain from the Protein Data Bank generally only contains heavy atoms, not protons) and the SMILES string you use to construct the ligand must have exactly the same explicit protonation state. It may be easier to delete the ligand in the PDB file and replace it with the explicitly protonated PDB version downloadable from the Ligand Expo 'downloads' page that has been prepared by OpenEye. You can also load the exact same version (as PDB or SDF) into an OpenFF Molecule object and pass that to the GAFF residue template generator to make sure their protonation states match.

@peastman : Thanks for the help!

razvanmarinescu commented 1 year ago

Hi @jchodera, thank you so much for the tip. So the SMILES strings we have above are non-protonated, while OpenMM needs a protonated SMILES to do the simulation.

Once I download the file with the protonated ligand, how do I place it at the exact (bound) position where it's meant to be? Do I align it based on RMSD? Or is there a better way?

Thanks!

peastman commented 1 year ago

Hydrogens are usually implicit in SMILES strings. Tools like RDKit and OpenFF Toolkit can add them automatically. See, for example, https://docs.openforcefield.org/projects/toolkit/en/stable/api/generated/openff.toolkit.topology.Molecule.html#openff.toolkit.topology.Molecule.from_rdkit.

Since no one else has answered your other question I'll try, though I don't really have a lot of experience at this sort of thing. Hopefully someone else can provide a better answer.

Figuring out where a ligand binds is a complicated scientific question. The best case is if you have a crystal structure showing exactly where it binds. If you don't have that but you have a good idea where the binding site is, you could try positioning it approximately, minimize energy, and then run some dynamics to let it settle into place. And of course there is a whole field of docking algorithms that try to figure out the binding site automatically if you don't already know it.

It also depends on what you're trying to learn. In many cases you expect the ligand to repeatedly bind and unbind, and that's what you want to study: how quickly does it bind, how long does it stay bound, etc. In that case, it's ok if it doesn't start out in the right position. You want to see how quickly it gets there.

razvanmarinescu commented 1 year ago

@peastman thank a lot, that all makes sense. I think aligning based on RMSD would be enough.

My main issue now is that I have trouble building the GAFF force field, see my other issue that I raised. Can you have a quick look on that code, would really appreciate any help! I've been trying to fix it but cannot figure it out.