openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
311 stars 90 forks source link

Add documentation for preparing PDB ligand (without Hs) + SMILES for simulation #717

Open j-wags opened 4 years ago

j-wags commented 4 years ago

Is your feature request related to a problem? Please describe.

I worked with a user to take a H-less ligand from the PDB and combine it with the information in the PDB ligand's SMILES to produce a fully-detailed representation of the molecule. This workflow for adding hydrogens is probably a super common use case. We should try to tie it into an example. I'll try to do this for the Toolkit Showcase example I'm working towards.

# Add Hs to a ligand for which we have a PDB structure without hydrogens 
# (or with unreliable hydrogens) and a SMILES

from rdkit import Chem
from rdkit.Chem import AllChem
import copy

mol = Chem.MolFromPDBFile('6W63.pdb')
lig_resname = 'X77'
mol_split = Chem.rdmolops.SplitMolByPDBResidues(moål)
lig_mol_wo_bond_orders = copy.deepcopy(mol_split[lig_resname])
lig_mol_wo_bond_orders_wo_Hs = Chem.RemoveHs(lig_mol_wo_bond_orders)
smi_mol = Chem.MolFromSmiles('CC(C)(C)c1ccc(cc1)N([C@@H](C(=O)NC2CCCCC2)c3cccnc3)C(=O)c4c[nH]cn4')
lig_mol = AllChem.AssignBondOrdersFromTemplate(smi_mol, lig_mol_wo_bond_orders_wo_Hs)
lig_mol.AddConformer(lig_mol_wo_bond_orders.GetConformer(0))
lig_mol_H = Chem.rdmolops.AddHs(lig_mol, addCoords=True)
Chem.MolToMolFile(lig_mol_H, 'X77.sdf')
yunhuige commented 4 years ago

A quick question: When adding hydrogens Chem.rdmolops.AddHs , does it consider protonation states of the ligand? Meanwhile, if I can predict pKa values using some tools (e.g. propka), how can I incorporate those information into this code? Thank you!

j-wags commented 4 years ago

does it consider protonation states of the ligand?

In short, "no".

The code snippet above will remove existing hydrogens (if any), and add them back according to the bond orders specified in the SMILES. The hydrogens that are added are considered "implicit" in the SMILES (as in, the number/placement of them can be unambiguously derived from the connectivity in the SMILES).

Meanwhile, if I can predict pKa values using some tools (e.g. propka), how can I incorporate those information into this code?

For things like a carbonyl group, you can specify that it should be protonated by writing it as C(O)(=O) or should not be protonated by writing it as C(O)([O-1]). To specify the states of an amine, you can write something like CN for uncharged (NH2) and C[N+1] for charged (NH3+)

There may be a more elegant way to get propka to write out a final SMILES/sdf/mol2 that can be read in, but I'm not very familiar with it. Does anyone else know?

yunhuige commented 4 years ago

@j-wags Okay, so in my case I got my SMILES from the PDB website. After I used propka for predicting pKa values, I found the carbonyl and the amine groups on the ligand should be charged. This result apparently disagrees with the SMILES on the website. I don't know if I should trust propka or the original SMILES in this case. Do you have any suggestions? So now I think I shouldn't trust the SMILES on the website. Then maybe a workaround could be running propka to predict pKa values first and modify the SMILES based on that. Then using the outcome SMILES in your scripts above. How about that?

davidlmobley commented 4 years ago

@yunhuige SMILES don't have a 1:1 correspondence to molecules -- it's many:1, that is, many SMILES correspond to the same molecule.

The way to check that you have A correct SMILES is to convert the SMILES you start with into a molecule (e.g. an OEMol), then create a canonical SMILES for it. Then convert the other SMILES you have into a molecule in teh same way, and create a canonical SMILES for that. Then check whether the two SMILES are the same. Any given toolkit can have 1:1 correspondence between SMILES:molecule, but SMILES get canonicalized in different ways by different packages.

Or, see our cmiles code which attempts to solve this problem.

j-wags commented 4 years ago

So now I think I shouldn't trust the SMILES on the website

Agreed. I had focused so much on the "how" of assigning its bond orders/protonation state that I forgot to ask "why". The protein data bank doesn't have a better idea about that ligand's protonation state than we do, so you're correct that the SMILES from there should not necessarily be considered "correct".

Then maybe a workaround could be running propka to predict pKa values first and modify the SMILES based on that. Then using the outcome SMILES in your scripts above. How about that?

That sounds great. And, actually, propka is pip-installable, so if we found a way to automate getting a SMILES of its most likely protonation state, we could fold that into one of our examples so other users could benefit :-)

yunhuige commented 4 years ago

That sounds great. And, actually, propka is pip-installable, so if we found a way to automate getting a SMILES of its most likely protonation state, we could fold that into one of our examples so other users could benefit :-)

I think PropKa is not accurate for small molecules yet. In my case, it gave me wrong pKa prediction. Instead, OpenEye tools may be more reliable in this case. That said, we can still use your scripts to generate the molecule from a SMILES. Once we have the outcome SDF file then we can use OpenEye tools to determine protonation state, re-add hydrogens, determine connectivity and perceive bond orders:

from openeye import oechem # OpenEye Python toolkits
from openeye import oequacpac #Charge toolkit
istream = oechem.oemolistream(INPUT_SDF)
mol = oechem.OEMol()
oechem.OEReadMolecule(istream, mol)
# Set to use a simple neutral pH model 
oequacpac.OESetNeutralpHModel(mol)
oechem.OEAddExplicitHydrogens(mol)
oechem.OEDetermineConnectivity(mol)
oechem.OEPerceiveBondOrders(mol)
ostream = oechem.oemolostream(OUTPUT_SDF)
oechem.OEWriteMolecule(ostream, mol)

Then we can use the saved SDF file for parametrization. This should work for "simple" cases. We may need to proceed with extra caution for more complicated ligands though. Also, this should work if the input and output are SDF files. For other file formats, OpenEye tools may not correctly re-add H atoms when the input file has H atoms already (as it is in this case from your scripts).