`PDBFixer.addSolvent` often incorrectly neutralises systems with charged ligands as though they were neutral

Yoshanuikabundi commented 1 month ago

The addSolvent method uses a force field to assign charges, which it then uses to add ions that neutralise the net charge of the box. Unfortunately, the force field seems to assign formal charges of zero to many charged groups in small molecules; for example the citrate ion, which has a -3 formal charge:

from openff.toolkit import Molecule

top = Molecule.from_smiles("C(C(=O)[O-])C(CC(=O)[O-])(C(=O)[O-])O").to_topology()
top.molecule(0).generate_conformers(n_conformers=1)
top.to_file("citrate.pdb")

from pdbfixer import PDBFixer
import openmm

fixer = PDBFixer("citrate.pdb")

fixer.addSolvent(
    padding=2.0 * openmm.unit.nanometer,
    ionicStrength=0.1 * openmm.unit.molar,
    boxShape="dodecahedron",
)

counts = {"NA": 0, "UNK": 0, "CL": 0, "HOH": 0}
for residue in fixer.topology.residues():
    counts[residue.name] += 1
counts # 
{'NA': 3, 'UNK': 1, 'CL': 3, 'HOH': 1457}

The citrate.pdb file generated above is available here: citrate.pdb.zip

With the current main branch (commit 92044f3), you can also see this behavior with a PDB file like 1PO0, which has 2 citrate ions (each with -3 formal charge).

from pdbfixer import PDBFixer
import openmm

fixer = PDBFixer("1PO0.pdb")

fixer.findMissingResidues()

# This section adds caps; leave it commented to rebuild terminal loops
chains_to_cap = {chain for chain, resi in fixer.missingResidues}
for chainidx in chains_to_cap:
    chain = [*fixer.topology.chains()][chainidx]
    last_resi = len([*chain.residues()])
    fixer.missingResidues[chainidx, 0] = ['ACE']
    fixer.missingResidues[chainidx, last_resi] = ['NME']

fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
# fixer.removeHeterogens(keepWater=True)
fixer.findMissingAtoms()
fixer.addMissingAtoms()

fixer.addMissingHydrogens(pH=7.4)

fixer.addSolvent(
    padding=2.0 * openmm.unit.nanometer,
    ionicStrength=0.1 * openmm.unit.molar,
    boxShape="dodecahedron",
)

charges = {
 'ASP': -1,
 'ARG': +1,
 'GLU': -1,
 'LYS': +1,
 'FLC': -3,
 'NA': 1,
 'CL': -1
} # I think this is right!
charge = 0
for residue in fixer.topology.residues():
    if residue.name in charges:
        charge += charges[residue.name]
charge # gives -6 (-3 for each citrate residue)

It would be great if the correct formal charges could be stored when the template is downloaded! Then they could be used when neutralising the system. This might also give an opportunity for users to manually specify the formal charges of UNK residues.

Thanks!

(also the ACE and NME residues in the second example above don't seem to be getting hydrogens - let me know if you'd like me to open a second issue for that!)

peastman commented 1 month ago

That's an interesting corner case I hadn't thought about. PDBFixer.addSolvent() is mostly just a thin wrapper around Modeller.addSolvent():

https://github.com/openmm/pdbfixer/blob/92044f3c524d21ee1e79f53c54882201ff2b5b34/pdbfixer/pdbfixer.py#L1358-L1364

Notice that Modeller lets you pass in a force field to use for identifying charges, but PDBFixer generates its own force field, which of course only knows about standard residues. Here are a few possibilities.

As a short term workaround, you could just copy out the code from PDBFixer.addSolvent(), but pass your own force field instead of having it generate one.
We could modify PDBFixer.addSolvent() to let you pass in a force field to use.
We could try to make _createForceField() generate charges for nonstandard residues based on their formal charges.

A complication with the third one is that it often doesn't know what the formal charges are. Registering a template is only required if you're going to add new atoms to a residue. If all you're doing is preserving a molecule from the input file, it will happily copy it over even if it has no idea what that molecule is. It doesn't need to know anything about it... except in this case, it kind of does because it affects how many ions get added.

Yoshanuikabundi commented 1 month ago

Thanks @peastman, that clarifies the issues really well for me! I might take a crack at option 3 in a fork - would you accept a PR if I can come up with a good solution?

peastman commented 1 month ago

If you can come up with a good solution, sure, but let's discuss it first. What do you have in mind for figuring out the charges?

If we can't figure out anything robust, I'd vote for the second one. It's easy to implement, and if you plan to simulate the system with OpenMM, you presumably have a force field that can parameterize it.

j-wags commented 1 month ago

I'd be happy to try using @Yoshanuikabundi's fork as a way to evaluate implementing Peter's option 3 via either taking formal charges from the CCD, or using the MDAnalysis chemical information guesser at scale. These would both depend on the user knowing their protonation state beforehand, so users could either use the already-implemented protonation functionality, or they could come with fully-protonated inputs.

Unfortunately, Peter's options 1 and 2 would hit a circular dependency in SMIRNOFF force field land. We can't make a FF for a molecule without already knowing its formal charges.

peastman commented 1 month ago

I think that's orthogonal to this issue? This is only about how to determine the total charge for purposes of adding counter ions. Parameterizing the system is much more complicated. And if you have a solution for parameterizing, then allowing a force field to be passed to addSolvent() should handle this case.

Yoshanuikabundi commented 1 month ago

Option 2 would be fine if we had an OpenMM force field that could charge, say, anything in the PDB... but unfortunately we don't. And as you point out, if we did have an appropriate force field we could just use Modeller to add solvent. The createForceField machinery in PDBFixer is part of what makes it such a flexible tool, which is why I think the best path forward is to improve its ability to assign a wide variety of correct formal charges (option 3). I'll also point out that PDBFixer is a tool that can be used for preparing systems for any MD engine, and it would be a shame to tie its full functionality to any specific force field format - in the current API, this relationship is entirely implicit.

I'd propose adding a method to the PDBFixer class that uses the existing machinery for collecting CCD templates to assign formal charges to all atoms. Formal charges could be stored as a list[int|None] on the PDBFixer object, or as a dict[int, int] if we expect formal charges to be sparsely populated and the size of the list to be an issue. This would also let users manually set formal charges for unknown ligands or alternate protonation states.

Then createForceField could use those formal charges when assigning its charges. This would avoid any unnecessary CCD lookup for users who do not require formal charges or solvation. When the CCD is looked up, formal charges could be set at that time to avoid a future duplicate lookup. createForceField could issue a warning if it has any ligands with unknown formal charges, or it could call the new API point to determine charges.

I think @j-wags point is that OpenFF does have a solution for parametrizing, but (a) it isn't in the form of an OpenMM force field, and (b) it requires formal charges.

I'm not familiar enough with the new code to know off the top of my head how much of a refactor the above is, but I'm happy to experiment in a fork!

peastman commented 1 month ago

Feel free to give it a try and see if you can work out something robust. Doing it based on the templates will have two main issues you have to overcome. The first, as noted above, is that we often don't have a template. Ligands may not be in the CCD, or the name may be nonstandard so we don't know what it is. The other is that formal charges depend on protonation. For example, here's the CCD template for LYS. Formal charge is the 5th column.

LYS N   N   N 0 1 N N N Y Y N 37.577 40.385 -3.968 1.422  1.796  0.198  N   LYS 1  
LYS CA  CA  C 0 1 N N S Y N N 38.631 39.459 -4.356 1.394  0.355  0.484  CA  LYS 2  
LYS C   C   C 0 1 N N N Y N Y 38.094 38.304 -5.212 2.657  -0.284 -0.032 C   LYS 3  
LYS O   O   O 0 1 N N N Y N Y 36.873 38.235 -5.490 3.316  0.275  -0.876 O   LYS 4  
LYS CB  CB  C 0 1 N N N N N N 39.374 38.919 -3.139 0.184  -0.278 -0.206 CB  LYS 5  
LYS CG  CG  C 0 1 N N N N N N 38.523 38.111 -2.181 -1.102 0.282  0.407  CG  LYS 6  
LYS CD  CD  C 0 1 N N N N N N 39.164 36.749 -1.903 -2.313 -0.351 -0.283 CD  LYS 7  
LYS CE  CE  C 0 1 N N N N N N 38.106 35.761 -1.382 -3.598 0.208  0.329  CE  LYS 8  
LYS NZ  NZ  N 1 1 N N N N N N 37.176 36.546 -0.539 -4.761 -0.400 -0.332 NZ  LYS 9  
LYS OXT OXT O 0 1 N Y N Y N Y 38.961 37.678 -5.886 3.050  -1.476 0.446  OXT LYS 10 
LYS H   H   H 0 1 N N N Y Y N 37.933 41.152 -3.399 1.489  1.891  -0.804 H   LYS 11 
LYS H2  HN2 H 0 1 N Y N Y Y N 36.812 39.900 -3.498 0.521  2.162  0.464  H2  LYS 12 
LYS HA  HA  H 0 1 N N N Y N N 39.352 40.037 -4.979 1.322  0.200  1.560  HA  LYS 13 
LYS HB2 1HB H 0 1 N N N N N N 40.262 38.326 -3.460 0.210  -0.047 -1.270 HB2 LYS 14 
LYS HB3 2HB H 0 1 N N N N N N 39.882 39.750 -2.596 0.211  -1.359 -0.068 HB3 LYS 15 
LYS HG2 1HG H 0 1 N N N N N N 38.317 38.670 -1.238 -1.128 0.050  1.471  HG2 LYS 16 
LYS HG3 2HG H 0 1 N N N N N N 37.474 38.007 -2.546 -1.130 1.363  0.269  HG3 LYS 17 
LYS HD2 1HD H 0 1 N N N N N N 39.701 36.351 -2.795 -2.287 -0.120 -1.348 HD2 LYS 18 
LYS HD3 2HD H 0 1 N N N N N N 40.034 36.831 -1.210 -2.285 -1.432 -0.145 HD3 LYS 19 
LYS HE2 1HE H 0 1 N N N N N N 37.593 35.194 -2.194 -3.625 -0.023 1.394  HE2 LYS 20 
LYS HE3 2HE H 0 1 N N N N N N 38.544 34.882 -0.854 -3.626 1.289  0.192  HE3 LYS 21 
LYS HZ1 1HZ H 0 1 N N N N N N 36.474 35.891 -0.193 -4.736 -0.185 -1.318 HZ1 LYS 22 
LYS HZ2 2HZ H 0 1 N N N N N N 37.644 37.064 0.203  -4.735 -1.400 -0.205 HZ2 LYS 23 
LYS HZ3 3HZ H 0 1 N N N N N N 36.774 37.350 -1.021 -5.609 -0.031 0.071  HZ3 LYS 24 
LYS HXT HXT H 0 1 N Y N Y N Y 38.628 36.963 -6.415 3.861  -1.886 0.115  HXT LYS 25

It lists a +1 charge on NZ, and no charges anywhere else. But at high pH NZ can be neutral, and terminal residues can be charged. So you can't assume the charges listed in the template are correct for the current model.

openmm / pdbfixer

`PDBFixer.addSolvent` often incorrectly neutralises systems with charged ligands as though they were neutral #299