openmm / pdbfixer

PDBFixer fixes problems in PDB files
Other
498 stars 117 forks source link

Neutralise systems with formal charges from the PDB #301

Closed Yoshanuikabundi closed 2 months ago

Yoshanuikabundi commented 4 months ago

This PR addresses #299 by:

The refactor avoids re-downloading a CCD entry up to 3 times (once for protonation, once for missing heavy atoms, and a third time for formal charge) and reduces duplicate code. However, it is not essential for the other functionality and can be removed or split into its own PR if desired.

This solution works around the issues discussed in #299 where CCD residues tend to hard-code a single protonation state by only using the CCD formal charge when the force field does not provide the residue.

My reasoning is that there are two cases: first, when the force field includes a residue, it is capable of identifying the protonation state and gives the correct net charge and so it can only introduce regressions to interfere with this. In the second case, when a residue is introduced to the force field by _createForceField(water=True), the atom's partial charge is currently hard-coded to 0 and so using more information to sometimes set that to better values is appropriate.

We can break down the effect of this change (modulo bugs) like so:

  1. If an atom and residue is in amber14-all.xml, behavior is unchanged. Otherwise:
  2. If an atom and residue is in the CCD and the PDB file is consistent with that definition and the atom is charged, this change corrects solvent neutralisation.
  3. If a residue is in the CCD, and the PDB file re-uses identical residue and atom names for the entire residue, and that residue is neutral in the PDB but has a formal charge in the CCD, a regression is introduced. I expect this is very rare.
  4. Otherwise, behavior is unchanged.

I think 2. should be much more common than 3., so this is a step in the right direction. In particular, in the very common case that PDBFixer is used to protonate a residue (with addMissingHydrogens()), this change guarantees that the protonation state will be consistent with the assumed charge - either both will be taken from the CCD, or the assumed charge will be inferred from the protonation state by the force field.

I think something like an offsetCharge argument to addSolvent() and addMembrane() which would allow a user to correct any remaining errors in neutralising charge would be great, but this requires changes to Modeller and so is not included in this PR.

Yoshanuikabundi commented 3 months ago

Sorry for the delay in getting back to you. Your implementation looks good, and as you say, it's probably better than not doing it.

Yay!

I'm still a little concerned about residues that can appear in multiple forms. Take a look at the CCD definition of HIS, for example. (I know it's a standard residue, and therefore this code won't be used for it. I'm just using it as an example of what we can probably expect to find in nonstandard residues too.) It can be positive, negative, or neutral depending on which hydrogens are present. The CCD entry describes the positively charged, doubly protonated form. But that isn't the most common form at pH 7.

An example of this is the CCD definition of CIT, citric acid. At pH 7, citric acid is most commonly triply deprotonated, but the CCD entry is neutral. I agree it would be nice to handle this case more carefully, both here and in the protonation code.

I can imagine some simple algorithm along the lines of "add 1 formal charge to any atom that has an extra hydrogen bonded to it relative to the CCD, and remove 1 formal charge from any atom that is missing a hydrogen" which might improve the behavior in this PR for "nonstandard" protonation states. This would have pathological behavior on a PDB file that is missing hydrogens though - carbanions aplenty! And if _createForcefield(water=True) is used for any process that commonly happens before protonation this could be very bad. So we'd want some further checks, like only applying this algorithm to certain elements or skipping it if the entire residue, molecule, chain, or file is missing hydrogens. But it's difficult to see how to differentiate in general between PDB files where a proton is missing and PDB files representing chemicals that lack that proton. For instance, we can contrive a case like fluoroformic acid, FCOOH, in which the deprotonated form FCOO- is indistinguishable in a OpenMM topology from the protonated form without hydrogens. We might be able to use the PDB charge column for this?

In general, I don't think this should hold up this PR, because the PR will only introduce regressions if the current behavior was correct by luck. I think detecting alternative forms is a substantial and orthogonal design problem.

EDIT: And actually, even then there's no regression, as the set of atom names will differ so the formal charges won't be applied.

On the positive side, if PDBFixer just added the hydrogens based on the CCD entry, that's the form it will use. So this does ensure that if PDBFixer adds both hydrogens and solvent, they'll be consistent with each other.

This'll need to be factored in if the hydrogen adding system becomes pH aware for CCD sourced residues - the same mechanism will want to be applied here.

The other thing I'm a bit unsure about is the handling of leaving atoms. You require that no leaving atoms be present. If any of them are, it won't match. Would it be better to make it match whether or not the leaving atoms are present?

Yeah I think I misunderstood what the point of leaving atoms were - they're for defining polymer linkages, but I thought they were about alternate protonation states. I'll fix this!

peastman commented 3 months ago

The test case is failing:

>       assert soluteCharge == numCl - numNa
E       assert -21 == (45 - 67)
peastman commented 3 months ago

I think you're right: this won't always get the right answer, but it's more likely to get the right answer than the current code, so it's still an improvement.

I think it would be worth also allowing the user to pass a ForceField to addSolvent(), which would be used instead of creating one based on templates. It's hardly any extra code, and it would allow users with custom residues to make sure they're getting the right charge based on the actual force field they'll be using to simulate it. If you prefer though, we can do that in a separate PR.

peastman commented 3 months ago

Just checking in on this so it doesn't get forgotten. The test case is still failing.

Yoshanuikabundi commented 2 months ago

Sorry Peter, I've been distracted! The failing test was a timeout from code I didn't touch downloading PDB files from the PDB. Hopefully it'll succeed this time - I think this one's good to go.

peastman commented 2 months ago

Just checking in again. :) See the one comment above, where if it can't download one definition, it won't return any others either. I think that line needs to be changed?

Yoshanuikabundi commented 2 months ago

Sorry I've been off sick - I agree, and have made that change. Thanks for being so patient!

peastman commented 2 months ago

Thanks, that looks good. I'm sorry you've been sick!