Closed Yoshanuikabundi closed 2 months ago
Sorry for the delay in getting back to you. Your implementation looks good, and as you say, it's probably better than not doing it.
Yay!
I'm still a little concerned about residues that can appear in multiple forms. Take a look at the CCD definition of HIS, for example. (I know it's a standard residue, and therefore this code won't be used for it. I'm just using it as an example of what we can probably expect to find in nonstandard residues too.) It can be positive, negative, or neutral depending on which hydrogens are present. The CCD entry describes the positively charged, doubly protonated form. But that isn't the most common form at pH 7.
An example of this is the CCD definition of CIT, citric acid. At pH 7, citric acid is most commonly triply deprotonated, but the CCD entry is neutral. I agree it would be nice to handle this case more carefully, both here and in the protonation code.
I can imagine some simple algorithm along the lines of "add 1 formal charge to any atom that has an extra hydrogen bonded to it relative to the CCD, and remove 1 formal charge from any atom that is missing a hydrogen" which might improve the behavior in this PR for "nonstandard" protonation states. This would have pathological behavior on a PDB file that is missing hydrogens though - carbanions aplenty! And if _createForcefield(water=True)
is used for any process that commonly happens before protonation this could be very bad. So we'd want some further checks, like only applying this algorithm to certain elements or skipping it if the entire residue, molecule, chain, or file is missing hydrogens. But it's difficult to see how to differentiate in general between PDB files where a proton is missing and PDB files representing chemicals that lack that proton. For instance, we can contrive a case like fluoroformic acid, FCOOH, in which the deprotonated form FCOO- is indistinguishable in a OpenMM topology from the protonated form without hydrogens. We might be able to use the PDB charge column for this?
In general, I don't think this should hold up this PR, because the PR will only introduce regressions if the current behavior was correct by luck. I think detecting alternative forms is a substantial and orthogonal design problem.
EDIT: And actually, even then there's no regression, as the set of atom names will differ so the formal charges won't be applied.
On the positive side, if PDBFixer just added the hydrogens based on the CCD entry, that's the form it will use. So this does ensure that if PDBFixer adds both hydrogens and solvent, they'll be consistent with each other.
This'll need to be factored in if the hydrogen adding system becomes pH aware for CCD sourced residues - the same mechanism will want to be applied here.
The other thing I'm a bit unsure about is the handling of leaving atoms. You require that no leaving atoms be present. If any of them are, it won't match. Would it be better to make it match whether or not the leaving atoms are present?
Yeah I think I misunderstood what the point of leaving atoms were - they're for defining polymer linkages, but I thought they were about alternate protonation states. I'll fix this!
The test case is failing:
> assert soluteCharge == numCl - numNa
E assert -21 == (45 - 67)
I think you're right: this won't always get the right answer, but it's more likely to get the right answer than the current code, so it's still an improvement.
I think it would be worth also allowing the user to pass a ForceField to addSolvent()
, which would be used instead of creating one based on templates. It's hardly any extra code, and it would allow users with custom residues to make sure they're getting the right charge based on the actual force field they'll be using to simulate it. If you prefer though, we can do that in a separate PR.
Just checking in on this so it doesn't get forgotten. The test case is still failing.
Sorry Peter, I've been distracted! The failing test was a timeout from code I didn't touch downloading PDB files from the PDB. Hopefully it'll succeed this time - I think this one's good to go.
Just checking in again. :) See the one comment above, where if it can't download one definition, it won't return any others either. I think that line needs to be changed?
Sorry I've been off sick - I agree, and have made that change. Thanks for being so patient!
Thanks, that looks good. I'm sorry you've been sick!
This PR addresses #299 by:
_downloadCCDDefinition()
that parses and caches the definition._downloadFormalCharges()
which gets the formal charges of the atoms in a residue from the CCD_downloadFormalCharges()
in_createForceField(water=True)
to provide charges only when the residue is missing from the base force field and the (non-leaving) atom names in the CCD exactly match the atom names in the PDBThe refactor avoids re-downloading a CCD entry up to 3 times (once for protonation, once for missing heavy atoms, and a third time for formal charge) and reduces duplicate code. However, it is not essential for the other functionality and can be removed or split into its own PR if desired.
This solution works around the issues discussed in #299 where CCD residues tend to hard-code a single protonation state by only using the CCD formal charge when the force field does not provide the residue.
My reasoning is that there are two cases: first, when the force field includes a residue, it is capable of identifying the protonation state and gives the correct net charge and so it can only introduce regressions to interfere with this. In the second case, when a residue is introduced to the force field by
_createForceField(water=True)
, the atom's partial charge is currently hard-coded to 0 and so using more information to sometimes set that to better values is appropriate.We can break down the effect of this change (modulo bugs) like so:
amber14-all.xml
, behavior is unchanged. Otherwise:I think 2. should be much more common than 3., so this is a step in the right direction. In particular, in the very common case that PDBFixer is used to protonate a residue (with
addMissingHydrogens()
), this change guarantees that the protonation state will be consistent with the assumed charge - either both will be taken from the CCD, or the assumed charge will be inferred from the protonation state by the force field.I think something like an
offsetCharge
argument toaddSolvent()
andaddMembrane()
which would allow a user to correct any remaining errors in neutralising charge would be great, but this requires changes toModeller
and so is not included in this PR.