michellab / Sire

Sire Molecular Simulations Framework
http://siremol.org
GNU General Public License v3.0
95 stars 26 forks source link

Atomtypes with the sames name but different parameters are not preserved during merge #382

Closed msuruzhon closed 2 years ago

msuruzhon commented 2 years ago

Hi,

I have discovered an issue, where if one tries to merge two molecules which have common atomtype names but different associated parameters, the parameters of the second molecule effectively get lost during writing out.

For example, let the first atomtype be as follows (as taken from a GROMACS top file):

H           1    1.007940    0.000000       A    0.500000    0.400000

and the second atomtype be as follows:

H           1    1.007940    0.000000       A    0.300000    0.600000

The resulting merged topology only has the following definitions:

     H           1    1.007940    0.000000       A    0.500000    0.400000
  H_du           0    0.000000    0.000000       A    0.000000    0.000000

Files to reproduce this behaviour can be found here.

This is a problem that shouldn't occur when one uses GAFF, because of the strict atom type names. However, OpenFF doesn't have canonical atom type names, so an H1 in one molecule will not in general be the same as an H1 in another molecule. I think we need to figure out how to uniquify the atom types of the two molecules if they have different parameters but the same name.

lohedges commented 2 years ago

Thanks for this. Ironically I've just stumbled across this issue myself. In addition, there appear to be a few other wrinkles when using OpenFF, e.g. molecules being returned with no ligand name when using SMILES, which would break writing of GROMACS topology files. (See here for details.

As you say, we'll need some tricks to uniquify the atom types to ensure that all atom definitions appear in the topology. I'll look into this when I get the chance. It is probably easiest to add this as a hard requirement of the merging code itself, rather than tweaking things when topology files are written. I guess we might need to be careful if specify atom names have special meanings for an engine, but this is generally been okay in practice. (Apart from naming of water molecules and dummy atoms, in my experience.)

Cheers.

lohedges commented 2 years ago

Having looked at the Sire.IO.GroTop parser I think there actually might be a bug in the way that the [ atomtypes ] record is created in that it only considers the types from the lambda=0 state. (This is irrespective of types with the same name having different parameters.) This won't be a problem when types in the lambda = 1 state also exist at lambda = 0, but would if when the lambda = 1 state contains unique types. I'll check this in more detail and update when I've figured things out. (We probably need to combine atomtypes in both end states to write the record.)

lohedges commented 2 years ago

No, I was wrong, we do inspect both end states. This is just an issue of dealing with atom types with the same name, but different parameters, which should be possible to solve in the merge function.

lohedges commented 2 years ago

I have a strategy to uniquify the atomtype properties, which should hopefully work for GROMACS. Just to check, does this cause issues with the AMBER FEP setup too? If so, I might need to think about uniquifying the ambertype property too.

chryswoods commented 2 years ago

I think that ambertype should be unique, as there is more agreement what those types mean (they are defined in the gaff or FFXX forcefields, and have unique names). Gromacs treats atom types differently, if I remember correctly.

msuruzhon commented 2 years ago

So far I can't reproduce this issue using only AMBER so we might be good for now. I will let you know if I find anything interesting.

lohedges commented 2 years ago

Sorry, I know that ambertype will be unique if the molecule is parameterised using one of the AMBER force fields. The issue is that this won't be the case if using OpenFF, so you'll get duplicate types with different parameters during a merge. This isn't an issue for SOMD (at least, I don't think it is) due to the way the perturbation is set up, but causes problems for GROMACS since the [ atomtypes ] record is generated from the atomtype property of the two end states, with no reference to whether the parameters, e.g. LJ, actually differ. I've fixed this by uniquifying the atomtypes on merge, but I'm not sure if I also need to do the same for the ambertypes, in case this causes issues in the way the AMBER FEP topology is set up.

lohedges commented 2 years ago

Great, thanks for confirming. I'll push a fix once I've had a chance to test a little further. At the moment it just adds an x to the atomtype property if a duplicate is found. I only check for duplicates twice, since (presumably) there would only be one duplicate of a given type, i.e. because we are merging two molecules. (It shouldn't be possible to have a duplicate within an end state.)

msuruzhon commented 2 years ago

(It shouldn't be possible to have a duplicate within an end state.)

Ah, but that's another potential issue - OpenFF atom type names clash in general with protein atom type names so if both of them are combined, this will be the case. Or are these automatically renamed when two systems are combined?

lohedges commented 2 years ago

Oh dear, that's a whole other can of worms. If this is the case, then it isn't a problem specific to merging, rather combining OpenFF parametrised molecules in any way and writing to file. As you suggest, unique types will either need to be enforced in the system combining functions, or when writing to file within the topology parsers. If this is a GROMACS specific issue, then it is probably best resolved in the Sire.IO.GroTop parser itself.

lohedges commented 2 years ago

(It would be even nicer if OpenFF could just generate unique names in the first place :-) )

chryswoods commented 2 years ago

I agree it is a can of worms. The AmberPrm parser assumes that the names refer to unique atom types (%FLAG AMBER_ATOM_TYPE). However, the amber types are not actually used for anything in a prmtop file, so this doesn't matter (they would only be used by, e.g. antechamber, to assign LJ parameters and bond/angle/dihedral parameters). The actual LJ parameters are used to get unique LJ types (%FLAG ATOM_TYPE_INDEX) which index into the arithmetically combined parameters in %FLAG NONBONDED_PARM_INDEX. Similarly, the actual bond/angle/dihedral parameters are used to set the values in the prmtop file.

So, for AmberPrm it is an annoyance (different amber atom types will have different LJ parameters, which may be unexpected) but it is not breaking. For GroTop, it is an issue because this format requires that all atom types are consistent (have the same mass, vdw parameters, element etc). I think you are right that this needs fixing in the GroTop parser, which should detect if an atom type is not consistent, and then rename it as needed.

lohedges commented 2 years ago

This will actually need to be done at the System level, since the types need to be self consistent between the Gro87 and GroTop files and it would be a pain to implement the same logic in both. When I get a chance I'll work on writing a C++ wrapper to do this, which can then be called before writing to Gro87 or GroTop format.

What a pain!

(It was actually fairly straightforward to fix the [ atomtypes ] part, I just then realised that it was only one part of the problem.)

chryswoods commented 2 years ago

Painful. That will be really slow to do at the System level, because it is O(n^2) - every atom type in the added molecules will need checking against every atom type in the system. Adding molecules one by one, and performing the check each time, will be expensive.

msuruzhon commented 2 years ago

I wrote a similar function that fixes that in ParmEd and as long as you use sets, it is pretty fast even on protein-ligand systems, so I guess this shouldn't be that bad for Sire either? Usually there are not that many atom types anyway.

chryswoods commented 2 years ago

Yes, I agree it will be quick if we use sets, and only do this operation when we want to write the molecule to a file. Thanks :-)

lohedges commented 2 years ago

Actually, I think we might be okay with my original solution of simply fixing the [ atomtypes ] record in the topology file. I forgot that the Gro87 file just contains a list of named atoms by index, which are matched against the [ atoms ] section from the GroTop file in order. The entries in the Gro87 file give the atom names, not types, and it appears that it's fine to have duplicate names for different types. (At least grommp doesn't complain and the simulation runs without error.)

I'll try to think of a simple test to check that things are working as expected.

lohedges commented 2 years ago

I've just pushed a potential fix that renames types if the name matches an existing type but the parameters differ. This is done in the topology writer only, so the properties in the original system will remain unchanged. It resolves the issue for the minimal example that you've provided, but I'm not sure if there's anything else that I might have overlooked. If you could test it out when you get a chance, that would be great. (Perhaps there are other naming issues with OpenFF that we need to be careful with.)

To test, you should only need to update the Sire development package once the build has finished:

conda update -c conda-forge -c michellab/label/dev sire

(BioSimSpace remains unchanged, hence why I transferred the issue here.)

Cheers.

msuruzhon commented 2 years ago

Many thanks for that @lohedges , I will let you know if I find any other issues.

lohedges commented 2 years ago

I've just run a quick test combining two OpenFF molecules and it's not quite working as expected. (More atom types are generated than necessary.) Will try to figure out what I've done wrong and will push a fix.

lohedges commented 2 years ago

Okay, I think this is working now. It was more complicated than I thought since it wasn't just a case of creating a new type if there was an existing type with the same name and different parameters, i.e. an H1 in one molecule having different LJ parameters to an H1 in another. It turns out that an atom type in one molecule might actually match an existing type in the other, so there is no need to create a new type, just re-use the one that it maps to, e.g. just updating the type names. For instance, an H1 in one molecule might actually have the same parameters as, say, an H2 in the other.

I've also added a test in BioSimSpace that creates two OpenFF parameterised molecules, translates one of them so that they are sufficiently far apart so they don't interact, then creates a combined system. I then check that the sum of the single point energies for the two molecules in isolation is the same as that of the composite system. This fails prior to my fix (one molecule has incorrect parameters for one of the atom types), but works following it.