michellab / BioSimSpace

Code and resources for the EPSRC BioSimSpace project.
https://biosimspace.org
GNU General Public License v3.0
77 stars 19 forks source link

Failed to save system to format: 'PRM7' #251

Closed kexul closed 2 years ago

kexul commented 2 years ago

Hi there, this is an edge case that could be read by BioSimSpace but could not be converted to amber format:

mol = BSS.IO.readMolecules(['ligand.gro', 'ligand.top'])
BSS.IO.saveMolecules('ligand',mol, 'prm7')

gmx-amb.zip

lohedges commented 2 years ago

It looks like something funny has happened with the water topology detection code, since it is stating that all molecules in the system are waters, even though they aren't. This is then crashing when converting to AMBER format, where the water topology is swapped.

mol = BSS.IO.readMolecules(['ligand.gro', 'ligand.top'])
print(mol, mol[0])
<BioSimSpace.System: nMolecules=653> <BioSimSpace.Molecule: nAtoms=27, nResidues=1>

waters = mol.getWaterMolecules()
print(waters, waters[0].isWater())
<BioSimSpace.Molecules: nMolecules=653> False

I'll try to figure out what's going wrong.

lohedges commented 2 years ago

It looks like the topology file is wrong and all of the elements in the system are being loaded as dummy atoms for some reason. This is somehow breaking our water detection code. (I should fix this part, at least.)

mol = BSS.IO.readMolecules(['ligand.gro', 'ligand.top'])

print(mol[0]._sire_object.property("element"))
AtomProperty<SireMol::Element>( [ dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0) ] )

print(mol[1]._sire_object.property("element"))
AtomProperty<SireMol::Element>( [ dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0) ] )
kexul commented 2 years ago

Thanks, the files were taken from a Gromacs simulation tutorial, but I could not find its link right now 😥

lohedges commented 2 years ago

Okay, I've pushed a fix to Sire that correctly accounts for the presence of dummy atoms when detecting water molecules, i.e. so a molecule that is entirely made of dummies isn't erroneously marked as a water. This means that you can now convert your file to AMBER format. However, whether this is a valid file is another matter.

Let me know if you manage to track down the source of the GROMACS topology. If it is valid and we should be able to read it correctly, i.e. with fully element information, let me know and I'll try to work out what's gone wrong.

Cheers.

kexul commented 2 years ago

Thanks for the quick fix. I've checked the files' header and it seems to be generated by acpype. I just tested some other molecules generate by acpype in gromacs format,all of them failed to convert. I'm sure they are valid, we've run simulations with that files without any problem.

lohedges commented 2 years ago

I think that this is likely the issue:

[ atomtypes ]
;name   bond_type     mass     charge   ptype   sigma         epsilon       Amb
 c3       c3          0.00000  0.00000   A     3.39967e-01   4.57730e-01 ; 1.91  0.1094
 c        c           0.00000  0.00000   A     3.39967e-01   3.59824e-01 ; 1.91  0.0860
 o        o           0.00000  0.00000   A     2.95992e-01   8.78640e-01 ; 1.66  0.2100
 hc       hc          0.00000  0.00000   A     2.64953e-01   6.56888e-02 ; 1.49  0.0157
 OW       OW          0.00000  0.00000   A     3.15075e-01   6.35968e-01 ; 1.77  0.1520
 HW       HW          0.00000  0.00000   A     0.00000e+00   0.00000e+00 ; 0.00  0.0000

[ moleculetype ]
;name            nrexcl
 ligand          3

[ atoms ]
;   nr  type  resi  res  atom  cgnr     charge      mass       ; qtot   bond_type
    ¦1   c3     1   HOS    C1    1     0.111106     12.01000 ; qtot 0.111
    ¦2    c     1   HOS    C2    2     0.609212     12.01000 ; qtot 0.720
    ¦3   c3     1   HOS    C3    3    -0.480430     12.01000 ; qtot 0.240
    ¦4   c3     1   HOS    C4    4     0.089598     12.01000 ; qtot 0.329
    ¦5   c3     1   HOS    C5    5    -0.176806     12.01000 ; qtot 0.153
    ¦6   c3     1   HOS    C6    6    -0.108588     12.01000 ; qtot 0.044
    ¦7   c3     1   HOS    C7    7     0.441836     12.01000 ; qtot 0.486
    ¦8   c3     1   HOS    C8    8    -0.496402     12.01000 ; qtot -0.010
    ¦9   c3     1   HOS    C9    9    -0.496402     12.01000 ; qtot -0.507
    10   c3     1   HOS   C10   10    -0.470172     12.01000 ; qtot -0.977
    11    o     1   HOS    O1   11    -0.607595     16.00000 ; qtot -1.585
    12   hc     1   HOS    H1   12     0.148486      1.00800 ; qtot -1.436
    13   hc     1   HOS    H2   13     0.148486      1.00800 ; qtot -1.288
    14   hc     1   HOS    H3   14     0.009848      1.00800 ; qtot -1.278
    15   hc     1   HOS    H4   15     0.056431      1.00800 ; qtot -1.221
    16   hc     1   HOS    H5   16     0.056431      1.00800 ; qtot -1.165
    17   hc     1   HOS    H6   17     0.061127      1.00800 ; qtot -1.104
    18   hc     1   HOS    H7   18     0.061127      1.00800 ; qtot -1.043
    19   hc     1   HOS    H8   19     0.115452      1.00800 ; qtot -0.927
    20   hc     1   HOS    H9   20     0.115452      1.00800 ; qtot -0.812
    21   hc     1   HOS   H10   21     0.115452      1.00800 ; qtot -0.696
    22   hc     1   HOS   H11   22     0.116665      1.00800 ; qtot -0.580
    23   hc     1   HOS   H12   23     0.116665      1.00800 ; qtot -0.463
    24   hc     1   HOS   H13   24     0.116665      1.00800 ; qtot -0.346
    25   hc     1   HOS   H14   25     0.115452      1.00800 ; qtot -0.231
    26   hc     1   HOS   H15   26     0.115452      1.00800 ; qtot -0.115
    27   hc     1   HOS   H16   27     0.115452      1.00800 ; qtot -0.000

The mass and charge for all of the entries in the atomstypes record are zero. I imagine that Sire might be using this mass to infer the element, rather than using the value in the atoms section. I'll check when I get a chance.

lohedges commented 2 years ago

Nope, that's not it. I replace the values in the atomtypes section and it still loads everything as dummy atoms.

lohedges commented 2 years ago

This is where things are going wrong:

if (ok_elem)
{
    typ = GromacsAtomType(words[0], mass*g_per_mol, chg*mod_electron,
    ptyp, ::toLJParameter(v,w,combining_rule),
                          Element(nprotons));

  }
  else
  {
      //some gromacs files don't use 'nprotons', but instead give
      //a "bond_type" ambertype
      typ = GromacsAtomType(words[0], words[1], mass*g_per_mol, chg*mod_electron,
                            ptyp, ::toLJParameter(v,w,combining_rule));

}

For the acpype files we're falling into the else block, and all types are being flagged as dummies.

lohedges commented 2 years ago

Since the acpype files use bond_type, it looks like it is using the mass and charge from the atomtypes section to infer the element. I think I may have ran into Sire's file caching when I tried modifying it locally.

lohedges commented 2 years ago

It's not passing an element through to the constructor, so is using the default of dummy, i.e. SireMol::Element(0):

    GromacsAtomType(QString atom_type, QString bond_type,
                    SireUnits::Dimension::MolarMass mass,
                    SireUnits::Dimension::Charge charge,
                    PARTICLE_TYPE particle_type,
                    const LJParameter &ljparam,
                    const SireMol::Element &element = SireMol::Element(0));

It looks like any GROMACS topology in this format will have the same issue. This might be fine if you only intend to use the file for a GROMACS simulation, but might cause issues if converting to a different format that expects element information to be correct.

kexul commented 2 years ago

This might be fine if you only intend to use the file for a GROMACS simulation, but might cause issues if converting to a different format that expects element information to be correct.

Thanks for the detailed information, I'll keep that in mind. I just tried converting the format by parmed, and it works!

mol = BSS.IO.readMolecules(['ligand.gro', 'ligand.top'])
print(mol[0]._sire_object.property("element"))

mol_parmed = parmed.load_file('ligand.top', xyz='ligand.gro')
mol_parmed.save('amb.parm7')
mol_parmed.save('amb.rst7')
mol_amber = BSS.IO.readMolecules(['amb.parm7', 'amb.rst7'])
print(mol_amber[0]._sire_object.property("element"))

Output:

AtomProperty<SireMol::Element>( [ dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0) ] )
AtomProperty<SireMol::Element>( [ Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),Oxygen (O, 8),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1) ] )
lohedges commented 2 years ago

I'll take a look at how Parmed is inferring the element information. It might be one of those things that works if the atoms have standard GROMACS naming, but wouldn't work in general.

lohedges commented 2 years ago

Yes, ParmEd just naively guesses the element from the atom name. See here for details. In Sire we make no assumptions about the atom naming since it is irrelevant for the molecular potential. It also doesn't generalise since the file might have been converted from a different format (with its own naming convention) or customised by the user.

I'll have a think about how to warn the user if they read a GROMACS topology in this format. We might also be able to use a similar element_by_name function, but only in an optional way, i.e. if the user requests it. (We don't want to make mistakes by guessing things.)

lohedges commented 2 years ago

Actually, it looks like this might be a fallback after using the mass, which would be more robust. Could try that instead when I get a chance.

lohedges commented 2 years ago

Sire already has this functionality with Element::elementWithMass, but this wouldn't work for the acpype examples since all of the masses in the atomtypes record are zero.

kexul commented 2 years ago

Yes, ParmEd just naively guesses the element from the atom name. See here for details.

Hi @lohedges , I just tested anther OPLS type of gromacs topology file generated by acpype. Sire failed to read it.

Parmed correctly inferred the atom type, but there is no atomtype field in the topology file, maybe it's inferring from the atom field? test.zip

lohedges commented 2 years ago

I've pushed a workaround to Sire. This will infer the element from the mass if the topology file uses a bond type atom type, rather than providing the number of protons. If the mass in the atom record section doesn't match that of the atomtype, then it is re-inferred using this mass, which would give the correct elements for your acpype files.

kexul commented 2 years ago

Great!I'll test the updated version when it's available from conda.

kexul commented 2 years ago

Yes, everything worked like a charm! Many thanks for your work! Appreciated it!

kexul commented 2 years ago

Hi @lohedges , There is another edge case where BioSimSpace failed to identify the element type: ejm46.zip

Code to reproduce:

import BioSimSpace as BSS
import parmed 

mol = BSS.IO.readMolecules(['ejm46_AC.prmtop', 'ejm46_AC.inpcrd'])
print(mol[0]._sire_object.property("element"))

mol_parmed = parmed.load_file('ejm46_AC.prmtop', xyz='ejm46_AC.inpcrd')
elements = [atom.element_name for atom in mol_parmed]
print(elements)

Output:

AtomProperty<SireMol::Element>( [ dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),Carbon (C, 6),Carbon (C, 6),Nitrogen (N, 7),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),dummy (Xx, 0),Chlorine (Cl, 17),Carbon (C, 6),Carbon (C, 6),Carbon (C, 6),dummy (Xx, 0),Chlorine (Cl, 17),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1),Hydrogen (H, 1) ] )

['O', 'C', 'N', 'C', 'C', 'C', 'N', 'C', 'N', 'C', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'Cl', 'C', 'C', 'C', 'C', 'Cl', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H']

I'm not sure whether it's a valid Amber parameter file, it's the result files I got by this package: https://github.com/Acellera/parameterize, they seem to use some customized atom type.

lohedges commented 2 years ago

There is no %FLAG ATOMIC_NUMBER section in the topology file, so it looks like Sire is inferring the element (incorrectly) in some other way. I imagine that we could play the same trick as with the GROMACS topology and get the element using the %FLAG_MASS record instead. I'll try this when I get a chance.

kexul commented 2 years ago

There is no %FLAG ATOMIC_NUMBER section in the topology file

Hi @lohedges , there is a ATOMIC_NUMBER section, but with some -1 in it.

%FLAG ATOMIC_NUMBER                                                             
%FORMAT(10I8)                                                                   
      -1      -1      -1      -1       6       6       7      -1      -1      -1
      -1      -1      -1      -1      -1      -1      -1      17       6       6
       6      -1      17       1       1       1       1       1       1       1
       1       1       1       1       1       1

I imagine that we could play the same trick as with the GROMACS topology and get the element using the %FLAG_MASS record instead. I'll try this when I get a chance.

Thanks, I just find that I could use parmed to read and save it back, then the atomic_number section is corrected, and BioSimSpace could infer the element type correctly. 😄

lohedges commented 2 years ago

Ah, sorry, I must have grep'ed the wrong thing! Yes, if that record is present and the information is incorrect, then BioSImSpace will set things as a dummy. I can just catch all atomic numbers equal or than zero and use the mass to infer the element instead. (Was trying with zero, but that obviously doesn't work here.)

lohedges commented 2 years ago

There we go, all sorted using the same approach as used for the GROMACS issue. Keep your janky topology files coming ;-)

kexul commented 2 years ago

Wow, ultrafast!!! I'll try it soon and report. 😋

lohedges commented 2 years ago

The build failed since there are unit tests that compare a parsed AMBER system against the new and old topology parser. Looks like I need to fix this in the old parser too, which might not be as easy. Will let you know when it's sorted.

lohedges commented 2 years ago

Okay, I've updated it in the old parser too (it was equally easy). I've checked that tests pass locally so it should build this time around. (I always forget that there are two AMBER parsers, which isn't desirable from a maintenance perspective.)

lohedges commented 2 years ago

Actually, we might not be able to do this since it will break our ability to load perturbable systems written to file by BioSimSpace. (Elements writen as dummies will be read as something else.) I'm now wondering if the GROMACS fix might cause issues too, e.g. with the way in which you read split GROMACS files. I think this will need more testing since I don't want to break something fundamental to BioSimSpace for the purposes of handling some atypical files.

lohedges commented 2 years ago

For now I think I'll change the code to only apply the fix when the atomic number is -1. This will catch these weird files, but leave massed dummy atoms required by FEP simulations unchanged.

kexul commented 2 years ago

Actually, we might not be able to do this since it will break our ability to load perturbable systems written to file by BioSimSpace.

That sounds horrible, thankfully the unit test catches the case.

kexul commented 2 years ago

Keep your janky topology files coming ;-)

Hi @lohedges , here it is: test_files.zip which was generated by this package https://github.com/selimsami/qforce .

It fails because the angle parameter cannot be parsed correctly:

Cannot convert a Urey-Bradley Gromacs angle into an expression of only the angle size if the UB bond force constant is non-zero - GromacsAngle( functionType() = Urey-Bradley, parameters() = [ 123.521, 518, 0.24792, 37938.8 ] )

Loading by parmed is fine, but the dihedral parameters are not parsing correctly. After loading and saving by parmed, Sire complains about the dihedral parameter:

Cannot convert the 'improper angle' type Gromacs dihedral 'GromacsDihedral( functionType() = improper dihedral, parameters() = [ 180, 203.756 ] )' to an expression using only the torsion angle 'phi'

Here is the code I've used:

import BioSimSpace as BSS
import parmed 

mol = BSS.IO.readMolecules(['gas.gro', 'gas.top'])
print(mol[0]._sire_object.property("element"))

mol_parmed = parmed.load_file('gas.top', xyz='gas.gro')
elements = [atom.element_name for atom in mol_parmed]
print(elements)

print(mol_parmed.dihedrals)

mol_parmed.save('new.top', overwrite=True)
mol_new = BSS.IO.readMolecules(['new.top', 'gas.gro'])
print(mol_new[0]._sire_object.property("element"))

I've run short energy minimization and MD simulation with the following .mdp file by Gromacs, the output seems to be reasonable, so the topology file should be valid.

integrator               = steep
nsteps                   = 500
integrator               = md
nsteps                   = 10000
lohedges commented 2 years ago

Looking at the Sire code it looks like we don't support this dihedral type. We can parse the file, but can't convert this to an computer algebra expression internally:

/** Return this function converted to a SireCAS::Expression using the passed symbol
    to represent the torsion size */
SireCAS::Expression GromacsDihedral::toExpression(const SireCAS::Symbol &phi) const
{
    const double kj_per_mol = kJ_per_mol.value();
    const double deg = degree.value();

    if (func_type == 1 or func_type == 4 or func_type == 9)
    {
        //standard periodic / non-periodic proper / improper dihedral
        // k ( 1 + cos( n phi - phi_s ))
        const double k0 = k[1] * kj_per_mol;
        const double phi_s = k[0] * deg;
        const double n = k[2];

        return k0 * ( 1.0 + Cos( (n*phi) - phi_s ) );
    }
    else if (func_type == 2)
    {
        throw SireError::incompatible_error( QObject::tr(
            "Cannot convert the 'improper angle' type Gromacs dihedral '%1' to an expression "
            "using only the torsion angle 'phi'")
                .arg(this->toString()), CODELOC );
    }

...

I'll take a look at the GROMACS manual to read up on function type 2. I know we only have full support for AMBER style force fields, but can convert formats where possible, e.g. Ryckaert-Bellemans dhedrals.

kexul commented 2 years ago

The build failed since there are unit tests that compare a parsed AMBER system against the new and old topology parser.

Hi @lohedges , I just checked the files in conda: https://anaconda.org/michellab/sire/files, 企业微信截图_16387823171106

The latest update seems to be 5 days ago, are the recent updates of file parsing (for example https://github.com/michellab/Sire/commit/0122d3fdbf8da61a5967722ac7bf83f259abb1a9) available through conda installation?

lohedges commented 2 years ago

Not yet, I'm afraid. We have a problem with the Sire Azure build following an update to the macOS 10.15 image (10.14 is now deprecated). The Linux build should be unaffected so I can run that in isolation later today if I can't fix it otherwise.

lohedges commented 2 years ago

The latest GROMACS issue looks like a bug in the GROMACS topology parser. It is always trying to resolve the potential expressions in the same way, irrespective of the function type. There is specific code to deal with the different formats, but it isn't being invoked. I'll try to switch it to calling a different method depending on the function type to see what happens.

lohedges commented 2 years ago

I can fix the GROMACS dihedral issue, but it looks like we don't support Urey-Bradley type Angle functions if the UB force constant is non-zero:

    else if (func_type == 5)
    {
        //Urey-Bradley : 0.5 k_t (theta - theta0)^2 + 0.5 k_b ( r - r0 )^2
        const double k0 = k[1] * kj_per_mol_per_rad2;
        const double theta0 = k[0] * deg;
        const double kb = k[3] * kj_per_mol_per_nm2;
        //const double r0 = k[2] * nm;

        if (kb != 0)
            throw SireError::incompatible_error( QObject::tr(
                "Cannot convert a Urey-Bradley Gromacs angle into an expression of only "
                "the angle size if the UB bond force constant is non-zero - %1")
                    .arg(this->toString()), CODELOC );

        return 0.5 * k0 * SireMaths::pow_2(theta - theta0);
    }
jmichel80 commented 2 years ago

I think it's not mathematically possible to do an exact UB to harmonic potential conversion in general. Parmed probably silently fails to convert correctly.

lohedges commented 2 years ago

Yes, I agree. That said, I didn't think Parmed stored things internally as an expression, so there wouldn't be an issue if you didn't try to convert it to a different format, i.e. UB would be okay if you just read and write to GROMACS format, but would fail if you tried to convert to AMBER. I'll look at the Parmed output file to see if has indeed incorrectly converted the terms.

lohedges commented 2 years ago

Just to note that the errors (angle and dihedral terms) are both present for the original and ParmEd files. The reasons that you sometimes see one or the other is that the topology is parsed in parallel, so sometimes the angles trigger an exception before the dihedrals, and vice versa. As expected, it looks like Parmed is simply preserving the topology since it is being written back to the same format (no conversion going on). The only thing it seems to do is zero certain UB angle terms, depending on the values of the force constant, e.g.

# Original (ejm46_minimised_qforce.itp)
     6     5    25     5    121.784      281.620   0.21799        0.000

# Parmed (new.top)
     6      5     25     5   121.7840000 281.620000 0.0000000 0.000000
kexul commented 2 years ago

Thanks for the detailed information, very clear!