shirtsgroup / InterMol

Conversion tool for molecular simulations
MIT License
185 stars 53 forks source link

Desmond to gromacs conversion failing to parse quoted string #337

Open sup27606 opened 7 years ago

sup27606 commented 7 years ago

Desmond to gromacs conversion is failing with the following error: Traceback (most recent call last): File "/Users/supriyo/data/PostDoc Stuff/codes/InterMol/intermol/convert.py", line 811, in main() File "/Users/supriyo/data/PostDoc Stuff/codes/InterMol/intermol/convert.py", line 226, in main system, prefix = _load_desmond(cms_in=cms_in) File "/Users/supriyo/data/PostDoc Stuff/codes/InterMol/intermol/convert.py", line 589, in _load_desmond system = des.load(cms_file=cms_in) File "/Library/Python/2.7/site-packages/intermol/desmond/desmond_parser.py", line 39, in load return parser.read() File "/Library/Python/2.7/site-packages/intermol/desmond/desmond_parser.py", line 1141, in read self.load_ffio_block(molname, self.ffio_blockpos[i], self.fmct_blockpos[i+1]-1) File "/Library/Python/2.7/site-packages/intermol/desmond/desmond_parser.py", line 903, in load_ffio_block self.sysDirectivetype File "/Library/Python/2.7/site-packages/intermol/desmond/desmond_parser.py", line 698, in parse_dihedrals kwds = [float(x) for x in split[6:14]] ValueError: could not convert string to float: "B5SP"

More precisely, it is failing at the following lines in the cms file:

grep B5SP cbFDH_formate_vel5_2-out.cms 9 11294 11297 11298 11299 Proper_Trig 0.000000 1.300000 "B5SP" -0.600000 0.650000 0.050000 0.000000 0.000000 0.000000 101 11317 11320 11321 11322 Proper_Trig 0.000000 1.300000 "B5SP" -0.600000 0.650000 0.050000 0.000000 0.000000 0.000000 199 11369 11372 11373 11374 Proper_Trig 0.000000 1.300000 "B5SP" -0.600000 0.650000 0.050000 0.000000 0.000000 0.000000 291 11392 11395 11396 11397 Proper_Trig 0.000000 1.300000 "B5SP" -0.600000 0.650000 0.050000 0.000000 0.000000 0.000000

The cms file contains strings with quotations within the dihedral section. The parse_dihedral module in desmond_parser.py is trying to interpret it as float ( kwds = [float(x) for x in split[6:14]] )

The command used is: python convert.py --des_in cbFDH_formate_vel5_2-out.cms --gromacs --oname test

cbFDH_formate_vel5_2-out.cms.zip

sup27606 commented 7 years ago

First of all, thanks very much for your effort in making the interconversion between MD formats easier, and sharing your work with the community.

I fixed the parse_dihedral() errors by editing desmond_parser.py, but now there are more errors. Failing in parse_torsion_torsion(), because the format of torsion ffio block is different in the 2016 version of Desmond cms than what the program expects. In the sample cms file from your distro, the torsion ffio line looks like this: 1 3 15 16 17 15 16 17 25 cmap 1 In my cms file, it is this: 1 2 7 8 9 7 8 9 24 1 cmap

This is a simple fix for now, but not sure how many other changes have to be made. The problem is, these errors don't occur immediately when the script starts running.

I am not an expert in the cms format or python. However, in hindsight, the problem is, you are hardcoding all your array variables, how they index the ffio block parameters. However, cms (as far as I understand) is a flexible format. What you should really do is read the mapping of the cms block from the start of a ffio block, for example: ffio_torsion_torsion[722] { i_ffio_ai i_ffio_aj i_ffio_ak i_ffio_al i_ffio_am i_ffio_an i_ffio_ao i_ffio_ap i_ffio_c1 s_ffio_funct

and map your array variables accordingly. That way, the code won't break everytime the ordering of parameters in the ffio block changes.

Do you have any updated version of your software that takes care of the problems I am facing? Thank you.

ctk3b commented 7 years ago

Thank you @sup27606 for digging into this. What you are suggesting sounds like a very reasonable (and more correct) way to handle these ffio blocks.

Just to confirm, do these changes fix the problem you were having or is there still an energy discrepancy?

sup27606 commented 7 years ago

Hi Christoph, The conversion worked without further errors and the desmond energies are being written, but the gromacs energy is failing. It seems gmx is finding some problems with the gro file. Here is the output from grompp:

Program gmx grompp, VERSION 5.1 Source code file: /projects/Biogroup/GROMACS/gromacs-5.1.0-desktop/gromacs-5.1/src/gromacs/fileio/confio.c, line: 1048

Fatal error: Something is wrong in the coordinate formatting of file test.gro. Note that gro is fixed format (see the manual) For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors

I opened the gro file in VMD and it seems like some of the hydrogen coordinates look wrong (screenshot attached). Also, the x,y,z coordinates in the test.gro file has 12 decimal points, whereas my other gro files all have 3 decimal points. Could this be the issue, since gro is supposed to be a fixed format according to online resources? http://manual.gromacs.org/online/gro.html

Thanks. Supriyo screen shot 2017-05-31 at 2 35 29 pm test.gro.zip

ctk3b commented 7 years ago

Gro is a fixed file format but you can change the precision as stated in the link you provided:

Optionally (for now only yet with trjconv) you can write gro files with any number of decimal places, the format will then be n+5 positions with n decimal places (n+1 for velocities) in stead of 8 with 3 (with 4 for velocities). Upon reading, the precision will be inferred from the distance between the decimal points (which will be n+5)

Coordinate conversion errors have been an extremely rare issue though so this is interesting. I won't have time to dig deeper until the weekend. @mrshirts have you seen this before? Any immediate red flags?

sup27606 commented 7 years ago

@ctk3b, I think I know what's the problem, but I may be wrong. The flexible format applies only to the coordinates and velocities, but the first four records, ie. residue number, residue name, atom name, atom number are still fixed format. 5 positions each for these four records. Here is a small portion of the gro file produced by the program:

279TYR H9999 9999 -0.715489500000 0.923163500000 3.138024100000 279TYR H1000010000 -0.775758900000 1.171353300000 3.472607400000 279TYR H1000110001 -0.765668700000 0.715371900000 3.281781800000

The first line conforms to the gro format. However, in the second and third line, atom name exceeds 5 positions (H10000, H10001). If you rename the atoms such that the length of the atom name does not exceed 5 characters, I think it would solve the problem.

swails commented 7 years ago

The first line conforms to the gro format. However, in the second and third line, atom name exceeds 5 positions (H10000, H10001). If you rename the atoms such that the length of the atom name does not exceed 5 characters, I think it would solve the problem.

That seems to be the problem. FWIW, here's how ParmEd handles this

sup27606 commented 7 years ago

@swails, thanks for the pointer. In the grofile_parser.py, it is doing something like that, i.e. limiting the field lengths to 5 characters, but somewhere there may be a bug. I don't have enough python knowledge to figure out. Here is the snippet where it seems to be doing that.

swails commented 7 years ago

Here is the snippet where it seems to be doing that.

That will work for atom numbers, but not residue numbers.

Observe:

>>> len('{:5d}'.format(1))
5
>>> len('{:5d}'.format(1000000))
7

{:5d} just means that the resulting field will always have at least 5 characters (with leading spaces filling out any that are not used for numbers). However, unlike Fortran (which will not allow a field to be overfilled, and will fill a field with ***'s instead if a number is too large to fit in a given field), Python will simply use as many characters as it needs.

The way to fix that snippet is to replace

                gro.write('{0:5d}{1:<5s}{2:5s}{3:5d}'.format(
                        atom.residue_index, atom.residue_name, atom.name, (n + 1)%100000))

with

                gro.write('{0:5d}{1:<5s}{2:5s}{3:5d}'.format(
                        atom.residue_index%100000, atom.residue_name, atom.name, (n + 1)%100000))
mrshirts commented 7 years ago

Hi, all-

Thanks for the work on this -- I have some deadlines, but will work on this more tonight. We generally try to use the flexible formatting of the CMS, but may have missed it in one point. Cmap particularly was kludged in, so I'm not surprised there is an issue there (Desmond keeps changing how it handles things).

mrshirts commented 7 years ago

Hi, sup27606-

I see the issues here, but other than the .gro file issue, it might take a week or so to carve out the time to fix it right in the general case. For your particular problem, do you have enough information to fix the installation for the particular .cms version you have? Thanks for the help in identifying issues for the general case.

sup27606 commented 7 years ago

Hi @ctk3b , @mrshirts , The fix suggested by @swails corrected for the atom indices, but the atom names still caused problems, whenever they were longer than 5 characters. I replaced the relevant line with this:

gro.write('{0:5d}{1:<5s}{2:5s}{3:5d}'.format( atom.residue_index%100000, atom.residue_name[0:5], atom.name[0:5], (n + 1)%100000))

It is running now and producing energy output. There are some differences in energy between desmond vs gromacs. Are these differences reasonable? I am using the default configs from Intermol. Thanks.

Energy group summary

            type     input(desmond)  output (gromacs)    diff (gromacs)

Not comparable energies: are likely not to be the same

    coulomb (LR)         0.00000000     9594.75683594     9594.75683594
    coulomb (SR)         0.00000000 -1132896.87500000 -1132896.87500000
   coulomb total   -701198.87919120  -987112.35253906  -285913.47334786
      coulomb-14    136189.76776880   136189.76562500       -0.00214380
        improper      1503.64665220     1503.66784668        0.02119448
          proper     16113.43920960    15793.74607849     -319.69313111
    urey-bradley         0.00000000    11922.68359375    11922.68359375
        vdw (LR)         0.00000000    -7635.53271484    -7635.53271484
        vdw (SR)         0.00000000    62529.02343750    62529.02343750
       vdw total     70507.26184368    62872.06787109    -7635.19397259
          vdw-14      7978.54248528     7978.57714844        0.03466316

Comparable energy terms: these should be very close

           angle     23509.59789000    26668.10546875     3158.50757875
            bond     12581.90229488     9423.29199219    -3158.61030269
          bonded     53708.58604668    53388.81138611     -319.77466057
        dihedral     17617.08586180    17297.41392517     -319.67193663
       nonbonded   -630691.61734752  -924240.28466797  -293548.66732045
       potential   -875056.34542400  -870851.50000000     4204.84542400

---------------- Total Potential Energy Comparison -------------------- Input desmond potential energy: -875056.34542400 Difference in potential energy from desmond=>gromacs conversion: 4204.84542400

mrshirts commented 7 years ago

However, thought several of these are issues we fixed on a commit on March 13th. When was the version that you are using downloaded?

I have some code fixes for the other issues, but am still debugging them - hopefully can get a pull request in the next few days.

sup27606 commented 7 years ago

Thanks for the assessment. Although the relative error in the total potential is small, it is about one order of magnitude greater than the error in dihedral. So there are some differences in the nonbond besides the dihedral that are contributing to this difference in total potential. Is this a systemic error or a random error that is expected because of truncation of decimal places (angstrom to nm)?

I downloaded the code last week from github.

mrshirts commented 7 years ago

Is this a systemic error or a random error that is expected because of truncation of decimal places (angstrom to nm)?

Because the nonbonded cutoffs are practically impossible to match up exactly. if the LJ and Coul 1-4 interactions are this close, it means the translation of nonbondeds is correct - but you can't match the treatment of the long-range interactions precisely between programs.

I downloaded the code last week from github.

Hmm, let me keep looking at this.

sup27606 commented 7 years ago

Thank you very much. I plan to use the converted trajectories mainly for short range protein-ligand interaction energy. That way, the energy values should be more comparable.