rinikerlab / reeds

This pipeline is executing a RE-EDS run for relative free energy calculations. It can be used for example for calculation of hydration free energies or ligand binding free energies.
https://rinikerlab.github.io/reeds/
MIT License
30 stars 8 forks source link

new repdat + imd format #47

Closed SalomeRonja closed 1 year ago

SalomeRonja commented 3 years ago

In the new parallel GromosXX RE-EDS code, the format of the repdat output file changed slightly. When a new repdat is parsed with the pygromos file parser, I get the following error message:

Traceback (most recent call last):
  File "job_analysis.py", line 76, in <module>
    do_Reeds_analysis(in_folder=in_folder, out_folder=out_folder, gromos_path=gromos_path, topology=topology, in_ene_ana_lib=in_ene_ana_lib, in_imd=in_imd, optimized_eds_state_folder=optimized_eds_state_folder, state_undersampling_occurrence_potential_threshold=state_undersampling_occurrence_potential_threshold, state_physical_occurrence_potential_threshold=state_physical_occurrence_potential_threshold, undersampling_frac_thresh=undersampling_frac_thresh, grom_file_prefix=grom_file_prefix, title_prefix=title_prefix, n_processors=n_processors, verbose=verbose, control_dict=control_dict, )
  File "/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/pipeline/worker_scripts/analysis_workers/RE_EDS_general_analysis.py", line 303, in do_Reeds_analysis
    boundary_conditions=boundary_conditions)
  File "/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/file_management/file_management.py", line 1266, in reeds_project_concatenation
    repdat_file_paths=repdat_file_paths, verbose=verbose)
  File "/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/file_management/file_management.py", line 208, in thread_worker_concat_repdat
    repdat_file = repdat.Repdat(repdat_file_paths.pop(0))  # repdat Class
  File "/cluster/work/igc/wsalome/code/reeds/reeds/submodules/pygromos/pygromos/files/repdat.py", line 48, in __init__
    system, data = parser.read_repdat(input_path, Vj_header=True)
  File "/cluster/work/igc/wsalome/code/reeds/reeds/submodules/pygromos/pygromos/files/_basics/parser.py", line 608, in read_repdat
    df = df.astype({'ID': 'int32', 'partner': 'int32', 'run': 'int32','s': 'int32'})
  File "/cluster/work/igc/wsalome/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/generic.py", line 5526, in astype
    "Only a column name can be used for the "
KeyError: 'Only a column name can be used for the key in a dtype mappings argument.'

The issue can be solved by doing the renaming df.rename(columns={"coord_ID": "ID"}, inplace=True).

What I don't understand is line df.rename(columns={"exch": "s"}, inplace=True). In the old repdat format, there is a column called exch, which contains booleans that encode whether the exchange between two replicas was accepted or rejected. The same column is still present in the new format, and it doesn't correspond to the s-values. @SchroederB do you know what the idea behind this is?

SalomeRonja commented 3 years ago

Hmm, after a closer look, it should probably be df.rename(columns={"start": "ID"}, inplace=True)instead of my original suggestion df.rename(columns={"coord_ID": "ID"}, inplace=True)

SalomeRonja commented 3 years ago

There is also an issue with the indices: in the previous repdat format, all indices started at 1, which is taken into account e.g. in the transition_dict member function _caculate_transition_traces of the Repdat class. In the new repdat format, on the other hand, the indices start at 0, so e.g. transition_dict[int(row.ID)]) leads to a KeyError for row.ID=0. Mŷ solution for now is to just increase the corresponding indices in the new repdat file during the parsing:

if min(df.ID) == 0:
            for i in range(0, len(df.pos)):
                df.pos[i] = df.loc[i]['pos']+1
                df.ID[i] = df.loc[i]['ID']+1
                df.coord_ID[i] = df.loc[i]['coord_ID']+1
                df.partner[i] = df.loc[i]['partner']+1
                df.partner_start[i] = df.loc[i]['partner_start']+1
                df.partner_coord_ID[i] = df.loc[i]['partner_coord_ID']+1

But in the long run, when we move towards using only the new code, we should clean this up and adapt the repdat-dependent functions to the new format

SalomeRonja commented 3 years ago

adding to the TODO list: the function edit_REEDS in imd.py is currently not writing the energy offsets correctly for the new imd format- maybe also some of the other variables -> I'll need to check which edits work and fix the ones that don't work

SalomeRonja commented 3 years ago

can be fixed by changing elif(isinstance(reeds_block, blocks.REPLICA_EDS)): to elif(isinstance(reeds_block, blocks.NEW_REPLICA_EDS)): in edit_REEDS

I'll make a new branch with all these changes :)

SalomeRonja commented 3 years ago

Another thing to keep in mind is how to do the automatic adaptation of the imds for the new NEOFF parameter, which is not done at the moment. I'll add it to the new branch as well

SalomeRonja commented 3 years ago
SalomeRonja commented 3 years ago

I started a pull request in the pygromos submodule which should fix these issues: https://github.com/rinikerlab/PyGromosTools/pull/79

I'd still like to keep the issue open, since we should adapt to the new repdat format completely, when we stop using the old gromosXX code

RiesBen commented 3 years ago

was merged :)

RiesBen commented 3 years ago

sorry just read the last sentence of your post

SalomeRonja commented 3 years ago

Yeah, it's low priority, but now that we're making the switch to the new code/repdat format, I think it would be nicer if we avoided the work of increasing all the replica indices in the repdat, and directly used the indices starting at 0 instead of 1 for the whole s-optimization analysis.

candidechamp commented 2 years ago

@SalomeRonja Do you still want to keep this issue open? I think everyone has been using the new gromosXX code and repdat file fomat/parser now and it's been working well.

I'm not sure if the indexing change you mention was done or not though.