plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0

Multi-replica simulation with OpenMM-Plumed plugin #947

Open husseinmur opened 1 year ago

husseinmur commented 1 year ago

Hello,

I am trying to run a Metadynamics Metainference simulation with multiple replicas using OpenMM and PLUMED. I am using the OpenMM-Plumed plugin to add the PLUMED script to the force field.

Although I managed to run it for one replica, it seems like the plugin doesn't support multiple replicas right now. I modified its source code and passed an MPI communicator from Python to C++ through SWIG, then used plumed_cmd(plumedmain, "setMPIComm", &comm); to set the communicator. Now, the PLUMED output shows that it is running on multiple nodes (whatever number I choose after the -np flag when I run mpirun -np 4 python simulate.py), but the number of replicas shown in the output is still 1. I looked at the GROMACS patch and figured that I need to add another communicator; there should be inter and intra comms, is that right?

I passed another one and wrote the following piece of code to make use of the comms:

int done_already;
int intra_comm_rank;
MPI_Initialized(&done_already);
MPI_Comm intra_comm = force.getIntracom();
MPI_Comm inter_comm = force.getIntercom();
MPI_Comm_rank(intra_comm, &intra_comm_rank);
if (!done_already)
{
    MPI_Init(NULL, NULL);
}
if (intra_comm_rank == 0)
    plumed_cmd(plumedmain, "GREX setMPIIntercomm", &inter_comm);
plumed_cmd(plumedmain, "GREX setMPIIntracomm", &intra_comm);
plumed_cmd(plumedmain, "GREX init");

After the changes, the Python script hangs when I run it with MPI and doesn't show any output. However, if I include the GREX commands inside the if (!done_already) block, it shows the following error message:

Traceback (most recent call last):
  File "/mnt/f/CALVADOS/single_chain/ACTR/mpi/simulate.py", line 149, in <module>
    simulation.minimizeEnergy()
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/app/simulation.py", line 137, in minimizeEnergy
    mm.LocalEnergyMinimizer.minimize(self.context, tolerance, maxIterations)
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/openmm.py", line 17208, in minimize
    return _openmm.LocalEnergyMinimizer_minimize(context, tolerance, maxIterations)
openmm.OpenMMException:
+++ PLUMED error
+++ message follows +++
This command expects a floating point type. Received a void instead

Should the Intercommunicator be passed from the Python script just like the Intracommunicator? Is there something I am missing?

Here are the relevant code snippets for reference:

The SWIG interface file:

%module openmmplumed

%import(module="simtk.openmm") "swig/OpenMMSwigHeaders.i"
%include "swig/typemaps.i"
%include "std_string.i"
%include "/home/hussein/miniconda3/lib/python3.9/site-packages/mpi4py/include/mpi4py/mpi4py.i"
%mpi4py_typemap(Comm, MPI_Comm);

%{
#include "PlumedForce.h"
#include "OpenMM.h"
#include "OpenMMAmoeba.h"
#include "OpenMMDrude.h"
#include "openmm/RPMDIntegrator.h"
#include "openmm/RPMDMonteCarloBarostat.h"
#include <mpi.h>
%}

%pythoncode %{
import simtk.openmm as mm
%}

namespace PlumedPlugin {

class PlumedForce : public OpenMM::Force {
public:
    PlumedForce(const std::string& script, const MPI_Comm intra_comm, const MPI_Comm inter_comm);
    const std::string& getScript() const;
};

}

The Python script (snippet that involves MPI):

comm1 = MPI.COMM_WORLD
comm2 = MPI.COMM_WORLD
system.addForce(PlumedForce(script, comm1, comm2))
simulation = Simulation(top, system, integrator)

simulation.context.setPositions(pdb.positions)

simulation.minimizeEnergy()

simulation.reporters.append(DCDReporter(DCD_output_file, N_save))

simulation.reporters.append(StateDataReporter(stats_output_file, N_save, step=True, potentialEnergy=True, temperature=True))

simulation.step(100000)

The OpenMM-Plumed kernel:

void ReferenceCalcPlumedForceKernel::initialize(const System& system, const PlumedForce& force) {
  plumedmain = plumed_create();
  int done_already;
  int intra_comm_rank;
  MPI_Initialized(&done_already);
  MPI_Comm intra_comm = force.getIntracom();
  MPI_Comm inter_comm = force.getIntercom();
  MPI_Comm_rank(intra_comm, &intra_comm_rank);
  if (!done_already)
  {
      MPI_Init(NULL, NULL);
  }
  if (intra_comm_rank == 0)
      plumed_cmd(plumedmain, "GREX setMPIIntercomm", &inter_comm);
  plumed_cmd(plumedmain, "GREX setMPIIntracomm", &intra_comm);
  plumed_cmd(plumedmain, "GREX init");
  hasInitialized = true;
  int apiVersion;
  plumed_cmd(plumedmain, "getApiVersion", &apiVersion);
  if (apiVersion < 4)
      throw OpenMMException("Unsupported API version.  Upgrade PLUMED to a newer version.");
  int precision = 8;
  plumed_cmd(plumedmain, "setRealPrecision", &precision);
  double conversion = 1.0;
  plumed_cmd(plumedmain, "setMDEnergyUnits", &conversion);
  plumed_cmd(plumedmain, "setMDLengthUnits", &conversion);
  plumed_cmd(plumedmain, "setMDTimeUnits", &conversion);
  plumed_cmd(plumedmain, "setMDEngine", "OpenMM");
  plumed_cmd(plumedmain, "setLog", force.getLogStream());
  int numParticles = system.getNumParticles();
  plumed_cmd(plumedmain, "setNatoms", &numParticles);
  double dt = contextImpl.getIntegrator().getStepSize();
  plumed_cmd(plumedmain, "setTimestep", &dt);
  double kT = force.getTemperature() * BOLTZ;
  if (kT >= 0.0)
      plumed_cmd(plumedmain, "setKbT", &kT);
  int restart = force.getRestart();
  plumed_cmd(plumedmain, "setRestart", &restart);
  plumed_cmd(plumedmain, "init", NULL);
}
GiovanniBussi commented 1 year ago

Although I managed to run it for one replica, it seems like the plugin doesn't support multiple replicas right now. I modified its source code and passed an MPI communicator from Python to C++ through SWIG, then used plumed_cmd(plumedmain, "setMPIComm", &comm); to set the communicator. Now, the PLUMED output shows that it is running on multiple nodes (whatever number I choose after the -np flag when I run mpirun -np 4 python simulate.py), but the number of replicas shown in the output is still 1. I looked at the GROMACS patch and figured that I need to add another communicator; there should be inter and intra comms, is that right?

Yes, correct.

I passed another one and wrote the following piece of code to make use of the comms:

int done_already;
int intra_comm_rank;
MPI_Initialized(&done_already);
MPI_Comm intra_comm = force.getIntracom();
MPI_Comm inter_comm = force.getIntercom();
MPI_Comm_rank(intra_comm, &intra_comm_rank);
if (!done_already)
{
    MPI_Init(NULL, NULL);
}
if (intra_comm_rank == 0)
    plumed_cmd(plumedmain, "GREX setMPIIntercomm", &inter_comm);
plumed_cmd(plumedmain, "GREX setMPIIntracomm", &intra_comm);
plumed_cmd(plumedmain, "GREX init");

After the changes, the Python script hangs when I run it with MPI and doesn't show any output. However, if I include the GREX commands inside the if (!done_already) block, it shows the following error message:

Traceback (most recent call last):
  File "/mnt/f/CALVADOS/single_chain/ACTR/mpi/simulate.py", line 149, in <module>
    simulation.minimizeEnergy()
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/app/simulation.py", line 137, in minimizeEnergy
    mm.LocalEnergyMinimizer.minimize(self.context, tolerance, maxIterations)
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/openmm.py", line 17208, in minimize
    return _openmm.LocalEnergyMinimizer_minimize(context, tolerance, maxIterations)
openmm.OpenMMException:
+++ PLUMED error
+++ message follows +++
This command expects a floating point type. Received a void instead

Should the Intercommunicator be passed from the Python script just like the Intracommunicator? Is there something I am missing?

My guess is that MPI_Init is called outside, so that done_already is true and you are just skipping those calls. I cannot say where the exception comes from (I realise now that this would be a good case for adding more context with a nested exception, but this requires modifying PLUMED). You can try to set export PLUMED_STACK_TRACE=1 and see the stack trace.

Is the error disappearing if you are not using MPI?

Regarding the communicators you are passing, the code seems correct (provided force.getIntracom() and force.getIntercom() do the right thing).

In the case of the hanging run, can you find out where it's hanging (old-fashioned debugging, putting std::cerr<<"A"<<std::endl; after every line in the C++ code)?

husseinmur commented 1 year ago

Thanks for your quick answer! I placed the MPI_Init outside and it looks like the code is hanging after plumed_cmd(plumedmain, "GREX init"); If I remove it, the script runs and throws that same exception of float expected but received void.

And yes, the exception used to show when I was only passing one comm (and setting it using plumed_cmd(plumedmain, "setMPIComm", &comm);), but it disappeared when I wrapped the OpenMM simulation code with if (rank == 0) in Python.

I tried export PLUMED_STACK_TRACE=1 but no more info is shown. Maybe OpenMM is only returning data to one thread for some reason?

GiovanniBussi commented 1 year ago

Thanks for your quick answer! I placed the MPI_Init outside and it looks like the code is hanging after plumed_cmd(plumedmain, "GREX init"); If I remove it, the script runs and throws that same exception of float expected but received void.

That command is the first one doing a global communication. It will hang if something goes wrong with MPI.

I checked also in PLUMED driver (src/cltools/Driver.cpp) and it seems you also need to call setMPIComm (see here). However, your code hangs before (I think) so this is not the problem.

Could you also find out where the exception is thrown? I don't know why the stack tracing is not working. It is expected to make the message (that is then reported by OpenMM verbatim) much more verbose. Maybe it's not propagated by mpirun? Some MPI versions require setting the environment variables on the command line (e.g. mpirun -x A=1 or something similar).

husseinmur commented 1 year ago

Thanks for your quick answer! I placed the MPI_Init outside and it looks like the code is hanging after plumed_cmd(plumedmain, "GREX init"); If I remove it, the script runs and throws that same exception of float expected but received void.

That command is the first one doing a global communication. It will hang if something goes wrong with MPI.

Is there any way to show what's going wrong with MPI? Is it correct that both intra and inter communicators are instances of MPI.COMM_WORLD or should the intercommunicator have a different type?

I am thinking maybe there is something wrong with how MPI is running the script, because if I do this in the Python script:

if (rank == 0):
   print("A")
   simulation.step(1000)

then run the file with mpirun -np 4 python simulate.py, it prints A three times. This is not normal behavior, I guess?
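For what it's worth, a tiny mpi4py check (just a sketch, launched with the same mpirun) can tell whether the processes really get distinct ranks:

from mpi4py import MPI

comm = MPI.COMM_WORLD
# With mpirun -np 4 this should print four lines with ranks 0..3 and size 4.
# If every process reports rank 0 and size 1, mpi4py is linked against a
# different MPI library than the mpirun used to launch the job.
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")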

I checked also in PLUMED driver (src/cltools/Driver.cpp) and it seems you also need to call setMPIComm (see here). However, your code hangs before (I think) so this is not the problem.

Makes sense. Thanks for pointing out!

Could you also find out where the exception is thrown? I don't know why the stack tracing is not working. It is expected to make the message (that is then reported by OpenMM verbatim) much more verbose. Maybe it's not propagated by mpirun? Some MPI versions require setting the environment variables on the command line (e.g. mpirun -x A=1 or something similar).

The exception is thrown here:

PLUMED:   temperature of the system 2.477710
PLUMED:   MC steps 1
PLUMED:   initial standard errors of the mean 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
PLUMED:   added component to this action:  af_mi.neff
PLUMED:   added component to this action:  af_mi.acceptSigma
PLUMED:   added component to this action:  af_mi.sigmaMean-0
PLUMED:   added component to this action:  af_mi.sigma-0
PLUMED:   added component to this action:  af_mi.sigmaMean-1
PLUMED:   added component to this action:  af_mi.sigma-1
################################################################################
PLUMED:
PLUMED:
PLUMED: +++ PLUMED error
PLUMED: +++ message follows +++
PLUMED: This command expects a floating point type. Received a void instead
PLUMED: If you are sure your code is correct you can disable this check with export PLUMED_TYPESAFE_IGNORE=yes
PLUMED: In case this is necessary, please report an issue to developers of PLUMED and of the MD code
PLUMED: See also https://github.com/plumed/plumed2/pull/653
PLUMED:
PLUMED: ################################################################################

Also here:

PLUMED:   on file BAYES.af
PLUMED:   with format  %f
PLUMED: Action ENDPLUMED
PLUMED:   with label @13
PLUMED: END FILE: (temporary)
PLUMED: 
PLUMED:
PLUMED: ################################################################################
PLUMED:
PLUMED:
PLUMED: +++ PLUMED error
PLUMED: +++ message follows +++
PLUMED: This command expects a floating point type. Received a void instead
PLUMED: If you are sure your code is correct you can disable this check with export PLUMED_TYPESAFE_IGNORE=yes
PLUMED: In case this is necessary, please report an issue to developers of PLUMED and of the MD code
PLUMED: See also https://github.com/plumed/plumed2/pull/653
PLUMED:
PLUMED: ################################################################################

I think both are coming from:

    simulation.minimizeEnergy()
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/app/simulation.py", line 137, in minimizeEnergy
    mm.LocalEnergyMinimizer.minimize(self.context, tolerance, maxIterations)
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/openmm.py", line 17208, in minimize
    return _openmm.LocalEnergyMinimizer_minimize(context, tolerance, maxIterations)
GiovanniBussi commented 1 year ago

Is there any way to show what's going wrong with MPI? Is it correct that both intra and inter communicators are instances of MPI.COMM_WORLD or should the intercommunicator have a different type?

No it is not! If every replica is simulated with one process:

- the intracommunicator should contain only the single process running that replica (MPI_COMM_SELF, i.e. MPI.COMM_SELF in mpi4py);
- the intercommunicator should contain one process per replica (here, MPI_COMM_WORLD).

Maybe this is the problem? This cannot be diagnosed with your initial script, because it is not clear what force.getIntracom() returns. I assumed it's an OpenMM-defined communicator.
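As a sketch, the Python side could then pass the two communicators like this (assuming the modified three-argument PlumedForce constructor used in this thread, with system and script as in the earlier snippet):

from mpi4py import MPI
from openmmplumed import PlumedForce  # locally modified build with MPI support

# One process per replica: the intracommunicator spans only this process,
# while the intercommunicator connects one process per replica.
intra_comm = MPI.COMM_SELF
inter_comm = MPI.COMM_WORLD

system.addForce(PlumedForce(script, intra_comm, inter_comm))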

husseinmur commented 1 year ago

Updating the MPI communicator types as you suggested solved the hanging problem. The PLUMED output now shows that it is running on multiple replicas. Thank you! However, the same exception is still thrown:

Traceback (most recent call last):
  File "/mnt/f/CALVADOS/single_chain/ACTR/mpi/simulate.py", line 146, in <module>
    simulation.minimizeEnergy()
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/app/simulation.py", line 137, in minimizeEnergy
    mm.LocalEnergyMinimizer.minimize(self.context, tolerance, maxIterations)
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/openmm.py", line 17208, in minimize
    return _openmm.LocalEnergyMinimizer_minimize(context, tolerance, maxIterations)
openmm.OpenMMException:
+++ PLUMED error
+++ message follows +++
This command expects a floating point type. Received a void instead
If you are sure your code is correct you can disable this check with export PLUMED_TYPESAFE_IGNORE=yes
In case this is necessary, please report an issue to developers of PLUMED and of the MD code
See also https://github.com/plumed/plumed2/pull/653

I tried to run the simulation without MPI, and the full traceback still doesn't show. Is there anything else that can be done to debug this?

GiovanniBussi commented 1 year ago

The raised exception says that the plumed_cmd has been passed an incorrect type.

You should find out which call is raising the issue, so we can see if there's really a bug in the way plumed_cmd was called or if there's a bug in PLUMED.

  1. I made a mistake in the reply above: the correct setting is export PLUMED_STACK_TRACE=yes. However, this will just give you the location in the code with some approximate line numbers. It's sometimes not easy to detect the exact function call (see the sketch after this list for one way to make sure the variable reaches every MPI process).

  2. If this doesn't work, you might patiently add prints to find out at which plumed_cmd call the code is crashing.
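If mpirun does not forward exported variables, one possible workaround (just a sketch) is to set the variable from inside the Python script, before the simulation is built, so that every MPI process sees it:

import os

# PLUMED reads this variable from the process environment when it formats
# the error message, so setting it here does not depend on mpirun
# forwarding the shell environment to the Python processes.
os.environ["PLUMED_STACK_TRACE"] = "yes"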

husseinmur commented 1 year ago

Thank you! I have tried as you suggested and it looks like the exception is thrown here:

double virial[9];
plumed_cmd(plumedmain, "setVirial", &virial);

It is weird because this used to work fine with no MPI involved.

However, when I pass a pointer to the first element instead of the whole array, like this:

double virial[9];
plumed_cmd(plumedmain, "setVirial", &virial[0]);

The code works but then OpenMM throws an exception:

Traceback (most recent call last):
  File "/mnt/f/CALVADOS/single_chain/ACTR/mpi/simulate.py", line 154, in <module>
    simulation.step(100000)
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/app/simulation.py", line 141, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/app/simulation.py", line 206, in _simulate
    self.integrator.step(10) # Only take 10 steps at a time, to give Python more chances to respond to a control-c.
  File "/home/hussein/miniconda3/envs/try4/lib/python3.9/site-packages/openmm/openmm.py", line 14755, in step
    return _openmm.LangevinIntegrator_step(self, steps)
openmm.OpenMMException: Particle coordinate is NaN.  For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan

I thought maybe the simulation was exploding for some reason. So instead of minimizing the energy in each replica, I saved a "post-energy-minimization" checkpoint from a non-MPI simulation and then started the replica simulations from there. All exceptions disappeared and the code ran through.
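In code, that workflow looks roughly like this (a sketch; the checkpoint file name is a placeholder, and simulation and pdb are the objects from the snippet above):

# Plain, non-MPI run: minimize once and save a checkpoint.
simulation.context.setPositions(pdb.positions)
simulation.minimizeEnergy()
simulation.saveCheckpoint("minimized.chk")

# Each MPI replica: skip minimizeEnergy() and restart from the checkpoint.
simulation.loadCheckpoint("minimized.chk")
simulation.step(100000)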

Is what I have done correct? Why is it exploding when minimizing the energy in all replicas?

GiovanniBussi commented 1 year ago

plumed_cmd(plumedmain, "setVirial", &virial);

This is an incorrect type being passed (though the pointer actually has the same value, so it would work if it were not for the type checking).

plumed_cmd(plumedmain, "setVirial", &virial[0]);

This is the correct thing to do.

I suspect the code is now correct and there might be something wrong with the input or setup.

Can you try to use PLUMED to do something trivial, e.g. add a force equal to 1e-10 to the distance between two atoms? The system should behave as if PLUMED was not used (except for tiny numerical differences).
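As an illustration, such a check could be as simple as the following sketch (the atom indices and file names are arbitrary, and the three-argument PlumedForce constructor is the locally modified one from this thread):

# A bias this small should leave the dynamics unchanged up to round-off,
# so any instability must come from the setup rather than from PLUMED.
script = """
d: DISTANCE ATOMS=1,2
RESTRAINT ARG=d AT=0.0 SLOPE=1e-10
PRINT ARG=d FILE=COLVAR STRIDE=500
"""
system.addForce(PlumedForce(script, intra_comm, inter_comm))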