plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0

Errors in writing output #882

Open peastman opened 1 year ago

peastman commented 1 year ago

I'm investigating a problem reported by a user who is using PLUMED with OpenMM: https://github.com/openmm/openmm-plumed/issues/65. They report that it creates an empty COLVARS file, while writing the actual output to bck.0.COLVARS. However, it only happens when running the simulation on GPU, not on CPU. And it happens for one system, but not for others. And it happens if they perform an energy minimization at the start of the simulation, but not if they omit the energy minimization.

Given the inconsistent behavior, I suspect it may be a threading issue. Also of note: when running a simulation on the CPU, OpenMM invokes PLUMED on the main thread, but when running the simulation on the GPU, it invokes it from a different thread.

Any ideas on what the problem could be?

Thanks!

GiovanniBussi commented 1 year ago

@peastman @smliu1997 thanks for reporting.

PLUMED is expected to work correctly with multiple threads, provided you have a separate PLUMED instance in each thread. There's a regression bug in v2.8.0 that might cause random crashes in this case (fixed at the tip of the v2.8 branch; the fix will be released as v2.8.2 at some point). In addition, some contributed collective variables use storage that is not thread-safe, but they are not used in the reported example. However, I don't expect any issue with any version if a single PLUMED object is created in one thread, then passed by reference or pointer (not copied) to another thread, and then accessed only from that other thread.

So, I would tend to rule out the threading issue, if there is a single Plumed object.

Is it possible that there are two Plumed objects simultaneously doing the same thing?

What the user is reporting might be caused by the following sequence:

Is it possible that you are creating two Plumed objects simultaneously reading the same input, and then only one of them is calling cmd("calc")?
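A minimal sketch of how two objects opening the same output could produce the reported symptom. This is not PLUMED code: `openWithBackup` and `reproduceSymptom` are hypothetical helpers that only mimic the "back up an existing file as bck.N.name, then open a fresh one" behavior, and the sketch assumes POSIX semantics, where an open file handle keeps pointing at the same file after it is renamed.

```cpp
#include <cstdio>
#include <string>

// Size of a file in bytes, or -1 if it cannot be opened.
static long fileSize(const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return -1;
    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fclose(f);
    return size;
}

// Hypothetical stand-in for the backup-on-open behavior (not the PLUMED API):
// an existing file is renamed to the first free bck.N.name slot, then a
// fresh, empty file is opened for writing.
std::FILE* openWithBackup(const std::string& name) {
    if (fileSize(name) >= 0) {
        int n = 0;
        while (fileSize("bck." + std::to_string(n) + "." + name) >= 0) ++n;
        const std::string backup = "bck." + std::to_string(n) + "." + name;
        std::rename(name.c_str(), backup.c_str());
    }
    return std::fopen(name.c_str(), "w");
}

// Returns true if the symptom is reproduced: `name` ends up empty while
// bck.0.name holds the data written by the first object.
bool reproduceSymptom(const std::string& name) {
    const std::string backup = "bck.0." + name;
    std::remove(name.c_str());
    std::remove(backup.c_str());

    std::FILE* first = openWithBackup(name);   // object A reads the input
    std::FILE* second = openWithBackup(name);  // object B reads the same input; A's file is renamed
    if (first) {
        std::fputs("0.0 1.23\n", first);       // only object A ever computes and prints
        std::fclose(first);
    }
    if (second) std::fclose(second);           // object B never writes anything

    const bool symptom = first && second &&
                         fileSize(name) == 0 && fileSize(backup) > 0;
    std::remove(name.c_str());
    std::remove(backup.c_str());
    return symptom;
}
```

Under these assumptions, calling `reproduceSymptom("COLVARS_demo")` leaves the data in the backup file and an empty file under the original name, which matches what the user describes.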

peastman commented 1 year ago

Thanks, that's very helpful.

I do have a hypothesis about what might be happening, given the particular situation where this happens: running energy minimization on the GPU. OpenMM's CUDA platform accumulates forces in fixed point, which limits the maximum representable force to about +/- 2 billion. Of course, you never encounter forces that large in simulations (if you did, that would mean something had gone terribly wrong), so that isn't a problem. But the energy minimizer does occasionally produce unphysical conformations with huge forces as part of its internal line minimizations. When that happens we detect it, create a new context using the CPU platform, and repeat the force calculation for the problematic conformation.

I suspect that's happening here. Two contexts get created, and each one creates its own PLUMED instance. It would explain the behavior.
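To make the fixed-point limit mentioned above concrete, here is a minimal sketch. It assumes a 64-bit signed accumulator with 32 fractional bits, a layout chosen for illustration because it gives a range of about +/- 2 billion; `toFixed` and `kScale` are hypothetical names, not OpenMM identifiers.

```cpp
#include <cmath>
#include <cstdint>

// Assumption: 32 fractional bits, so one force unit is stored as 2^32.
constexpr double kScale = 4294967296.0;  // 2^32

// Converts a force component to fixed point. Returns false when the value
// does not fit in a signed 64-bit integer, i.e. when |force| exceeds
// roughly 2^31 (about 2.1 billion) under the assumed scale.
bool toFixed(double force, std::int64_t& out) {
    const double scaled = force * kScale;
    const double limit = 9223372036854775808.0;  // 2^63
    if (!std::isfinite(scaled) || scaled >= limit || scaled < -limit) {
        return false;  // overflow: the caller would need a fallback path
    }
    out = static_cast<std::int64_t>(scaled);
    return true;
}
```

A force of 2.0e9 still fits, while 3.0e9 does not. That failing case corresponds to the situation in which the force calculation is repeated in a new CPU context.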

I'm not sure what to do about it, but that's a different question. OpenMM doesn't expect creating a context to have side effects.

GiovanniBussi commented 1 year ago

Yes I think that's the reason. However, it is strange that the first context can write stuff whereas the second cannot.

In theory, if the two contexts were created and destroyed sequentially, you would expect:

If this is what's happening, then the user should just concatenate the two files.

To make things more straightforward, if within OpenMM you know that "this is the second context", you can use:

int res=1;
plumed_cmd(plumedmain,"setRestart",&res);

to tell PLUMED that this is a restart. If the sequence is correct (i.e., the previous file was closed), the COLVARS file will be concatenated.

However, you have to make sure that the first object is finalized (to flush all files) before the second object reads its input.
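The ordering requirement can be illustrated with a small stand-alone sketch. It does not use the PLUMED API: `openOutput` and `linesAfterTwoContexts` are hypothetical helpers, and "restart" is modeled simply as opening the file in append mode.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for an object that owns an output file.
// restart == true is modeled as "append to the existing file".
std::FILE* openOutput(const std::string& name, bool restart) {
    return std::fopen(name.c_str(), restart ? "a" : "w");
}

// Runs two "contexts" one after the other and returns the number of lines
// found in the output file afterwards, or -1 on an I/O error.
long linesAfterTwoContexts(const std::string& name) {
    std::remove(name.c_str());

    std::FILE* first = openOutput(name, false);
    if (!first) return -1;
    std::fputs("0 1.00\n", first);
    std::fclose(first);  // finalize the first object: flushes and closes the file

    std::FILE* second = openOutput(name, true);  // second object, opened as a restart
    if (!second) return -1;
    std::fputs("1 1.10\n", second);
    std::fclose(second);

    // Count the lines that ended up in the file.
    std::FILE* in = std::fopen(name.c_str(), "r");
    if (!in) return -1;
    long lines = 0;
    for (int c = std::fgetc(in); c != EOF; c = std::fgetc(in)) {
        if (c == '\n') ++lines;
    }
    std::fclose(in);
    std::remove(name.c_str());
    return lines;
}
```

Because the first file is closed before the second one is opened, both records end up in a single file, in order.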

Could this be a solution?

PS I am not 100% sure how this solution would work on a network file system, but I guess that it should work since the two Plumed objects will see the same files even if they were not physically written yet.

GiovanniBussi commented 1 year ago

Additionally:

> OpenMM doesn't expect creating a context to have side effects.

Taken literally, this is incompatible with history dependent methods such as metadynamics. Unless you trigger a restart as mentioned above, which brings the history information to the new context. But still it requires the side effect of "writing a file".

peastman commented 1 year ago

> However, it is strange that the first context can write stuff whereas the second cannot.

@smliu1997 are you certain the output in bck.0.COLVARS is written during the simulation? Is it possible it just reflects some force evaluations done during energy minimization, and nothing more gets written after that?

> Taken literally, this is incompatible with history dependent methods such as metadynamics.

Correct, history dependent methods can't be implemented from entirely inside a Force object. They need to be implemented at a higher level, either by the integrator or by tracking the history outside of the context.

invemichele commented 1 year ago

From what I understand, the issue comes from the energy minimization routine being performed with PLUMED running. However, I don't see a scenario in which it is useful for PLUMED to write to file during an energy minimization, so a possible solution could be to turn off all PRINT commands during it. Even more important is to turn off any update during energy minimization. You don't want your metadynamics to start depositing Gaussians during energy minimization! I would say it's good practice to add the history dependent PLUMED force only after the system has been equilibrated.

However, I guess it can make sense to run energy minimization with some fixed plumed force (e.g. for umbrella sampling), and in that case the calculate plumed routine needs to be called.

smliu1997 commented 1 year ago

>> However, it is strange that the first context can write stuff whereas the second cannot.

> @smliu1997 are you certain the output in bck.0.COLVARS is written during the simulation? Is it possible it just reflects some force evaluations done during energy minimization, and nothing more gets written after that?

>> Taken literally, this is incompatible with history dependent methods such as metadynamics.

> Correct, history dependent methods can't be implemented from entirely inside a Force object. They need to be implemented at a higher level, either by the integrator or by tracking the history outside of the context.

I think the output in bck.0.COLVARS is the output for the production run, not energy minimization, because if I change the number of steps in the production run, then the output length changes correspondingly.