Can you identify exactly which line it's hanging at? Try sticking print()
commands after each line so you can see how far it gets.
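For example, something along these lines (a rough sketch; `step_one` and `step_two` are just placeholders for whatever calls your routine actually makes):

```python
# Sketch of print-based progress tracing: print after each step, with flush=True
# so output appears immediately even if stdout is buffered in the worker.
def step_one(pdb_path): pass   # placeholder
def step_two(pdb_path): pass   # placeholder

def prepare(pdb_path):
    print(pdb_path, "starting", flush=True)
    step_one(pdb_path)
    print(pdb_path, "step one done", flush=True)
    step_two(pdb_path)
    print(pdb_path, "step two done", flush=True)
```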
Great idea! Here is the output from a debugging script with print statements after each step. The program never returned; after 10 minutes, 167 of the 285 jobs had finished while the other 118 remained stalled.
```
started: 285    finished: 167    running: 118

instantiated: 285
missing_residues_found: 285
nonstandard_residues_found: 285
nonstandard_residues_replaced: 285
heterogens_removed: 285
missing_atoms_found: 285
missing_atoms_added: 190
missing_hydrogens_added: 167
```
As you can see, the program appears to deadlock while adding missing atoms! It does keep working through the jobs, but at an increasingly slow pace, and even after two hours it had not finished.
That's the first function that tries to do any simulation with OpenMM. Trying to parallelize it is probably not a good idea. I suspect you're overloading your GPU, asking it to run hundreds of simulations at once. Or if it's picking the CPU platform, each simulation spawns enough threads to use all your CPU cores, leading to thousands of threads all trying to run at once.
Try setting your Pool to use a much smaller number of processes, maybe in the range of 3 to 5.
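For example, something like this (a sketch only; `prepare` stands in for your actual per-structure routine and the file list is a placeholder):

```python
# Sketch: cap the Pool at a few worker processes instead of one per job.
from multiprocessing import Pool

def prepare(pdb_path):
    # placeholder for the real PDBFixer / OpenMM preparation routine
    return pdb_path

if __name__ == "__main__":
    pdb_files = ["prot_%d.pdb" % i for i in range(300)]  # placeholder inputs
    with Pool(processes=4) as pool:   # try something in the 3-5 range
        results = pool.map(prepare, pdb_files)
```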
It would also be informative to know which platform it's picking and what device it's running on. Try executing this script; what is the output?
```python
from openmm import *

# Build a trivial one-particle system just to see which platform OpenMM picks
# and what properties it is using.
system = System()
system.addParticle(1.0)
integrator = VerletIntegrator(0.001)
context = Context(system, integrator)

print(context.getPlatform().getName())
for prop in context.getPlatform().getPropertyNames():
    print(prop, context.getPlatform().getPropertyValue(context, prop))
```
This makes a lot of sense to me. I suspected that the OpenMM module was attempting to use many cores, leading to resource availability issues. That said, is there a way to force OpenMM to use only one core, so that I can run all the jobs in parallel (even if each one takes longer)? That would be ideal for my situation and research goals. Here is the output you were looking for:
```
CPU
Threads 128
DeterministicForces false
```
Set the environment variable `OPENMM_CPU_THREADS` to `1`. That will tell it to create only one thread for each simulation.
I added: `os.environ['OPENMM_CPU_THREADS'] = '1'`

The behavior appears to be the same:

```
started: 285    finished: 187    running: 98

instantiated: 285
missing_residues_found: 285
nonstandard_residues_found: 285
nonstandard_residues_replaced: 285
heterogens_removed: 285
missing_atoms_found: 285
missing_atoms_added: 206
missing_hydrogens_added: 187
```
After 40 minutes, it is completely stalled. It has been stuck at 187 finished jobs for 10 minutes now without any progress.
I don't think you can set it from inside the Python script. By that point OpenMM has probably already been initialized and read the value of the environment variable. It might work if it's the very first line, before any imports.
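That is, something along these lines at the very top of the script (a sketch; the exact imports will depend on your routine):

```python
# The environment variable must be set before openmm/pdbfixer are imported,
# so make it the very first thing in the script.
import os
os.environ["OPENMM_CPU_THREADS"] = "1"

from pdbfixer import PDBFixer   # imported only after the variable is set
from openmm import app
```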
RESOLVED. Setting the environment variable before importing openmm did the trick. It finished all of the jobs in ~30 minutes. Most of the jobs finished within 5 minutes, but a few exceedingly large proteins took the rest of the time to complete.
Thank you SO much for your help!
Great, glad that worked! Is it ok to close this issue?
YES THANKS
I am trying to perform free energy calculations on a very large number of proteins. I wrote a routine in Python which first processes the RCSB structure using the PDBFixer Python module and then performs minimization and MM/GBSA calculations. The routine works properly in serial, but when I parallelize it over multiple cores on a single node using a multiprocessing Pool, it stalls indefinitely, without error, when the PDBFixer class is instantiated. All other parts of my routine parallelize smoothly. Below I've isolated the protein preparation routine using PDBFixer and applied a pool to tackle a few hundred tasks in parallel so you can replicate the stalling behavior. I would appreciate any insight into why I might be experiencing this issue, and what changes I could make to the pdbfixer code or my own code to resolve it. Thanks!
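Roughly along these lines (a minimal sketch of the kind of setup just described, not the exact script; file names and output handling are placeholders):

```python
# Sketch: prepare each structure with PDBFixer, then run many copies in a Pool.
from multiprocessing import Pool
from pdbfixer import PDBFixer
from openmm.app import PDBFile

def prepare(pdb_path):
    fixer = PDBFixer(filename=pdb_path)
    fixer.findMissingResidues()
    fixer.findNonstandardResidues()
    fixer.replaceNonstandardResidues()
    fixer.removeHeterogens(keepWater=False)
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()
    fixer.addMissingHydrogens(7.0)
    with open(pdb_path.replace(".pdb", "_fixed.pdb"), "w") as out:
        PDBFile.writeFile(fixer.topology, fixer.positions, out)

if __name__ == "__main__":
    pdb_files = ["prot_%d.pdb" % i for i in range(300)]  # placeholder inputs
    with Pool() as pool:   # defaults to one worker per core
        pool.map(prepare, pdb_files)
```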