openmm / pdbfixer

PDBFixer fixes problems in PDB files

Issue with Multiprocessing #277

Closed: cplateVT closed this issue 11 months ago

cplateVT commented 11 months ago

I am trying to perform free energy calculations on a very large number of proteins. I wrote a Python routine that first processes the RCSB structure using the PDBFixer module and then performs minimization and MM/GBSA calculations. The routine works properly in serial, but when I parallelize it over multiple cores on a single node using a multiprocessing Pool, it stalls indefinitely, without any error, when the PDBFixer class is instantiated. All other parts of my routine parallelize smoothly. Below I've isolated the protein preparation routine that uses PDBFixer and applied a pool to a few hundred tasks so you can replicate the stalling behavior. I would appreciate any insight into why I might be experiencing this issue, and what changes I could make to the pdbfixer code or my own code to resolve it. Thanks!

import pandas as pd
import os
from multiprocessing import Pool
from pdbfixer import PDBFixer
from openmm.app import PDBFile
import wget
os.environ['OPENBLAS_NUM_THREADS'] = '1'

input_csv = 'core.csv'
n_tasks = 128  # number of worker processes in the Pool
outdir = 'output'
if not os.path.exists(outdir):
    os.makedirs(outdir)

df = pd.read_csv(input_csv, index_col=0)
all_pdbs = df.index.tolist()  # the index column holds the PDB IDs to process

def job(pdbid):
    if not os.path.exists(os.path.join(outdir, pdbid)):
        os.mkdir(os.path.join(outdir, pdbid))
    # Download the raw structure from the RCSB
    pdburl = 'https://files.rcsb.org/download/' + pdbid.upper() + '.pdb'
    rcsbpdb = os.path.join(outdir, pdbid, 'dirty.pdb')
    cleanpdb = os.path.join(outdir, pdbid, 'clean.pdb')
    wget.download(pdburl, out=rcsbpdb)
    # Repair the structure with PDBFixer
    fixer = PDBFixer(filename=rcsbpdb)
    fixer.findMissingResidues()
    fixer.findNonstandardResidues()
    fixer.replaceNonstandardResidues()
    fixer.removeHeterogens(False)
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()
    fixer.addMissingHydrogens(7.0)
    # Write out the cleaned structure
    with open(cleanpdb, 'w') as f:
        PDBFile.writeFile(fixer.topology, fixer.positions, f)
    return

if __name__ == '__main__':
    with Pool(n_tasks) as p:
        dat = p.map(job, all_pdbs)

peastman commented 11 months ago

Can you identify exactly which line it's hanging at? Try adding print() statements after each line so you can see how far it gets.
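
For illustration, here is a minimal sketch of that kind of instrumentation, trimmed to the PDBFixer calls from the script above; the checkpoint names and the extra rcsbpdb argument are only there to keep the sketch self-contained and are not part of the original code:

from pdbfixer import PDBFixer

def job(pdbid, rcsbpdb):
    def checkpoint(msg):
        # flush=True so each message appears immediately, even from a worker process
        print(pdbid, msg, flush=True)

    checkpoint('started')
    fixer = PDBFixer(filename=rcsbpdb)
    checkpoint('instantiated')
    fixer.findMissingResidues()
    checkpoint('missing_residues_found')
    fixer.findNonstandardResidues()
    checkpoint('nonstandard_residues_found')
    fixer.replaceNonstandardResidues()
    checkpoint('nonstandard_residues_replaced')
    fixer.removeHeterogens(False)
    checkpoint('heterogens_removed')
    fixer.findMissingAtoms()
    checkpoint('missing_atoms_found')
    fixer.addMissingAtoms()
    checkpoint('missing_atoms_added')
    fixer.addMissingHydrogens(7.0)
    checkpoint('missing_hydrogens_added')
    checkpoint('finished')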

cplateVT commented 11 months ago

Great idea! Here is the output from a debugging script with a print statement after each step. The program never returned after 10 minutes. Out of 285 jobs, 167 finished while 118 remain stalled.

started: 285 finished: 167 running: 118

instantiated: 285
missing_residues_found: 285
nonstandard_residues_found: 285
nonstandard_residues_replaced: 285
heterogens_removed: 285
missing_atoms_found: 285
missing_atoms_added: 190
missing_hydrogens_added: 167

As you can see, the program seems to deadlock while adding missing atoms! It does continue to work through the jobs, but at an increasingly slow pace, and even after two hours it never finishes.

peastman commented 11 months ago

That's the first function that tries to do any simulation with OpenMM. Trying to parallelize it is probably not a good idea. I suspect you're overloading your GPU, asking it to run hundreds of simulations at once. Or if it's picking the CPU platform, each simulation spawns enough threads to use all your CPU cores, leading to thousands of threads all trying to run at once.

Try setting your Pool to use a much smaller number of processes, maybe something in the range of about 3 to 5.
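
For example, keeping the rest of the script unchanged (4 is just an illustrative value, not a tuned recommendation):

if __name__ == '__main__':
    # a handful of workers rather than one per core, so each simulation
    # has headroom instead of oversubscribing the machine
    with Pool(4) as p:
        dat = p.map(job, all_pdbs)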

It also would be informative to know which platform it's picking and what device it's running on. Try executing this script. What is the output?

from openmm import *

# Build a trivial one-particle system just to force OpenMM to choose a platform
system = System()
system.addParticle(1.0)
integrator = VerletIntegrator(0.001)
context = Context(system, integrator)

# Report the selected platform and its properties
print(context.getPlatform().getName())
for prop in context.getPlatform().getPropertyNames():
    print(prop, context.getPlatform().getPropertyValue(context, prop))

cplateVT commented 11 months ago

This makes a lot of sense to me. I suspected that the OpenMM module was trying to use many cores, leading to resource contention. That being said, is there a way to force OpenMM to use only one core, so that I can run all the jobs in parallel (even if each one takes longer)? This would be ideal for my situation and research goals. Here is the output you were looking for:

CPU
Threads 128
DeterministicForces false

peastman commented 11 months ago

Set the environment variable OPENMM_CPU_THREADS to 1. That will tell it to create only one thread for each simulation.

cplateVT commented 11 months ago

I added: os.environ['OPENMM_CPU_THREADS'] = '1'

The behavior appears to be the same:

started: 285 finished: 187 running: 98

instantiated: 285
missing_residues_found: 285
nonstandard_residues_found: 285
nonstandard_residues_replaced: 285
heterogens_removed: 285
missing_atoms_found: 285
missing_atoms_added: 206
missing_hydrogens_added: 187

After 40 minutes, it is completely stalled. It has been stuck at 187 finished jobs for 10 minutes now without any progress.

peastman commented 11 months ago

I don't think you can set it from within the Python script. By that point OpenMM has probably already been initialized and read the value of the environment variable. It might work if you set it on the very first line, before any imports.
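
In other words, something along these lines at the very top of the script, before anything that imports openmm or pdbfixer (a sketch of the ordering described above, reusing the imports from the original script):

import os
os.environ['OPENMM_CPU_THREADS'] = '1'  # must be set before openmm/pdbfixer are imported

import pandas as pd
from multiprocessing import Pool
from pdbfixer import PDBFixer
from openmm.app import PDBFile
import wget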

cplateVT commented 11 months ago

RESOLVED. Setting the environment variable before importing openmm did the trick. It finished all of the jobs in ~30 minutes. Most of the jobs finished within 5 minutes, but a few exceedingly large proteins took the rest of the time to complete.

Thank you SO much for your help!

peastman commented 11 months ago

Great, glad that worked! Is it ok to close this issue?

cplateVT commented 11 months ago

YES THANKS