salilab / imp_analysis_tutorial


running modeling.py in a cluster environment #2

Open heejongkim opened 3 years ago

heejongkim commented 3 years ago

Hi,

I would like to run the computationally expensive modeling.py across multiple nodes. However, it seems that modeling.py and its associated scripts are written for a single machine. Do you have any examples or recommendations on how to accomplish that? I assume I need to use MPI, but I'm not sure how to properly modify the scripts to maximize speed, efficiency, and replica exchange across nodes.

Thanks.

best, hee jong

saltzberg commented 3 years ago

Hi Hee Jong,

To run parallel-processing replica exchange, IMP must be compiled with MPI support, e.g. using mpicxx. To do this when installing IMP, use the CMake flag -DCMAKE_CXX_COMPILER=/usr/local/bin/mpicxx. You can read more about the CMake flags for installing IMP here.

One can then use mpirun to initiate a parallel job, e.g.:

mpirun -np 4 python modeling.py

which will perform a single modeling run with four replicas.

Running multiple modeling runs on a cluster requires setting up a script specific to that cluster's scheduling software and architecture. Once you can successfully run a single parallel replica exchange simulation using the command above, you should be able to use that same line in your cluster submission script.
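For instance, one way to run several independent modeling runs is a scheduler job array, with each array task launching one replica exchange run via mpirun. The sketch below assumes SLURM; the partition name, module name, task counts, and the convention that modeling.py accepts an output directory argument are all placeholders to adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --partition=defq          # placeholder queue name
#SBATCH --array=1-10              # ten independent modeling runs
#SBATCH --ntasks=4                # four MPI ranks -> four replicas per run
#SBATCH --output=logs/%A_%a.out
#SBATCH --error=logs/%A_%a.err

module load imp                   # assumes an IMP module built with MPI support

# Each array task is one replica exchange run; give each its own output
# directory so the runs do not overwrite each other's stat files.
mpirun -np "$SLURM_NTASKS" python modeling.py "run_${SLURM_ARRAY_TASK_ID}"
```

The number of MPI ranks (`--ntasks`) sets the number of replicas per run, while the array size sets the number of independent runs.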

benmwebb commented 3 years ago

To install IMP, use the CMAKE flag -DCMAKE_CXX_COMPILER=/usr/local/bin/mpicxx

This isn't a great idea because it will result in all of IMP being compiled with MPI. Only the IMP::mpi module needs to be compiled with MPI. As long as mpicxx and friends are in your PATH, CMake should do the right thing. Most of the prebuilt IMP binaries (e.g. Homebrew, Anaconda, RPM) are built with MPI support.

heejongkim commented 3 years ago

Thanks to you both.

@saltzberg I already compiled IMP with the cluster's mpicxx and set it up as a module. What I'm actually confused about is rnapolii/modeling/run_rnapolii_modeling.sh looping over N and n_steps. This repo's modeling.py takes that information plus an output path as arguments, so I wanted to make sure how to properly edit those to meet mpirun's "expectations".

For example, following the earlier RNA Pol II tutorial, I made the following SLURM script to submit the modeling job.

#!/usr/bin/bash
#SBATCH --partition=defq
#SBATCH --output=logfiles/%j.out
#SBATCH --error=logfiles/%j.err
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=48

module load imp/2.13.0  ## this will automatically load as well as unload dependencies and conflicts
mpirun --map-by node python modeling.py  ## instead of using -np, used --map-by coupled with --ntasks-per-node to specify the number of tasks per node

It would be awesome if you could help me convert the for loop in the bash script to an mpirun command. Thank you for your guidance.
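One common pattern (a sketch only, not the tutorial's own code; the function and argument names are placeholders) is to move the bash for loop into a scheduler job array and derive the per-run settings inside modeling.py, so that every mpirun invocation is a single replica exchange run writing to its own directory:

```python
import os
import sys

def run_parameters(argv, env):
    """Choose a per-run output directory and frame count (hypothetical helper).

    The run index comes from SLURM_ARRAY_TASK_ID when the job is a SLURM
    array task, otherwise from the first command-line argument; the frame
    count comes from the second argument, with a fallback default.
    """
    run_id = env.get("SLURM_ARRAY_TASK_ID") or (argv[1] if len(argv) > 1 else "0")
    num_frames = int(argv[2]) if len(argv) > 2 else 20000
    return "output_run_%s" % run_id, num_frames

if __name__ == "__main__":
    output_dir, num_frames = run_parameters(sys.argv, os.environ)
    # output_dir would then be passed to the replica exchange macro as its
    # output directory, and num_frames as the number of frames to sample.
    print(output_dir, num_frames)
```

With this, the loop disappears: the scheduler runs N array tasks, each calling `mpirun ... python modeling.py` once, and no two runs share an output directory.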

@benmwebb Would it cause any serious issues if I set CXX_COMPILER to mpicxx? Due to the complexity of the cluster environment, I prefer to be explicit, so I set that up and compiled. If I don't, it's sometimes a little difficult to keep track of which compiler/libraries were used for a specific build. Thank you for your insight.

heejongkim commented 3 years ago

So, I just changed global_output_directory="output" in ReplicaExchange0, changed num_frames to a fixed value instead of taking the number from the command line via sys.argv, and used my SLURM script to submit with mpirun.

I've been watching the log and the queue for an hour, and it seems to be emitting the expected outputs without failing. If you have any other suggestions for improvement, please let me know.

Thanks!

heejongkim commented 3 years ago

Ah... it seems like it's hitting something with mpirun.

Same data, same topology, and an almost identical modeling.py (the only change is the sys.argv portion). Even for a single-node run, mpirun spits out the following error, while run_rnapolii_modeling.sh starts iterating fine.

Traceback (most recent call last):
  File "__init__.py", line 141
    max_srb_rot=0.3)
  File "/cm/shared/apps/imp/2.13.0/lib64/python3.7/site-packages/IMP/pmi/macros.py", line 723, in execute_macro
    self.root_hier = self.system.build()
  File "/cm/shared/apps/imp/2.13.0/lib64/python3.7/site-packages/IMP/pmi/topology/__init__.py", line 155, in build
    state.build(**kwargs)
  File "/cm/shared/apps/imp/2.13.0/lib64/python3.7/site-packages/IMP/pmi/topology/__init__.py", line 260, in build
    mol.build(**kwargs)
  File "/cm/shared/apps/imp/2.13.0/lib64/python3.7/site-packages/IMP/pmi/topology/__init__.py", line 747, in build
    self, rep, self.coord_finder, rephandler)
  File "/cm/shared/apps/imp/2.13.0/lib64/python3.7/site-packages/IMP/pmi/topology/system_tools.py", line 275, in build_representation
    model)
  File "/cm/shared/apps/imp/2.13.0/lib64/python3.7/site-packages/IMP/isd/gmm_tools.py", line 40, in decorate_gmm_from_text
    weight=float(fields[2])
IndexError: list index out of range

Any suggestions and/or insights are very much appreciated.
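The traceback suggests decorate_gmm_from_text found a data line with fewer than three '|'-separated fields (the weight is read from fields[2]), which can happen if the GMM text file is truncated or malformed, e.g. if it was being rewritten while the job started. A small diagnostic sketch (the function name and the assumption of '|'-separated fields are mine, inferred from the traceback, not taken from the tutorial):

```python
def short_gmm_lines(path, min_fields=3):
    """Return (line_number, line) pairs for non-comment lines with fewer
    than `min_fields` '|'-separated fields.

    Sketch based on the traceback: the weight is parsed from fields[2],
    so any data line with fewer than three fields raises IndexError.
    """
    bad = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            stripped = line.strip()
            # Skip blank lines and '#' comment/header lines.
            if not stripped or stripped.startswith("#"):
                continue
            if len(stripped.split("|")) < min_fields:
                bad.append((lineno, stripped))
    return bad
```

Running this over the .gmm.txt files used by the run would show whether the file itself is damaged; if several MPI ranks regenerate the same GMM file at once, pre-computing it once before launching mpirun is one thing to try.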

Thanks.