radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Support GPU MPS on Summit #100

mturilli closed this issue 4 years ago.

mturilli commented 4 years ago

@wjlei1990 to provide some details in this ticket.

wjlei1990 commented 4 years ago

@wjlei1990 to finish the doc: https://docs.google.com/document/d/1uugJsRRSTHDhMFb3NWaRRPwyoMiNPKn6sr2iO3IqMaQ/edit#heading=h.k670rad7dcz1

wjlei1990 commented 4 years ago

Summit supports GPU MPS (NVIDIA's Multi-Process Service). The user just needs to add one extra line to their batch script:

#BSUB -alloc_flags gpumps

Then jsrun has to be configured to allocate resources accordingly. Using our current SPECFEM software as an example: we use 384 MPI ranks, so previously the job used 384 CPU cores and 384 GPUs (each CPU core and each GPU handled exactly one MPI rank).

So the jsrun command would be:

jsrun -n384 -a1 -c1 -g1 ./bin/xspecfem3D

Now say we want to run 2 MPI ranks on 1 GPU; the jsrun command would be:

jsrun -n192 -a2 -c2 -g1 ./bin/xspecfem3D

If we set the MPS degree to 4 ranks per GPU, then:

jsrun -n96 -a4 -c4 -g1 ./bin/xspecfem3D

More details can be found in the Summit user guide: https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide/#running-jobs
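The pattern generalizes: for a fixed total rank count, the number of resource sets (-n) is the total ranks divided by the ranks per GPU, and -a/-c both equal the ranks per GPU. A minimal helper sketch, assuming the 384-rank example above (the script name jsrun_mps.sh and the variable names are illustrative):

#!/bin/bash
# jsrun_mps.sh (hypothetical helper): print the jsrun command for a given
# MPS degree, i.e. the number of MPI ranks sharing one GPU.
TOTAL_RANKS=384                         # total MPI ranks, as in the example above
RANKS_PER_GPU=${1:-1}                   # MPS degree: 1, 2, or 4 above
NRS=$(( TOTAL_RANKS / RANKS_PER_GPU ))  # number of resource sets (-n)
echo "jsrun -n${NRS} -a${RANKS_PER_GPU} -c${RANKS_PER_GPU} -g1 ./bin/xspecfem3D"

For example, ./jsrun_mps.sh 4 prints the jsrun -n96 -a4 -c4 -g1 command shown above.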

mturilli commented 4 years ago

The RADICAL team has to discuss how to render this in RP.

iparask commented 4 years ago

https://docs.olcf.ornl.gov/systems/summit_user_guide.html#mps

andre-merzky commented 4 years ago

Note from the RP discussion: we will likely express this as GPU threads, capped by the maximum number of ranks that can share a GPU.

wjlei1990 commented 4 years ago

I get an error using pip install radical.ensemblemd with Python 3 in a virtualenv.

(entk) lei@login2 ~/software/summit/virtualenv $ 
pip install radical.ensemblemd
Collecting radical.ensemblemd
  Using cached radical.ensemblemd-0.4.6.tar.gz (100 kB)
    ERROR: Command errored out with exit status 1:
     command: /autofs/nccs-svm1_home1/lei/software/summit/virtualenv/entk/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-q2trqlc0/radical.ensemblemd/setup.py'"'"'; __file__='"'"'/tmp/pip-install-q2trqlc0/radical.ensemblemd/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-q2trqlc0/radical.ensemblemd/pip-egg-info
         cwd: /tmp/pip-install-q2trqlc0/radical.ensemblemd/
    Complete output (6 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-q2trqlc0/radical.ensemblemd/setup.py", line 108
        def visit((prefix, strip, found), dirname, names):
                  ^
    SyntaxError: invalid syntax
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

andre-merzky commented 4 years ago

This looks like a Python version issue: are you using an old version of EnTK with Python 3?

wjlei1990 commented 4 years ago

Hi, I am using pip install to install EnTK directly. I guess I should install from source instead, right?
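For context: the SyntaxError above comes from Python 2-only syntax in that release's setup.py (tuple parameter unpacking in def visit((prefix, strip, found), ...) was removed in Python 3), so radical.ensemblemd 0.4.6 cannot be installed under Python 3 at all. A possible workaround sketch, assuming radical.entk on PyPI is the Python 3-compatible successor package to radical.ensemblemd:

# Sketch: install the current EnTK package into a fresh Python 3
# virtualenv instead of the legacy radical.ensemblemd (assumes
# radical.entk is the Python 3-compatible successor).
python3 -m venv ~/software/summit/virtualenv/entk
source ~/software/summit/virtualenv/entk/bin/activate
pip install radical.entk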

mturilli commented 4 years ago

Not critical; Rutgers to discuss internally how best to support this.

mturilli commented 4 years ago

@wjlei1990 could you provide us with a link to the batch job script and task code you are using with GPU MPS on Summit? This would greatly help us shape our discussion about how to support it in EnTK/RP.

wjlei1990 commented 4 years ago

Example LSF script without GPU MPS:

#!/bin/bash

#BSUB -P GEO111
#BSUB -W 00:30
#BSUB -nnodes 64
#BSUB -J solver
#BSUB -o log.solver.%J

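# 384 resource sets of 1 rank, 1 core, 1 GPU each: 384 GPUs = 64 nodes x 6 GPUs/node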
jsrun -n 384 -a 1 -c 1 -g 1 ./bin/xspecfem3D

Example LSF script with GPU MPS:

#!/bin/bash

#BSUB -P GEO111
#BSUB -W 00:30
#BSUB -nnodes 16
#BSUB -J solver
#BSUB -o log.solver.%J
#BSUB -alloc_flags gpumps

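# 96 resource sets of 4 ranks, 4 cores, 1 GPU each: 384 ranks share 96 GPUs = 16 nodes x 6 GPUs/node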
jsrun -n 96 -a 4 -c 4 -g 1 ./bin/xspecfem3D

The differences are:

  1. #BSUB -alloc_flags gpumps enables GPU MPS.
  2. jsrun -n 96 -a 4 -c 4 -g 1 ./bin/xspecfem3D allows 4 MPI ranks to run on a single GPU card (see the verification sketch below).
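To confirm that MPS is actually active for a job, one option is to look for the NVIDIA MPS control daemon on the compute nodes, running one process per node (a sketch; the daemon process name nvidia-cuda-mps-control is an assumption about Summit's setup):

# Launch one process per node (-r 1) on the 16 nodes of the job above
# and list any running MPS daemon processes.
jsrun -n 16 -r 1 pgrep -l -f nvidia-cuda-mps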

Do you just want the scripts, or do you also want a running example?

mturilli commented 4 years ago

Thank you very much! If sharing a running example requires no significant effort, then yes, that would be useful too.

wjlei1990 commented 4 years ago

> Thank you very much! If sharing a running example requires no significant effort, then yes, that would be useful too.

Hi Matteo, you can find a running example here:

/gpfs/alpine/world-shared/geo111/lei/entk/specfem3d_globe_990cd4

There are 3 LSF scripts using GPU MPS degrees from 1 to 4 (MPI ranks per GPU):

  1. job_solver.bash
  2. job_solver.mps2.bash
  3. job_solver.mps4.bash

I also ran some performance benchmarks with this task:

| GPU MPS degree (MPI ranks per GPU) | Job time (sec) | Core GPU simulation time (sec) |
| --- | --- | --- |
| 1 | 81 | 50 |
| 2 | 133 | 103 |
| 4 | 194 | 162 |

The core GPU simulation time measures only the time marching in the SPECFEM solver. The solver needs extra time to read and set up the mesh; this part runs mainly on the CPU and takes about 30-32 s in our experiments. It stays almost constant across the experiments because we are not really saturating the CPU power on Summit.
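The setup overhead is visible directly in the table as job time minus core simulation time: 81 - 50 = 31 s, 133 - 103 = 30 s, and 194 - 162 = 32 s.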