radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work

gpu/xk nodes settings #22

Closed euhruska closed 6 years ago

euhruska commented 6 years ago

To use OpenMM with GPUs one needs xk nodes. What would be the correct settings in resource_config.rcfg to get xk nodes? Is it simply PILOTSIZE = 256:xk?

vivek-bala commented 6 years ago

Ah, using GPUs requires an unreleased version of RP and a different branch of EnTK. There isn't much documentation on it currently. I'll translate the CPU version into a GPU one. Can you point me to the branch I should work on?

euhruska commented 6 years ago

openmm branch

vivek-bala commented 6 years ago

Can you give me a timeline of when this would be required? Is it ok if this is available on the 29th or so?

euhruska commented 6 years ago

29th is fine

euhruska commented 6 years ago

what does the GPU setup look like?

vivek-bala commented 6 years ago

Added the GPU enabled script now. Take a look. I added comments at the beginning of the script to describe the changes.

Note that you have to use the following branches (the order of installation is important):

radical.utils  - devel
saga           - feature/gpu
radical.pilot  - feature/gpu
radical.entk   - feature/gpu

I have changed the installation instructions in the README.
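For reference, installing in that order with pip would look roughly like this (repository URLs assumed to be the radical-cybertools GitHub ones, as also used later in this thread):

pip install git+https://github.com/radical-cybertools/radical.utils.git@devel
pip install git+https://github.com/radical-cybertools/saga-python.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.pilot.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.entk.git@feature/gpu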

Let me know how it goes.

euhruska commented 6 years ago

should I use 'ncsa.bw' or 'ncsa.bw_aprun'?

vivek-bala commented 6 years ago

I would start with ncsa.bw_aprun

euhruska commented 6 years ago

currently 'ncsa.bw_aprun' gives an error even without gpu, https://github.com/radical-collaboration/extasy-grlsd/issues/44, but I will test once it works again

euhruska commented 6 years ago

how do I have to change the resource description?

2018-02-02 11:07:54,988: radical.entk.resource_manager: MainProcess                     : MainThread     : ERROR   : Failed to validate resource description, error: Error: Mandatory key cpus does not exist in the resource description
Error: Error: Mandatory key cpus does not exist in the resource description
Traceback (most recent call last):
  File "extasy_grlsd.py", line 301, in <module>
    rman = ResourceManager(res_dict)
  File "x/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 48, in __init__
    if self._validate_resource_desc(resource_desc):
  File "x/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 228, in _validate_resource_desc
    raise Error(text='Mandatory key %s does not exist in the resource description'%key)
Error: Error: Mandatory key cpus does not exist in the resource description

vivek-bala commented 6 years ago

The resource dictionary is different for the gpu branch. See the example at the top of the script: https://github.com/radical-collaboration/extasy-grlsd/blob/feature/gpu/extasy_grlsd2.py#L52-L62. You need to explicitly mention the 'cpus' and 'gpus' to acquire.
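For illustration, a resource dictionary on the gpu branch would look roughly like this (key names as in the linked example; the concrete values below are placeholders):

res_dict = {
    'resource'      : 'ncsa.bw_aprun',
    'walltime'      : 60,                # minutes
    'cpus'          : 256,               # number of CPU cores to acquire
    'gpus'          : 16,                # number of GPUs to acquire
    'project'       : 'your_allocation', # placeholder
    'queue'         : 'your_queue',      # placeholder
    'access_schema' : 'gsissh'
}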

euhruska commented 6 years ago

ah, missed the branch

euhruska commented 6 years ago

Do I understand correctly that the openmm task would be assigned 16 cpus and 1 gpu if I don't set cpu_reqs and only set sim_task.gpu_reqs = { 'processes': 1, 'process_type': 'MPI', 'threads_per_process': 16, 'thread_type': 'OpenMP' } ?

euhruska commented 6 years ago

How do I define mq_connection?

2018-02-02 14:38:51,298: radical.entk.appmanager: MainProcess                     : MainThread     : ERROR   : Error setting RabbitMQ system: global name 'mq_connection' is not defined
2018-02-02 14:38:51,298: radical.entk.appmanager: MainProcess                     : MainThread     : ERROR   : Error in AppManager
Traceback (most recent call last):                        
  File "/scratch1/eh22/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 225, in run
    setup = self._setup_mqs()                             
  File "/scratch1/eh22/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 557, in _setup_mqs
    self._mq_channel = mq_connection.channel()            
NameError: global name 'mq_connection' is not defined

vivek-bala commented 6 years ago

NameError: global name 'mq_connection' is not defined

Fixed that. Please pull and reinstall entk.

Do I understand correctly that the openmm task would be assigned 16 cpus and 1 gpu if I don't set cpu_reqs and only set sim_task.gpu_reqs = { 'processes': 1, 'process_type': 'MPI', 'threads_per_process': 16, 'thread_type': 'OpenMP' } ?

No. You have to explicitly mention the cpu reqs.

sim_task.gpu_reqs = { 'processes': 1, 'process_type': 'MPI', 'threads_per_process': 16, 'thread_type': 'OpenMP' } 

means 16 gpus (processes*threads_per_process) for each sim_task. If you require 16 cpus and 1 gpu for your task, your task description needs to look like this:

sim_task.cpu_reqs = { 'processes': 1, 'process_type': None, 'threads_per_process': 16, 'thread_type': 'OpenMP' }  # if the simulation is going to spawn 1 process with 16 threads (i.e. non-mpi)
sim_task.cpu_reqs = { 'processes': 16, 'process_type': 'MPI', 'threads_per_process': 1, 'thread_type': None }     # if the simulation is going to spawn 16 processes with 1 thread each (i.e. mpi)
sim_task.gpu_reqs = { 'processes': 1, 'process_type': None, 'threads_per_process': 1, 'thread_type': None }

Does that help?

euhruska commented 6 years ago

helpful, with openmm I intend to use 1 cpu and 1 gpu
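A minimal sketch of the corresponding task description, following the pattern above (the same 1-cpu/1-gpu settings appear later in this thread):

sim_task.cpu_reqs = { 'processes': 1, 'process_type': None, 'threads_per_process': 1, 'thread_type': None }
sim_task.gpu_reqs = { 'processes': 1, 'process_type': None, 'threads_per_process': 1, 'thread_type': None }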

euhruska commented 6 years ago

2018-02-02 15:09:45,721: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has failed
2018-02-02 15:09:45,739: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Resource request submission failed
2018-02-02 15:09:45,740: radical.entk.appmanager: MainProcess : MainThread : ERROR : Error in AppManager
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 242, in run
    self._resource_manager._submit_resource_request()
  File "/scratch1/eh22/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 368, in _submit_resource_request
    raise Exception
Exception

Looks like an error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory. Attached: rp.session.leonardo.rice.edu.eh22.017564.0000.zip

euhruska commented 6 years ago

is that related to https://github.com/radical-collaboration/extasy-grlsd/issues/44?

euhruska commented 6 years ago

used "ncsa.bw_aprun"

euhruska commented 6 years ago

it seems the submitted job didn't request any GPUs, qstat gives

Resource_List.nodes = 8:ppn=32

euhruska commented 6 years ago

missing the :xk at the end of the line

vivek-bala commented 6 years ago

This is related to https://github.com/radical-cybertools/radical.pilot/issues/1546. I think there are some changes in the python module on BW. It is currently being worked on (https://github.com/radical-cybertools/radical.pilot/pull/1550). I think this issue will be encountered in both the gpu and cpu-only versions we have here.

euhruska commented 6 years ago

my MD (with gpu) unit starts but doesn't do anything. I tried looking at the logfiles but couldn't pin the problem on anything; can someone look at them? rp.session.leonardo.rice.edu.eh22.017576.0001-remote.zip

andre-merzky commented 6 years ago

Several units finish ok, or so it seems:

$ tail -n 1 uni*/*OUT | grep App
Application 65079529 resources: utime ~0s, stime ~1s, Rss ~28408, inblocks ~7297, outblocks ~2289
Application 65079531 resources: utime ~1s, stime ~3s, Rss ~80100, inblocks ~40303, outblocks ~926
Application 65079534 resources: utime ~1s, stime ~3s, Rss ~41180, inblocks ~68272, outblocks ~11

But some also report errors:

$ tail -n 1 uni*/*ERR
==> unit.000009/STDERR <==
TypeError: an integer is required

==> unit.000010/STDERR <==
IOError: [Errno 2] No such file or directory: 'tmpha.gro'

Most importantly though, the application schedules non-mpi units which are larger than a node:

$ grep raise *log
agent_0.scheduling.0.child.log:    raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log:    raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log:    raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log:    raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log:    raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log:    raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log:    raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log:    raise ValueError('Non-mpi unit does not fit onto single node')

HtH, Andre.

euhruska commented 6 years ago

Now that #44 is fixed, I have rerun this. The raise ValueError('Non-mpi unit does not fit onto single node') is the main problem; it happens in units with MD simulations. I believe it can't fit the gpu requirement, since only xe nodes (no GPUs) instead of xk nodes (with GPUs) are requested. I checked with qstat, which gives Resource_List.nodes = 8:ppn=32. Correctly it should be Resource_List.nodes = 8:ppn=32:xk. How do I make sure that xk nodes are requested on bluewaters?

euhruska commented 6 years ago

@vivek-bala @andre-merzky Can you look at this? It looks like the resources are requested in the wrong queue. They should be in the gpu ("xk") queue, but currently only the cpu queue is used.

andre-merzky commented 6 years ago

@euhruska : sorry for the delay on this, I'll try to look into it tomorrow.

euhruska commented 6 years ago

@vivek-bala @andre-merzky any updates?

andre-merzky commented 6 years ago

Hey @euhruska , I finally got around to looking at this. Would you please switch SAGA from feature/gpu to fix/issue_663 and give it a try? It should now add :xk on BW as soon as any GPUs are requested as part of the pilot description.
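Assuming the same pip-based install used elsewhere in this thread, the branch switch would be something like:

pip install --upgrade git+https://github.com/radical-cybertools/saga-python.git@fix/issue_663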

euhruska commented 6 years ago

it still shows Resource_List.nodes = 8:ppn=32 without :xk even though the requested resources in the MD stage are:

              sim_task.gpu_reqs = { 'processes': 1,
                                    'process_type': None,
                                    'threads_per_process': 1,
                                    'thread_type': None
                                }
              sim_task.cpu_reqs = { 'processes': 1,
                                    'process_type': None,
                                    'threads_per_process': 1,
                                    'thread_type': None
                                  }

euhruska commented 6 years ago

Reinstalled all radical components. At first I got:

2018-03-10 20:01:32,236: radical.entk.resource_manager.0000: MainProcess : MainThread : ERROR : Failed to validate resource description, error: Error: Key cores does not exist in the resource description
Error: Error: Key cores does not exist in the resource description
Traceback (most recent call last):
  File "extasy_grlsd.py", line 353, in <module>
    appman.resource_manager = rman
  File "/scratch1/eh22/conda/envs/extasy3-gpu-2/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 158, in resource_manager
    if self._resource_manager._validate_resource_desc(self._sid):
  File "/scratch1/eh22/conda/envs/extasy3-gpu-2/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 188, in _validate_resource_desc
    raise Error(text='Key %s does not exist in the resource description' % key)
Error: Error: Key cores does not exist in the resource description

Added cores beside cpus and gpus; the resource dictionary is now:

          res_dict = {
            'resource': Kconfig.REMOTE_HOST,
            'walltime': Kconfig.WALLTIME,
            'cores': Kconfig.PILOTSIZE,
            'cpus': Kconfig.PILOTSIZE,
            'gpus': Kconfig.PILOTSIZE/32,
            'project': Kconfig.ALLOCATION,
            'queue': Kconfig.QUEUE,
            'access_schema': 'gsissh'
          }

But qstat -f still shows nodes without :xk.

andre-merzky commented 6 years ago

Thanks Eugene. I am back on BW now, too. Let me check.

euhruska commented 6 years ago

remote: rp.session.leonardo.rice.edu.eh22.017601.0004.zip

andre-merzky commented 6 years ago

Eugene - the following stack (edited) should get the pilots submitted with :xk flagged when GPUs are requested:

$ rs

  python               : 2.7.14
  pythonpath           : /opt/xalt/0.7.6/sles11.3/libexec
  virtualenv           : /mnt/a/u/sciteam/merzky/radical/radical.pilot/ve

  radical.pilot        : 0.47-0.47-152-g44010335@feature-gpu
  radical.utils        : 0.47.1-v0.47.1-14-gdfd3df5@devel
  saga                 : 0.47-v0.46-71-g207b66e3@fix-issue_663

For testing, I used examples/misc/gpu_pilot.py, where I set pd.gpus = 2. SAGA was setting ppn=32; that has now been corrected to ppn=16, which seems to be required for XK nodes.
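For context, a minimal sketch of how such a GPU request is expressed in a pilot description on that stack (class and attribute names assumed from the RP API of that time and from pd.gpus above; this is not the literal content of examples/misc/gpu_pilot.py):

import radical.pilot as rp

session = rp.Session()
pmgr    = rp.PilotManager(session=session)

pd = rp.ComputePilotDescription()
pd.resource = 'ncsa.bw_aprun'    # Blue Waters via aprun
pd.runtime  = 30                 # minutes
pd.cores    = 32                 # CPU cores requested
pd.gpus     = 2                  # a non-zero GPU request should add ':xk' to the node request
pd.project  = 'your_allocation'  # placeholder allocation

pilot = pmgr.submit_pilots(pd)
# ... run units via a UnitManager, then tear down
session.close()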

euhruska commented 6 years ago

is the radical.pilot branch feature/gpu?

euhruska commented 6 years ago

I can't find or activate /mnt/a/u/sciteam/merzky/radical/radical.pilot/ve.

euhruska commented 6 years ago

when installing these branches myself I still get nodes without :xk

andre-merzky commented 6 years ago

Yes, that would need the feature/GPU stack in RP, as the pilot can otherwise not express the need for GPUs. Can you send me the output of radical-stack, please? In the test you used, can you please check that the pilot indeed requests GPUs? If it does, please run again with these settings:

export RADICAL_SAGA_VERBOSE=debug
export RADICAL_SAGA_PTY_VERBOSE=debug
export RADICAL_SAGA_LOG_TGT=rs.log
export RADICAL_SAGA_PTY_LOG_TGT=rs.log

and attach the resulting rs.log, and the script you are running. Thanks!

euhruska commented 6 years ago

radical-stack:


  python               : 2.7.14
  pythonpath           :
  virtualenv           : extasy3-gpu-3

  radical.analytics    : v0.45.2-101-g8358b08@devel
  radical.entk         : 0.6.1-entk-0.5-306-ga1887fc@HEAD-detached-at-a1887fc
  radical.pilot        : 0.47-0.47-118-gf66e2f6d@feature-gpu
  radical.utils        : 0.47.1-v0.47.1-14-gdfd3df5@devel
  saga                 : 0.47-v0.46-71-g207b66e3@HEAD-detached-at-207b66e3

It does not request gpus; at least qstat -f does not show that.

euhruska commented 6 years ago

the script I'm running is https://github.com/radical-collaboration/extasy-grlsd/blob/master/extasy_grlsd.py with python extasy_grlsd.py --Kconfig settings_ala12-gpu.wcfg

andre-merzky commented 6 years ago

@vivek-bala : Vivek, I don't see a GPU mentioned in that settings file. Can you confirm if or if not the pilot will request GPUs in this case? @euhruska : can you please attach either the SAGA logs, or pmgr.0000.launching.0.child.log from the RP session directory? Thanks!

vivek-bala commented 6 years ago

Hey Eugen, it seems like you are using the latest master of EnTK. You have to use the latest feature/gpu branch in EnTK.

vivek-bala commented 6 years ago

Hey Andre, yes, I do see the use_gpus variable in the config file. I think it was a mismatch on the branches. IIUC, Eugen needs to always use the feature/gpu branch of EnTK, RP and SAGA, and devel of utils.

andre-merzky commented 6 years ago

settings_ala12-gpu.wcfg does not exist in the GPU branch though. Can you help Eugene to make sure he uses a suitable config file to trigger the GPU allocation request? Thanks!

euhruska commented 6 years ago

I'm not using the gpu branch of extasy-grlsd, updating the master

euhruska commented 6 years ago

ok, will reinstall EnTK, RP and saga

vivek-bala commented 6 years ago

Oh yea, I was looking at master since the script linked above is from master. Eugen, try the stack I mentioned with the gpu-compatible branch/files in extasy-grlsd. Let us know how it goes.

euhruska commented 6 years ago

In the meantime, here are rs.log and pmgr.0000.launching.0.child.log from the run with the last radical-stack: pmgr.0000.launching.0.child.log rs.log

euhruska commented 6 years ago

but radical-stack shows some components with a detached HEAD, is that correct?

radical-stack

  python               : 2.7.14
  pythonpath           :
  virtualenv           : extasy3-gpu-3

  radical.analytics    : v0.45.2-101-g8358b08@devel
  radical.entk         : 0.6.1-entk-0.5-310-g301af1d@HEAD-detached-at-301af1d
  radical.pilot        : 0.47-0.47-152-g44010335@HEAD-detached-at-44010335
  radical.utils        : 0.47.1-v0.47.1-14-gdfd3df5@devel
  saga                 : 0.47-v0.46-70-g23bd68a3@HEAD-detached-at-23bd68a3

I used:

pip install git+https://github.com/radical-cybertools/radical.utils.git@devel
pip install git+https://github.com/radical-cybertools/saga-python.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.pilot.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.entk.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.analytics@devel

euhruska commented 6 years ago

the above stack still gives no :xk nodes, are the above commands correct?