Ah, using GPUs requires an unreleased version of RP and a different branch of EnTK. There isn't much documentation on it yet. I'll translate the CPU version into a GPU one. Can you point me to the branch I should work on?
openmm branch
Can you give me a timeline of when this would be required? Is it ok if this is available on the 29th or so?
29th is fine
What does the GPU setup look like?
Added the GPU enabled script now. Take a look. I added comments at the beginning of the script to describe the changes.
Note that you have to use the following branches (the order of installation is important):
radical.utils - devel
saga - feature/gpu
radical.pilot - feature/gpu
radical.entk - feature/gpu
I have changed the installation instructions in the README.
Let me know how it goes.
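For convenience, a sketch of installing that stack in the order listed above (the pip-from-GitHub form mirrors the commands quoted later in this thread; adjust to your own environment):

pip install git+https://github.com/radical-cybertools/radical.utils.git@devel
pip install git+https://github.com/radical-cybertools/saga-python.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.pilot.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.entk.git@feature/gpu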
should I use 'ncsa.bw' or 'ncsa.bw_aprun'?
I would start with ncsa.bw_aprun
Currently 'ncsa.bw_aprun' gives an error even without GPUs (https://github.com/radical-collaboration/extasy-grlsd/issues/44), but I will test once it works again.
how do I have to change the resource description?
2018-02-02 11:07:54,988: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Failed to validate resource description, error: Error: Mandatory key cpus does not exist in the resource description
Error: Error: Mandatory key cpus does not exist in the resource description
Traceback (most recent call last):
  File "extasy_grlsd.py", line 301, in <module>
    rman = ResourceManager(res_dict)
  File "x/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 48, in __init__
    if self._validate_resource_desc(resource_desc):
  File "x/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 228, in _validate_resource_desc
    raise Error(text='Mandatory key %s does not exist in the resource description'%key)
Error: Error: Mandatory key cpus does not exist in the resource description
The resource dictionary is different for the gpu branch. See the example at the top of the script: https://github.com/radical-collaboration/extasy-grlsd/blob/feature/gpu/extasy_grlsd2.py#L52-L62. You need to explicitly mention the 'cpus' and 'gpus' to acquire.
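For illustration, a sketch of such a resource dictionary (the values here are placeholders, not taken from the linked example):

res_dict = {
    'resource':      'ncsa.bw_aprun',
    'walltime':      60,        # minutes
    'cpus':          32,        # CPU cores to acquire
    'gpus':          1,         # GPUs to acquire
    'project':       '<allocation>',
    'queue':         '<queue>',
    'access_schema': 'gsissh'
}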
ah, missed the branch
Do I understand correctly that the OpenMM task will be assigned 16 CPUs and 1 GPU if I don't set cpu_reqs and only set sim_task.gpu_reqs = { 'processes': 1, 'process_type': 'MPI', 'threads_per_process': 16, 'thread_type': 'OpenMP' } ?
How do I define mq_connection?
2018-02-02 14:38:51,298: radical.entk.appmanager: MainProcess : MainThread : ERROR : Error setting RabbitMQ system: global name 'mq_connection' is not defined
2018-02-02 14:38:51,298: radical.entk.appmanager: MainProcess : MainThread : ERROR : Error in AppManager
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 225, in run
    setup = self._setup_mqs()
  File "/scratch1/eh22/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 557, in _setup_mqs
    self._mq_channel = mq_connection.channel()
NameError: global name 'mq_connection' is not defined
NameError: global name 'mq_connection' is not defined
Fixed that. Please pull and reinstall entk.
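Something like the following should pick up the fix (a guess at the reinstall command, reusing the repo/branch from the installation instructions above):

pip install --upgrade --force-reinstall git+https://github.com/radical-cybertools/radical.entk.git@feature/gpu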
Do I understand correctly that the OpenMM task will be assigned 16 CPUs and 1 GPU if I don't set cpu_reqs and only set sim_task.gpu_reqs = { 'processes': 1, 'process_type': 'MPI', 'threads_per_process': 16, 'thread_type': 'OpenMP' } ?
No. You have to explicitly mention the cpu reqs.
sim_task.gpu_reqs = { 'processes': 1, 'process_type': 'MPI', 'threads_per_process': 16, 'thread_type': 'OpenMP' }
means 16 GPUs (processes * threads_per_process) for each sim_task. If you require 16 CPUs and 1 GPU for your task, your task description needs to look like:
sim_task.cpu_reqs = { 'processes': 1, 'process_type': None, 'threads_per_process': 16, 'thread_type': 'OpenMP' } # if the simulation is going to spawn 1 process with 16 threads (i.e. non-mpi)
sim_task.cpu_reqs = { 'processes': 16, 'process_type': 'MPI', 'threads_per_process': 1, 'thread_type': None } # if the simulation is going to spawn 16 processes with 1 thread each (i.e. mpi)
sim_task.gpu_reqs = { 'processes': 1, 'process_type': None, 'threads_per_process': 1, 'thread_type': None}
Does that help?
That helps. With OpenMM I intend to use 1 CPU and 1 GPU.
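For that case, a minimal sketch of the task setup (following the cpu_reqs/gpu_reqs format described above; the Task import and surrounding code are assumptions, not taken from the actual script):

from radical.entk import Task

sim_task = Task()
# one single-threaded, non-MPI CPU process ...
sim_task.cpu_reqs = {'processes': 1, 'process_type': None,
                     'threads_per_process': 1, 'thread_type': None}
# ... plus one GPU for that process
sim_task.gpu_reqs = {'processes': 1, 'process_type': None,
                     'threads_per_process': 1, 'thread_type': None}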
2018-02-02 15:09:45,721: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has failed
2018-02-02 15:09:45,739: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Resource request submission failed
2018-02-02 15:09:45,740: radical.entk.appmanager: MainProcess : MainThread : ERROR : Error in AppManager
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 242, in run
    self._resource_manager._submit_resource_request()
  File "/scratch1/eh22/conda/envs/extasy3-gpu/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 368, in _submit_resource_request
    raise Exception
Exception
Looks like "error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory"
rp.session.leonardo.rice.edu.eh22.017564.0000.zip
Is that related to https://github.com/radical-collaboration/extasy-grlsd/issues/44?
I used "ncsa.bw_aprun".
It seems the submitted job didn't request any GPUs. qstat gives
Resource_List.nodes = 8:ppn=32
which is missing the :xk at the end of the line.
This is related to https://github.com/radical-cybertools/radical.pilot/issues/1546. I think there are some changes in the python module on BW. It is currently being worked on (https://github.com/radical-cybertools/radical.pilot/pull/1550). I think this issue will be encountered in both the gpu and cpu-only versions we have here.
My MD unit (with GPU) starts but doesn't do anything. I tried looking at the logfiles but couldn't pin the problem on anything; can someone look at them? rp.session.leonardo.rice.edu.eh22.017576.0001-remote.zip
Several units finish ok, or so it seems:
$ tail -n 1 uni*/*OUT | grep App
Application 65079529 resources: utime ~0s, stime ~1s, Rss ~28408, inblocks ~7297, outblocks ~2289
Application 65079531 resources: utime ~1s, stime ~3s, Rss ~80100, inblocks ~40303, outblocks ~926
Application 65079534 resources: utime ~1s, stime ~3s, Rss ~41180, inblocks ~68272, outblocks ~11
But some also report errors:
$ tail -n 1 uni*/*ERR
==> unit.000009/STDERR <==
TypeError: an integer is required
==> unit.000010/STDERR <==
IOError: [Errno 2] No such file or directory: 'tmpha.gro'
Most importantly though, the application schedules non-mpi units which are larger than a node:
$ grep raise *log
agent_0.scheduling.0.child.log: raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log: raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log: raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log: raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log: raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log: raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log: raise ValueError('Non-mpi unit does not fit onto single node')
agent_0.scheduling.0.child.log: raise ValueError('Non-mpi unit does not fit onto single node')
HtH, Andre.
Now that #44 is fixed, I have rerun this. The raise ValueError('Non-mpi unit does not fit onto single node')
is the main problem; it happens in units with MD simulations.
I believe it can't fit the GPU requirement, since only XE nodes (no GPUs) instead of XK nodes (with GPUs) are requested. I checked with qstat, which gives
Resource_List.nodes = 8:ppn=32
Correctly it should be
Resource_List.nodes = 8:ppn=32:xk
How do I make sure that XK nodes are requested on Blue Waters?
@vivek-bala @andre-merzky Can you look at this? It looks like the resources are requested in the wrong queue: it should be the GPU ("xk") queue, but currently it is only the CPU queue.
@euhruska : sorry for the delay on this, I'll try to look into it tomorrow.
@vivek-bala @andre-merzky any updates?
Hey @euhruska, I finally got around to looking at this. Would you please switch SAGA from feature/gpu to fix/issue_663 and give it a try? It should now add :xk on BW as soon as any GPUs are requested as part of the pilot description.
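Something like this should do the switch (assuming the same saga-python repository used elsewhere in this thread):

pip install --upgrade git+https://github.com/radical-cybertools/saga-python.git@fix/issue_663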
It still shows Resource_List.nodes = 8:ppn=32 without :xk, even though the requested resources in the MD stage are:
sim_task.gpu_reqs = { 'processes': 1,
'process_type': None,
'threads_per_process': 1,
'thread_type': None
}
sim_task.cpu_reqs = { 'processes': 1,
'process_type': None,
'threads_per_process': 1,
'thread_type': None
}
Reinstalled all radical components; at first I got:
2018-03-10 20:01:32,236: radical.entk.resource_manager.0000: MainProcess : MainThread : ERROR : Failed to validate resource description, error: Error: Key cores does not exist in the resource description
Error: Error: Key cores does not exist in the resource description
Traceback (most recent call last):
  File "extasy_grlsd.py", line 353, in <module>
    appman.resource_manager = rman
  File "/scratch1/eh22/conda/envs/extasy3-gpu-2/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 158, in resource_manager
    if self._resource_manager._validate_resource_desc(self._sid):
  File "/scratch1/eh22/conda/envs/extasy3-gpu-2/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 188, in _validate_resource_desc
    raise Error(text='Key %s does not exist in the resource description' % key)
Error: Error: Key cores does not exist in the resource description
I added cores alongside cpus and gpus; now:
res_dict = {
'resource': Kconfig.REMOTE_HOST,
'walltime': Kconfig.WALLTIME,
'cores': Kconfig.PILOTSIZE,
'cpus': Kconfig.PILOTSIZE,
'gpus': Kconfig.PILOTSIZE/32,
'project': Kconfig.ALLOCATION,
'queue': Kconfig.QUEUE,
'access_schema': 'gsissh'
}
But qstat -f still shows nodes without :xk.
Thanks Eugene. I am back on BW now, too. Let me check.
Eugene - the following stack (edited) should get the pilots submitted with :xk flagged when GPUs are requested:
$ rs
python : 2.7.14
pythonpath : /opt/xalt/0.7.6/sles11.3/libexec
virtualenv : /mnt/a/u/sciteam/merzky/radical/radical.pilot/ve
radical.pilot : 0.47-0.47-152-g44010335@feature-gpu
radical.utils : 0.47.1-v0.47.1-14-gdfd3df5@devel
saga : 0.47-v0.46-71-g207b66e3@fix-issue_663
For testing, I used examples/misc/gpu_pilot.py, where I set pd.gpus = 2. SAGA was setting ppn=32; that is now corrected to ppn=16, which seems required for XK nodes.
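For reference, the relevant part of such a test is a pilot description that requests GPUs; a rough sketch (attribute names other than gpus reflect my recollection of the RP 0.47-era API, so treat them as assumptions):

import radical.pilot as rp

pd = rp.ComputePilotDescription()
pd.resource = 'ncsa.bw_aprun'
pd.runtime  = 30                 # minutes
pd.cores    = 32
pd.gpus     = 2                  # requesting GPUs is what should trigger the ':xk' suffix
pd.project  = '<allocation>'
pd.queue    = '<queue>'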
Is the radical.pilot branch feature/gpu? I can't find or activate /mnt/a/u/sciteam/merzky/radical/radical.pilot/ve. When installing these branches myself, I still get nodes without :xk.
Yes, that would need the feature/gpu stack in RP, as the pilot can otherwise not express the need for GPUs. Can you send me the output of radical-stack, please? In the test you used, can you please check that the pilot indeed requests GPUs? If it does, please run again with these settings:
export RADICAL_SAGA_VERBOSE=debug
export RADICAL_SAGA_PTY_VERBOSE=debug
export RADICAL_SAGA_LOG_TGT=rs.log
export RADICAL_SAGA_PTY_LOG_TGT=rs.log
and attach the resulting rs.log and the script you are running. Thanks!
radical-stack:
python : 2.7.14
pythonpath :
virtualenv : extasy3-gpu-3
radical.analytics : v0.45.2-101-g8358b08@devel
radical.entk : 0.6.1-entk-0.5-306-ga1887fc@HEAD-detached-at-a1887fc
radical.pilot : 0.47-0.47-118-gf66e2f6d@feature-gpu
radical.utils : 0.47.1-v0.47.1-14-gdfd3df5@devel
saga : 0.47-v0.46-71-g207b66e3@HEAD-detached-at-207b66e3
It does not request GPUs; at least qstat -f does not show that.
The script I'm running is https://github.com/radical-collaboration/extasy-grlsd/blob/master/extasy_grlsd.py with python extasy_grlsd.py --Kconfig settings_ala12-gpu.wcfg
@vivek-bala : Vivek, I don't see a GPU mentioned in that settings file. Can you confirm whether or not the pilot will request GPUs in this case?
@euhruska : can you please attach either the SAGA logs or pmgr.0000.launching.0.child.log from the RP session directory? Thanks!
Hey Eugen, it seems like you are using the latest master of EnTK. You have to use the latest feature/gpu branch in EnTK.
Hey Andre, yes, I do see the use_gpus variable in the config file. I think it was a mismatch on the branches. IIUC: Eugen needs to use (always) the feature/gpu branch of EnTK, RP and SAGA, and devel of utils.
settings_ala12-gpu.wcfg does not exist in the GPU branch though. Can you help Eugene make sure he uses a suitable config file to trigger the GPU allocation request? Thanks!
I'm not using the gpu branch of extasy-grlsd; I'm updating master.
OK, I will reinstall EnTK, RP and SAGA.
Oh yea, I was looking at master since the script linked above is from master. Eugen, try the stack I mentioned with the gpu-compatible branch/files in extasy-grlsd. Let us know how it goes.
In the meantime, here are rs.log and pmgr.0000.launching.0.child.log from before, with the last radical-stack: pmgr.0000.launching.0.child.log rs.log
But radical-stack shows some HEADs detached; is that correct?
radical-stack
python : 2.7.14
pythonpath :
virtualenv : extasy3-gpu-3
radical.analytics : v0.45.2-101-g8358b08@devel
radical.entk : 0.6.1-entk-0.5-310-g301af1d@HEAD-detached-at-301af1d
radical.pilot : 0.47-0.47-152-g44010335@HEAD-detached-at-44010335
radical.utils : 0.47.1-v0.47.1-14-gdfd3df5@devel
saga : 0.47-v0.46-70-g23bd68a3@HEAD-detached-at-23bd68a3
I used:
pip install git+https://github.com/radical-cybertools/radical.utils.git@devel
pip install git+https://github.com/radical-cybertools/saga-python.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.pilot.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.entk.git@feature/gpu
pip install git+https://github.com/radical-cybertools/radical.analytics@devel
The above stack still gives no :xk nodes; are the above commands correct?
To use OpenMM with GPUs one needs XK nodes. What would be the correct setting in resource_config.rcfg to get XK nodes? Is it simply PILOTSIZE = 256:xk?