Closed jro1234 closed 6 years ago
Can you please send the output of radical-stack? You should be on devel or project/adaptivemd (they are the same right now) on saga-python and radical.utils, too.
I am actually on the project/adaptivemd_gpu_am branch; I will try again in a minute here after switching to project/adaptivemd. I'm pretty sure in my slew of tests I did use the 'plain' adaptivemd branch and encountered the same error, but am not 100% on that.
python : 2.7.9
pythonpath : /sw/titan/.swci/0-login/opt/spack/20180315/linux-suse_linux11-x86_64/gcc-4.3.4/python-2.7.9-v6ctjewwdx6k2qs7ublexz7gnx457jo5/lib/python2.7/site-packages:/sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
virtualenv : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv
radical.analytics : v0.45.2-101-g8358b08@devel
radical.pilot : 0.47.10-v0.47.10-178-g769972a3@project-adaptivemd_gpu_am
radical.utils : 0.47-v0.47-5-g4573dd7@project-adaptivemd
saga : 0.47-v0.46-38-g8b602364@project-adaptivemd
It is the same error with this stack. I made a typo when fetching logs last time, so here's updated logs too. Thanks!
python : 2.7.9
pythonpath : /sw/titan/.swci/0-login/opt/spack/20180315/linux-suse_linux11-x86_64/gcc-4.3.4/python-2.7.9-v6ctjewwdx6k2qs7ublexz7gnx457jo5/lib/python2.7/site-packages:/sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
virtualenv : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv
radical.analytics : v0.45.2-101-g8358b08@devel
radical.pilot : 0.47.5-v0.47.5-170-g747b12d1@project-adaptivemd
radical.utils : 0.47-v0.47-5-g4573dd7@project-adaptivemd
saga : 0.47-v0.46-38-g8b602364@project-adaptivemd
@andre-merzky Any updates here?
Hey @jrossyra, sorry for the late update, it's a slow week over here... On what resource are you using RP for those experiments?
Thanks, Andre.
It's on Titan, I understand about the lag! It's the titan_orte config.
Andre, I also wanted to mention that when I chose which branch to originally test on, I believe I was looking at the latest commit. I just double checked, and project/adaptivemd_gpu_am is the most recently updated branch. Although as mentioned above, I'm getting the same error in the case of each branch.
I think you are on the right branch in SAGA, but on an outdated commit (8b602364). Can you please try upgrading all three repos and see if that solves the problem?
Your first stack pasted above is correct in terms of branches (RP: project/adaptivemd_gpu_am; RS and RU: project/adaptivemd).
That's embarrassing :\
Just to confirm, the install order is 1) utils, 2) saga, 3) pilot right?
It looks like there's a new dependency, the bitarray package. Since the SSL upgrade to PyPI there is no way to install packages with pip on Titan using their python module, so I can't get the packages to import. I've been treading as lightly as possible since then because I'm trying to avoid breaking my setup; if there's a way I can work around this, let me know. I'm not fancy enough to have an idea how to work around this one, so I've been using --no-deps and --no-index just to get my repos to reinstall at all, otherwise I error out.
Yes, the order of the packages is correct. I ran into the same error on Titan; you need to use the python_anaconda module. I have tested it with RP and it seems to be working as intended.
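That reinstall sequence can be sketched as follows (the repository directory names here are assumptions, not the actual checkout paths; the loop only echoes the commands so the ordering is explicit):

```shell
# Reinstall the stack bottom-up: each layer depends on the one before it
# (radical.utils -> saga-python -> radical.pilot).
# --no-deps / --no-index keep pip away from PyPI (broken by the SSL upgrade).
for repo in radical.utils saga-python radical.pilot; do
    echo "pip install --no-deps --no-index --upgrade ./$repo"
done
```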
Sorry for that - the bitarray dependency is now fixed, please give it another try... Thanks!
Reinstalling now. Had to save some changes to my execution scripts, hopefully reinstall works smoothly under the anaconda module and I'll post back soon.
I have a different attribute failure with the gpu attribute deactivated (not passed to the ComputePilotDescription). I ran a second test with the gpus attribute, and the mpi attribute error came up first.
Hey John - thanks for the logs! I'm afraid I am not really sure what to make of them though: I only see one attribute related message, which is:
Skipping adaptor saga.adaptors.context.myproxy: loading failed: ''Adaptor' object has no attribute 'opts''
which is a warning and indeed should not affect your run. But the pilot seems to be canceled, so I assume there was an error somewhere. Can you please also attach the stdout / stderr of your application, or any other information you have about the failure mode? Please also attach the output of radical-stack once more.
Thanks!
Yes, I should have done these things!
python : 2.7.12
pythonpath : /sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
virtualenv : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv
radical.pilot : 0.47.10-v0.47.10-178-g769972a3@project-adaptivemd_gpu_am
radical.utils : 0.47.4-merge-pre_gpu-20-g3e0240f@project-adaptivemd
saga : 0.47.3-merge-pre_gpu-31-g8af8a223@project-adaptivemd
I didn't realize the error didn't propagate; here's the output on the screen. I'll go hunting for more, just wanted to get this over quickly.
new session: [rp.session.titan-ext1.jrossyra.017662.0001] \
database : [mongodb://160.91.205.198:27017/rp] ok
create pilot manager ok
submit 1 pilot(s)
. ok
create unit manager ok
add 1 pilot(s) ok
Traceback (most recent call last):
File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/adaptivemd/rp/utils.py", line 246, in create_cud_from_task_def
cud = generate_trajectorygenerationtask_cud(task_desc, db, shared_path, project)
File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/adaptivemd/rp/utils.py", line 386, in generate_trajectorygenerationtask_cud
cud.mpi = False
File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py", line 2643, in __setattr__
return self._attributes_i_set (key, val, flow=self._DOWN)
File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py", line 1068, in _attributes_i_set
raise se.IncorrectState ("attribute set is not extensible/private (key %s)" % key)
IncorrectState: attribute set is not extensible/private (key mpi) (/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py +1068 (_attributes_i_set) : raise se.IncorrectState ("attribute set is not extensible/private (key %s)" % key))
2018-05-11 14:09:11,047: client.rp : Process-2 : MainThread : ERROR : Client process failed, error: Error: attribute set is not extensible/private (key mpi) (/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py +1068 (_attributes_i_set) : raise se.IncorrectState ("attribute set is not extensible/private (key %s)" % key))
Traceback (most recent call last):
File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/adaptivemd/rp/client.py", line 167, in _runme
cuds = create_cud_from_task_def(task_descs, self._db, resource_desc_for_pilot['shared_path'], self._project)
File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/adaptivemd/rp/utils.py", line 259, in create_cud_from_task_def
raise Error(msg=ex)
Error: Error: attribute set is not extensible/private (key mpi) (/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py +1068 (_attributes_i_set) : raise se.IncorrectState ("attribute set is not extensible/private (key %s)" % key))
wait for 1 pilot(s)
ok
closing session rp.session.titan-ext1.jrossyra.017662.0001 \
close unit manager ok
close pilot manager \
wait for 1 pilot(s)
timeout
ok
+ rp.session.titan-ext1.jrossyra.017662.0001 (json)
- pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 295.6s ok
Ah, that already helps, and points to a problem: RP indeed should support the mpi attribute for backward compatibility! This is an easy fix fortunately - brb :-) Thanks!
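The fix presumably amounts to translating the deprecated keys before the attribute set is validated. A standalone sketch of that idea (the alias table and function are illustrations, not RP's actual code; in particular, mapping the old boolean mpi flag onto a process-type string is an assumption):

```python
# Hypothetical sketch: rename deprecated ComputeUnitDescription keys to
# their current equivalents, mirroring the deprecation warnings RP prints
# ("attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'").
DEPRECATED_ALIASES = {
    'cores': 'cpu_processes',
    'mpi'  : 'cpu_process_type',
}

def translate_deprecated(descr):
    """Return a copy of `descr` with deprecated keys renamed."""
    out = {}
    for key, val in descr.items():
        new_key = DEPRECATED_ALIASES.get(key, key)
        if key != new_key:
            print("attribute key '%s' is deprecated - use '%s'" % (key, new_key))
            if key == 'mpi':
                # assumed mapping: old boolean flag becomes a process type
                val = 'MPI' if val else 'POSIX'
        out[new_key] = val
    return out
```

Translating up front like this keeps the strict attribute interface (which raised the IncorrectState above) from ever seeing the legacy key.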
Ok very good! I also just noticed it was at the actual line where the attribute is set, which I should have seen before. Thanks!
This is pushed now (to RP).
Testing now, thanks! (looks like I should switch to the devel branch, I'll stay there unless you indicate otherwise)
It looks like all the objects are happy to exist now, however the CUs don't progress past the AGENT_SCHEDULING_PENDING state. Here are the logs: rp.session.tar.gz
And just a snippet of printout, including from my unit callback function:
new session: [rp.session.titan-ext7.jrossyra.017662.0006] \
database : [mongodb://160.91.205.208:27017/rp] ok
create pilot manager ok
{'cores': 192,
'gpus': 11,
'project': 'bip149',
'queue': 'batch',
'resource': 'ornl.titan_orte',
'runtime': 15}
submit 1 pilot(s)
. ok
create unit manager ok
add 1 pilot(s) ok
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
[<radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c58790>,
<radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c588d0>,
<radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c58b10>,
<radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c58d10>,
<radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c58f10>]
submit 5 unit(s)
..... ok
CALLBACK state: unit.000000 UMGR_SCHEDULING_PENDING
CALLBACK state: unit.000001 UMGR_SCHEDULING_PENDING
CALLBACK state: unit.000002 UMGR_SCHEDULING_PENDING
CALLBACK state: unit.000003 UMGR_SCHEDULING_PENDING
CALLBACK state: unit.000004 UMGR_SCHEDULING_PENDING
CALLBACK state: unit.000000 UMGR_SCHEDULING
CALLBACK state: unit.000001 UMGR_SCHEDULING
CALLBACK state: unit.000002 UMGR_SCHEDULING
CALLBACK state: unit.000003 UMGR_SCHEDULING
CALLBACK state: unit.000004 UMGR_SCHEDULING
CALLBACK state: unit.000000 UMGR_STAGING_INPUT_PENDING
CALLBACK state: unit.000001 UMGR_STAGING_INPUT_PENDING
CALLBACK state: unit.000002 UMGR_STAGING_INPUT_PENDING
CALLBACK state: unit.000003 UMGR_STAGING_INPUT_PENDING
CALLBACK state: unit.000004 UMGR_STAGING_INPUT_PENDING
CALLBACK state: unit.000000 UMGR_STAGING_INPUT
CALLBACK state: unit.000001 UMGR_STAGING_INPUT
CALLBACK state: unit.000002 UMGR_STAGING_INPUT
CALLBACK state: unit.000003 UMGR_STAGING_INPUT
CALLBACK state: unit.000004 UMGR_STAGING_INPUT
CALLBACK state: unit.000000 AGENT_STAGING_INPUT_PENDING
CALLBACK state: unit.000001 AGENT_STAGING_INPUT_PENDING
CALLBACK state: unit.000002 AGENT_STAGING_INPUT_PENDING
CALLBACK state: unit.000003 AGENT_STAGING_INPUT_PENDING
CALLBACK state: unit.000004 AGENT_STAGING_INPUT_PENDING
CALLBACK state: unit.000000 AGENT_STAGING_INPUT
CHECKER is sleeping
CALLBACK state: unit.000003 AGENT_STAGING_INPUT
CALLBACK state: unit.000003 AGENT_SCHEDULING_PENDING
CALLBACK state: unit.000002 AGENT_STAGING_INPUT
CALLBACK state: unit.000002 AGENT_SCHEDULING_PENDING
CALLBACK state: unit.000001 AGENT_STAGING_INPUT
CALLBACK state: unit.000001 AGENT_SCHEDULING_PENDING
CALLBACK state: unit.000000 AGENT_SCHEDULING_PENDING
CALLBACK state: unit.000004 AGENT_STAGING_INPUT
CALLBACK state: unit.000004 AGENT_SCHEDULING_PENDING
CHECKER is sleeping
CHECKER is sleeping
CHECKER is sleeping
...
CHECKER is sleeping
2018-05-11 16:35:39,100: resource_manager.rp : Process-2 : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has failed
That is at the end of the PBS job. My workflow persists past this, so I did a keyboard interrupt after that. Let me know what other output would be useful. Stack for the record:
python : 2.7.12
pythonpath : /sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
virtualenv : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv
radical.analytics : 0.47.0
radical.pilot : 0.47.10-v0.47.10-189-gf248cfa5@project-adaptivemd_gpu_am
radical.utils : 0.47.4-merge-pre_gpu-20-g3e0240f@project-adaptivemd
saga : 0.47.3-merge-pre_gpu-31-g8af8a223@project-adaptivemd
Hmm, this is strange: the pilot agent seems to come up all right. It pulls the unit and seems to go through the input staging part, too - but then nothing. There seem to be some logfiles missing, specifically the agent_0.scheduling.0.child.log, but also others, and there are also no .out and .err files in the pilot sandbox - so it's really hard to tell what's happening. Do you have any idea why those logfiles might be missing?
I am not sure... I will inspect everything more closely, and I can upload more from the agent directory.
Yes, please do pack the agent sandbox - thanks!
I noticed that I had my environment sourcing commented out in the bashrc, so I ran another workflow with this fixed. This has always been a fixture of my platform. However, it looks like the same (lack of) errors; here are the session and agent folders. The agent_0 scheduling file is still missing. (identical stack) pilot.tar.gz rp.session.tar.gz
Thanks John - we are finally looking at differences between Mongo 2 and 3, which is what the branch was all about :-) I pushed another commit to RP which should avoid this problem (a failing update call stalled the agent).
@andre-merzky I have a different version of the workflow logs in the slack, however I thought I'd run with debug on in case that helps troubleshoot the current issue. The error is still that the scheduler sees each CU as not fitting on a single node. Let me know if there's any further info I can provide to troubleshoot.
Traceback (most recent call last):
File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1264, in work_cb
self._workers[state](things)
File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/scheduler/base.py", line 425, in _schedule_units
if self._try_allocation(unit):
File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/scheduler/base.py", line 449, in _try_allocation
unit['slots'] = self._allocate_slot(unit['description'])
File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/scheduler/continuous.py", line 129, in _allocate_slot
slots = self._alloc_nompi(cud)
File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/scheduler/continuous.py", line 291, in _alloc_nompi
raise ValueError('Non-mpi unit does not fit onto single node')
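For context, the check that raises this error boils down to comparing the unit's requested cores and GPUs against one node's capacity. A simplified standalone sketch (the node sizes and function are illustrative, not RP's actual continuous scheduler):

```python
# Simplified version of the "Non-mpi unit does not fit onto single node"
# check: a non-MPI unit must fit entirely on one node.
# Titan compute nodes have 16 cores and 1 GPU; these numbers are assumptions.
CORES_PER_NODE = 16
GPUS_PER_NODE  = 1

def fits_on_single_node(cud):
    """Return True if the unit's request fits within one node."""
    requested_cores = cud.get('cpu_processes', 1) * cud.get('cpu_threads', 1)
    requested_gpus  = cud.get('gpu_processes', 0)
    return (requested_cores <= CORES_PER_NODE and
            requested_gpus  <= GPUS_PER_NODE)
```

With the CUD below (1 core, 1 GPU) such a check would pass, which suggests the request was being mis-counted somewhere rather than being genuinely oversized.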
I have these attributes for the CUs. I've tried setting the cpu_process_type, but it seems it always reverts back to 'False':
'cpu_process_type': 'False',
'cpu_processes': 1,
'cpu_thread_type': 'POSIX',
'cpu_threads': 1,
'environment': {'OPENMM_CPU_THREADS': '1',
'OPENMM_CUDA_COMPILER': '`which nvcc`'},
'executable': 'python',
'gpu_process_type': 'POSIX',
'gpu_processes': 1,
'gpu_thread_type': 'CUDA',
'gpu_threads': 1,
My stack is:
python : 2.7.12
pythonpath : /sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
virtualenv : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv
radical.analytics : 0.47.0
radical.pilot : 0.47.10-v0.47.10-190-g87049544@project-adaptivemd_gpu_am
radical.utils : 0.47.4-merge-pre_gpu-20-g3e0240f@project-adaptivemd
saga : 0.47.3-merge-pre_gpu-31-g8af8a223@project-adaptivemd
@andre-merzky @mturilli @vivek-bala Just want to ping on this issue. I've tested a couple different option configurations from the gpu examples and my own; if I add gpu attributes to the CUD I seem to always get this error.
Hey John - I had to open a titan ticket on a pip problem which stopped me from reproducing your problem. Alas, that ticket is still open. I pinged them again, and will report back as soon as I hear something.
@vivek-bala suggested for me to use the python_anaconda module on Titan to resolve the SSL version issue with pip; if this is the problem, it might work out with a direct swap in the config files or wherever is relevant. When I asked the Titan folks, they didn't make this suggestion, which would have been super helpful.
Hi @jrossyra,
the python stack on Titan is now functional again. I can run the code below against this stack:
python : 2.7.9
pythonpath : /sw/titan/.swci/0-login/opt/spack/20180315/linux-suse_linux11-x86_64/gcc-4.3.4/python-2.7.9-v6ctjewwdx6k2qs7ublexz7gnx457jo5/lib/python2.7/site-packages:/sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
virtualenv : /autofs/nccs-svm1_home1/merzky1/radical/ve.jo
radical.pilot : 0.47.13-v0.47.13-196-gdde57892@project-adaptivemd_gpu_am
radical.utils : 0.47.5-v0.47.5-26-gb254b8b@project-adaptivemd
saga : 0.47.6-v0.47.6-37-g7e6a1411@project-adaptivemd
This stack is slightly different from yours for all three layers, as I merged a number of fixes over the last days. The test code is:
#!/usr/bin/env python

import radical.pilot as rp

if __name__ == '__main__':

    resource = 'ornl.titan_aprun'
    session  = rp.Session()

    try:
        pmgr    = rp.PilotManager(session=session)
        pd_init = {'resource'      : resource,
                   'runtime'       : 15,  # pilot runtime (min)
                   'exit_on_error' : True,
                   'project'       : 'BIP149',
                   'queue'         : 'debug',
                   'access_schema' : 'local',
                   'cores'         : 192,
                   'gpus'          : 11
                  }
        pdesc = rp.ComputePilotDescription(pd_init)
        pilot = pmgr.submit_pilots(pdesc)

        umgr = rp.UnitManager(session=session)
        umgr.add_pilots(pilot)

        cuds = list()
        for i in range(256):
            cud = rp.ComputeUnitDescription(from_dict={
                'cpu_processes'   : 1,
                'cpu_thread_type' : 'POSIX',
                'cpu_threads'     : 1,
                'environment'     : {'OPENMM_CPU_THREADS': '1',
                                     'OPENMM_CUDA_COMPILER': '`which nvcc`'},
                'executable'      : 'python',
                'arguments'       : ['-V'],
                'gpu_process_type': 'POSIX',
                'gpu_processes'   : 1,
                'gpu_thread_type' : 'CUDA',
                'gpu_threads'     : 1})
            cuds.append(cud)

        umgr.submit_units(cuds)
        umgr.wait_units()

    finally:
        session.close(download=True)
which should be fairly close to the pilot and CU description you are using.
Can you please try to reproduce this run? For me the original problem (mapping the CUs to GPUs / nodes) seems resolved.
FWIW, the above is for ornl.titan_aprun - but the resource label ornl.titan (which uses the ORTE execution layer) should work as well.
Hi Andre, it took a while to get my setup reconfigured for the new environment; I have a different split now between the application and task environments than before. It looks like the mapping issue is resolved, so I'll go ahead and test for GPU functionality on my end.
For my setup, I plan to do a module load cudatoolkit in the pretask to read and store the nvcc path in the OPENMM_CUDA_COMPILER environment variable. I have this in both the environment and pretask definitions, since it seems nvcc will not be visible in the environment part as cudatoolkit isn't loaded yet. I'm assuming that if it isn't, I can overwrite the var in the pretask and the new, nonempty value is passed in the orterun line.
Is there anything in the orterun line I should see to indicate that the GPU will be in the compute environment when the main execution starts?
Here's my current CUD dict:
{'arguments': ['openmmrun.py', ... more args ...],
'cleanup': False,
'cpu_process_type': 'POSIX',
'cpu_processes': 1,
'cpu_thread_type': 'POSIX',
'cpu_threads': 1,
'environment': {'OPENMM_CPU_THREADS': '1',
'OPENMM_CUDA_COMPILER': '`which nvcc`'},
'executable': 'python',
'gpu_process_type': 'POSIX',
'gpu_processes': 1,
'gpu_thread_type': 'CUDA',
'gpu_threads': 1,
'input_staging': [ ... staging actions ... ],
'kernel': None,
'name': '338c21b0-6830-11e8-bdba-0000000001a0',
'output_staging': [ ... staging actions ... ],
'pilot': None,
'post_exec': [ ... ... ],
'pre_exec': ['mkdir -p traj',
'mkdir -p extension',
'source /lustre/atlas/proj-shared/bip149/jrossyra/taskenv/miniconda2/bin/activate admdenv',
'echo "CPU THREADS: ${OPENMM_CPU_THREADS}"',
'module load cudatoolkit',
'export OPENMM_CUDA_COMPILER=`which nvcc`',
'echo $OPENMM_CUDA_COMPILER',
'module unload python',
'module unload python_anaconda',
'echo " >>> TIMER Task start "`date +%s.%3N`'],
'restartable': False,
'stderr': None,
'stdout': None}
I'm going to close the issue since the GPUs are working :) I've run a number of test workflows and everything seems to function as I expect. I suppose most of my questions are a bit moot at this point. Thanks again for all of your help!
When switching to the project/adaptivemd_gpu_am branch, the first error I see is this guy: the session didn't close properly (I'll try to catch a KeyboardInterrupt where this hung in the future); here are the log files I was able to capture. I pass a total gpu count from AdaptiveMD to RP via a module, which is handed off as a field gpus when creating the resource manager. rp.session.tar.gz