radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

AdaptiveMD Branch Pilot Attribute total_gpu_count error #1610

Closed jro1234 closed 6 years ago

jro1234 commented 6 years ago

When switching to the project/adaptivemd_gpu_am branch, the first error I see is the following:

Traceback (most recent call last):
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/radical/pilot/pmgr/launching/default.py", line 489, in work
    self._start_pilot_bulk(resource, schema, pilots)
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/radical/pilot/pmgr/launching/default.py", line 688, in _start_pilot_bulk
    jc.add(js.create_job(jd))
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/job/service.py", line 281, in create_job
    default = jd_default.get_attribute (key)
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py", line 2353, in get_attribute
    return   self._attributes_i_get        (us_key, _flow)
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py", line 1144, in _attributes_i_get
    d = self._attributes_t_init (key)
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py", line 406, in _attributes_t_init
    raise se.DoesNotExist ("attribute key is invalid: %s"  %  (key))
DoesNotExist: attribute key is invalid: total_gpu_count (/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py +406 (_attributes_t_init)  :  raise se.DoesNotExist ("attribute key is invalid: %s"  %  (key)))

The session didn't close properly (I'll try to catch a keyboard interrupt where this hung in the future); here are the log files I was able to capture. I pass a total GPU count from AdaptiveMD to RP via a module, and it is handed off as a gpus field when creating the resource manager.

rp.session.tar.gz
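For reference, the hand-off on my side boils down to something like this (a simplified sketch with placeholder values; the actual AdaptiveMD wrapper is more involved):

import radical.pilot as rp

# the total GPU count computed by AdaptiveMD ends up as the 'gpus' field of
# the RP pilot description (values below are placeholders)
session = rp.Session()
pmgr    = rp.PilotManager(session=session)

pd_init = {'resource': 'ornl.titan_orte',
           'project' : 'bip149',
           'queue'   : 'batch',
           'runtime' : 15,     # minutes
           'cores'   : 192,
           'gpus'    : 11}

pilot = pmgr.submit_pilots(rp.ComputePilotDescription(pd_init))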

andre-merzky commented 6 years ago

Can you please send the output of radical-stack? You should be on devel or project/adaptivemd (they are the same right now) on saga-python and radical.utils, too.

jro1234 commented 6 years ago

I am actually on the project/adaptivemd_gpu_am branch; I will try again in a minute after switching to project/adaptivemd. I'm pretty sure that in my slew of tests I did use the 'plain' adaptivemd branch and encountered the same error, but I'm not 100% sure of that.

  python               : 2.7.9
  pythonpath           : /sw/titan/.swci/0-login/opt/spack/20180315/linux-suse_linux11-x86_64/gcc-4.3.4/python-2.7.9-v6ctjewwdx6k2qs7ublexz7gnx457jo5/lib/python2.7/site-packages:/sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
  virtualenv           : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv

  radical.analytics    : v0.45.2-101-g8358b08@devel
  radical.pilot        : 0.47.10-v0.47.10-178-g769972a3@project-adaptivemd_gpu_am
  radical.utils        : 0.47-v0.47-5-g4573dd7@project-adaptivemd
  saga                 : 0.47-v0.46-38-g8b602364@project-adaptivemd
jro1234 commented 6 years ago

It is the same error with this stack. I made a typo when fetching logs last time, so here are updated logs too. Thanks!


  python               : 2.7.9
  pythonpath           : /sw/titan/.swci/0-login/opt/spack/20180315/linux-suse_linux11-x86_64/gcc-4.3.4/python-2.7.9-v6ctjewwdx6k2qs7ublexz7gnx457jo5/lib/python2.7/site-packages:/sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
  virtualenv           : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv

  radical.analytics    : v0.45.2-101-g8358b08@devel
  radical.pilot        : 0.47.5-v0.47.5-170-g747b12d1@project-adaptivemd
  radical.utils        : 0.47-v0.47-5-g4573dd7@project-adaptivemd
  saga                 : 0.47-v0.46-38-g8b602364@project-adaptivemd

rp.session.tar.gz

jro1234 commented 6 years ago

@andre-merzky Any updates here?

andre-merzky commented 6 years ago

Hey @jrossyra, sorry for the late update, it's a slow week over here... On what resource are you using RP with those experiments?

Thanks, Andre.

jro1234 commented 6 years ago

It's on Titan, and I understand about the lag! It's the titan_orte config.

jro1234 commented 6 years ago

Andre, I also wanted to mention that when I originally chose which branch to test, I believe I was looking at the latest commit. I just double-checked, and project/adaptivemd_gpu_am is the most recently updated branch, although as mentioned above, I'm getting the same error on each branch.

andre-merzky commented 6 years ago

I think you are on the right branch in SAGA, but on an outdated commit (8b602364). Can you please try upgrading all three repos and see if that solves the problem?

Your first stack pasted above is correct in terms of branches (RP: project/adaptivemd_gpu_am; RS and RU: project/adaptivemd).

jro1234 commented 6 years ago

That's embarrassing :\

Just to confirm, the install order is 1) utils, 2) saga, 3) pilot, right?

It looks like there's a new dependency, the bitarray package. There is currently no way to install packages with pip on Titan using their python module since the SSL upgrade to PyPI, so I can't get the packages to import. I've been treading as lightly as possible since then because I'm trying to avoid breaking my setup; if there's a way I can work around this, let me know. I'm not fancy enough to have an idea of how to work around this one, so I've been using --no-deps and --no-index to get my repos to reinstall at all; otherwise I error out.

vivek-bala commented 6 years ago

Yes, the order of the packages is correct. I ran into the same error on Titan; you need to use the python_anaconda module. I have tested it with RP, and it seems to be working as intended.

andre-merzky commented 6 years ago

Sorry for that - the bitarray dependency is now fixed, please give it another try... Thanks!

jro1234 commented 6 years ago

Reinstalling now. I had to save some changes to my execution scripts; hopefully the reinstall works smoothly under the anaconda module, and I'll post back soon.

jro1234 commented 6 years ago

I get a different attribute failure with the gpu attribute deactivated (not passed to the ComputePilotDescription).

rp.session.nogpuattr.tar.gz

I ran a second one with the gpus attribute, and the mpi attribute error came up first.

andre-merzky commented 6 years ago

Hey John - thanks for the logs! I'm afraid I am not really sure what to make of them though: I only see one attribute-related message, which is:

Skipping adaptor saga.adaptors.context.myproxy: loading failed: ''Adaptor' object has no attribute 'opts''

which is a warning and indeed should not affect your run. But the pilot seems to be canceled, so I assume there was an error somewhere. Can you please also attach the stdout / stderr of your application, or any other information you have about the failure mode? Please also attach the output of radical-stack once more.

Thanks!

jro1234 commented 6 years ago

Yes, I should have done these things!

  python               : 2.7.12
  pythonpath           : /sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
  virtualenv           : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv

  radical.pilot        : 0.47.10-v0.47.10-178-g769972a3@project-adaptivemd_gpu_am
  radical.utils        : 0.47.4-merge-pre_gpu-20-g3e0240f@project-adaptivemd
  saga                 : 0.47.3-merge-pre_gpu-31-g8af8a223@project-adaptivemd

I didn't realize the error didn't propagate; here's the output on the screen. I'll go hunting for more, I just wanted to get this over quickly.

new session: [rp.session.titan-ext1.jrossyra.017662.0001]                      \
database   : [mongodb://160.91.205.198:27017/rp]                              ok
create pilot manager                                                          ok
submit 1 pilot(s)
        .                                                                     ok
create unit manager                                                           ok
add 1 pilot(s)                                                                ok
Traceback (most recent call last):
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/adaptivemd/rp/utils.py", line 246, in create_cud_from_task_def
    cud = generate_trajectorygenerationtask_cud(task_desc, db, shared_path, project)
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/adaptivemd/rp/utils.py", line 386, in generate_trajectorygenerationtask_cud
    cud.mpi = False
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py", line 2643, in __setattr__
    return self._attributes_i_set      (key, val, flow=self._DOWN)
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py", line 1068, in _attributes_i_set
    raise se.IncorrectState ("attribute set is not extensible/private (key %s)" %  key)
IncorrectState: attribute set is not extensible/private (key mpi) (/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py +1068 (_attributes_i_set)  :  raise se.IncorrectState ("attribute set is not extensible/private (key %s)" %  key))

2018-05-11 14:09:11,047: client.rp           : Process-2                       : MainThread     : ERROR   : Client process failed, error: Error: attribute set is not extensible/private (key mpi) (/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py +1068 (_attributes_i_set)  :  raise se.IncorrectState ("attribute set is not extensible/private (key %s)" %  key))
Traceback (most recent call last):
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/adaptivemd/rp/client.py", line 167, in _runme
    cuds = create_cud_from_task_def(task_descs, self._db, resource_desc_for_pilot['shared_path'], self._project)
  File "/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/adaptivemd/rp/utils.py", line 259, in create_cud_from_task_def
    raise Error(msg=ex)
Error: Error: attribute set is not extensible/private (key mpi) (/lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv/lib/python2.7/site-packages/saga/attributes.py +1068 (_attributes_i_set)  :  raise se.IncorrectState ("attribute set is not extensible/private (key %s)" %  key))

wait for 1 pilot(s)
                                                                              ok
closing session rp.session.titan-ext1.jrossyra.017662.0001                     \
close unit manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
+ rp.session.titan-ext1.jrossyra.017662.0001 (json)
- pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 295.6s                                                      ok
andre-merzky commented 6 years ago

Ah, that already helps, and points to a problem: RP indeed should support the mpi attribute for backward compatibility! This is an easy fix fortunately - brb :-) Thanks!
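A rough sketch of what such a mapping could look like (illustrative only, not the actual RP code; the mapping of False to 'POSIX' is an assumption):

# illustrative sketch of the backward-compatibility mapping (not the real
# RP implementation): deprecated CU description keys are renamed to their
# current counterparts, with a deprecation warning
DEPRECATED = {'mpi'  : 'cpu_process_type',
              'cores': 'cpu_processes'}

def translate_deprecated(descr):
    out = dict(descr)
    for old, new in DEPRECATED.items():
        if old in out:
            print("attribute key / property name '%s' is deprecated"
                  " - use '%s'" % (old, new))
            val = out.pop(old)
            if old == 'mpi':
                # assumption: the boolean 'mpi' flag becomes a process type
                out[new] = 'MPI' if val else 'POSIX'
            else:
                out[new] = val
    return out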

jro1234 commented 6 years ago

OK, very good! I also just noticed that the failure is at the actual line where the attribute is set, which I should have seen before. Thanks!

andre-merzky commented 6 years ago

This is pushed now (to RP).

jro1234 commented 6 years ago

Testing now, thanks! (It looks like I should switch to the devel branch; I'll stay there unless you indicate otherwise.)

jro1234 commented 6 years ago

It looks like all the objects are happy to exist now; however, the CUs don't progress past the AGENT_SCHEDULING_PENDING state. Here are the logs: rp.session.tar.gz

And just a snippet of printout, including from my unit callback function:

new session: [rp.session.titan-ext7.jrossyra.017662.0006]                      \
database   : [mongodb://160.91.205.208:27017/rp]                              ok
create pilot manager                                                          ok
{'cores': 192,
 'gpus': 11,
 'project': 'bip149',
 'queue': 'batch',
 'resource': 'ornl.titan_orte',
 'runtime': 15}
submit 1 pilot(s)
        .                                                                     ok
create unit manager                                                           ok
add 1 pilot(s)                                                                ok
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
attribute key / property name 'mpi' is deprecated - use 'cpu_process_type'
attribute key / property name 'cores' is deprecated - use 'cpu_processes'
[<radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c58790>,
 <radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c588d0>,
 <radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c58b10>,
 <radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c58d10>,
 <radical.pilot.compute_unit_description.ComputeUnitDescription object at 0x2b91d7c58f10>]
submit 5 unit(s)
        .....                                                                 ok
CALLBACK state:  unit.000000 UMGR_SCHEDULING_PENDING
CALLBACK state:  unit.000001 UMGR_SCHEDULING_PENDING
CALLBACK state:  unit.000002 UMGR_SCHEDULING_PENDING
CALLBACK state:  unit.000003 UMGR_SCHEDULING_PENDING
CALLBACK state:  unit.000004 UMGR_SCHEDULING_PENDING
CALLBACK state:  unit.000000 UMGR_SCHEDULING
CALLBACK state:  unit.000001 UMGR_SCHEDULING
CALLBACK state:  unit.000002 UMGR_SCHEDULING
CALLBACK state:  unit.000003 UMGR_SCHEDULING
CALLBACK state:  unit.000004 UMGR_SCHEDULING
CALLBACK state:  unit.000000 UMGR_STAGING_INPUT_PENDING
CALLBACK state:  unit.000001 UMGR_STAGING_INPUT_PENDING
CALLBACK state:  unit.000002 UMGR_STAGING_INPUT_PENDING
CALLBACK state:  unit.000003 UMGR_STAGING_INPUT_PENDING
CALLBACK state:  unit.000004 UMGR_STAGING_INPUT_PENDING
CALLBACK state:  unit.000000 UMGR_STAGING_INPUT
CALLBACK state:  unit.000001 UMGR_STAGING_INPUT
CALLBACK state:  unit.000002 UMGR_STAGING_INPUT
CALLBACK state:  unit.000003 UMGR_STAGING_INPUT
CALLBACK state:  unit.000004 UMGR_STAGING_INPUT
CALLBACK state:  unit.000000 AGENT_STAGING_INPUT_PENDING
CALLBACK state:  unit.000001 AGENT_STAGING_INPUT_PENDING
CALLBACK state:  unit.000002 AGENT_STAGING_INPUT_PENDING
CALLBACK state:  unit.000003 AGENT_STAGING_INPUT_PENDING
CALLBACK state:  unit.000004 AGENT_STAGING_INPUT_PENDING
CALLBACK state:  unit.000000 AGENT_STAGING_INPUT
CHECKER is sleeping
CALLBACK state:  unit.000003 AGENT_STAGING_INPUT
CALLBACK state:  unit.000003 AGENT_SCHEDULING_PENDING
CALLBACK state:  unit.000002 AGENT_STAGING_INPUT
CALLBACK state:  unit.000002 AGENT_SCHEDULING_PENDING
CALLBACK state:  unit.000001 AGENT_STAGING_INPUT
CALLBACK state:  unit.000001 AGENT_SCHEDULING_PENDING
CALLBACK state:  unit.000000 AGENT_SCHEDULING_PENDING
CALLBACK state:  unit.000004 AGENT_STAGING_INPUT
CALLBACK state:  unit.000004 AGENT_SCHEDULING_PENDING
CHECKER is sleeping
CHECKER is sleeping
CHECKER is sleeping
...
CHECKER is sleeping
2018-05-11 16:35:39,100: resource_manager.rp : Process-2                       : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed

That is at the end of the PBS job. My workflow persists past this, so I did a keyboard interrupt after that. Let me know what other output would be useful. The stack, for the record:

  python               : 2.7.12
  pythonpath           : /sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
  virtualenv           : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv

  radical.analytics    : 0.47.0
  radical.pilot        : 0.47.10-v0.47.10-189-gf248cfa5@project-adaptivemd_gpu_am
  radical.utils        : 0.47.4-merge-pre_gpu-20-g3e0240f@project-adaptivemd
  saga                 : 0.47.3-merge-pre_gpu-31-g8af8a223@project-adaptivemd
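For reference, the 'CALLBACK state:' lines above come from a unit state callback along these lines (a simplified sketch, not my exact code):

def unit_state_cb(unit, state):
    # print every state transition RP reports for a unit
    print('CALLBACK state: ', unit.uid, state)

# the callback is registered on the unit manager before submitting units,
# e.g. (assuming an existing session and list of CU descriptions 'cuds'):
#   umgr = rp.UnitManager(session=session)
#   umgr.register_callback(unit_state_cb)
#   umgr.submit_units(cuds)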
andre-merzky commented 6 years ago

Hmm, this is strange: the pilot agent seems to come up all right. It pulls the unit and seems to go through the input staging part, too - but then nothing. There seem to be some logfiles missing, specifically agent_0.scheduling.0.child.log, but also others, and there are also no .out and .err files in the pilot sandbox - so it's really hard to tell what's happening. Do you have any idea why those logfiles might be missing?

jro1234 commented 6 years ago

I am not sure... I will inspect everything more closely, and I can upload more from the agent directory.

andre-merzky commented 6 years ago

Yes, please do pack the agent sandbox - thanks!

jro1234 commented 6 years ago

I noticed that I had my environment sourcing commented out in the bashrc, so I ran another workflow with this fixed. This has always been a fixture of my platform. However, it looks like the same (lack of) errors; here are the session and agent folders. The agent_0 scheduling file is still missing. (Identical stack.) pilot.tar.gz rp.session.tar.gz

andre-merzky commented 6 years ago

Thanks John - we are finally looking at differences between Mongo 2 and 3, which is what this branch was all about :-) I pushed another commit to RP which should avoid this problem (a failing update call stalled the agent).

jro1234 commented 6 years ago

@andre-merzky I have a different version of the workflow logs in the Slack; however, I thought I'd run with debug on in case that helps troubleshoot the current issue. The error is still that the scheduler sees each CU as not fitting on a single node. Let me know if there's any further info I can provide to troubleshoot.

Traceback (most recent call last):
  File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1264, in work_cb
    self._workers[state](things)
  File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/scheduler/base.py", line 425, in _schedule_units
    if self._try_allocation(unit):
  File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/scheduler/base.py", line 449, in _try_allocation
    unit['slots'] = self._allocate_slot(unit['description'])
  File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/scheduler/continuous.py", line 129, in _allocate_slot
    slots = self._alloc_nompi(cud)
  File "/lustre/atlas/scratch/jrossyra/bip149/radical.pilot.sandbox/rp.session.titan-ext6.jrossyra.017665.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/scheduler/continuous.py", line 291, in _alloc_nompi
    raise ValueError('Non-mpi unit does not fit onto single node')

I have these attributes for the CUs (a sketch of how I read the node-fit check follows the listing). I've tried setting the cpu_process_type, but it seems to always revert back to 'False':

 'cpu_process_type': 'False',
 'cpu_processes': 1,
 'cpu_thread_type': 'POSIX',
 'cpu_threads': 1,
 'environment': {'OPENMM_CPU_THREADS': '1',
                 'OPENMM_CUDA_COMPILER': '`which nvcc`'},
 'executable': 'python',
 'gpu_process_type': 'POSIX',
 'gpu_processes': 1,
 'gpu_thread_type': 'CUDA',
 'gpu_threads': 1,
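For context, my understanding of the check that raises this error is roughly the following (an illustrative sketch, not the actual continuous-scheduler code; the Titan per-node numbers are my assumption):

def fits_on_single_node(cud, cores_per_node=16, gpus_per_node=1):
    # a non-MPI unit has to fit entirely onto one node, CPU cores and GPUs alike
    requested_cores = cud['cpu_processes'] * cud['cpu_threads']
    requested_gpus  = cud['gpu_processes']
    return (requested_cores <= cores_per_node and
            requested_gpus  <= gpus_per_node)

# with the attributes above (1 CPU core, 1 GPU) the unit should fit on a
# Titan node (16 cores, 1 GPU), so the ValueError looks more like a
# scheduling problem than an oversized request.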

My stack is:

  python               : 2.7.12
  pythonpath           : /sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
  virtualenv           : /lustre/atlas/proj-shared/bip149/jrossyra/admdrp/admdrpenv

  radical.analytics    : 0.47.0
  radical.pilot        : 0.47.10-v0.47.10-190-g87049544@project-adaptivemd_gpu_am
  radical.utils        : 0.47.4-merge-pre_gpu-20-g3e0240f@project-adaptivemd
  saga                 : 0.47.3-merge-pre_gpu-31-g8af8a223@project-adaptivemd

rp.session.tar.gz

jro1234 commented 6 years ago

@andre-merzky @mturilli @vivek-bala Just want to ping on this issue. I've tested a couple of different option configurations from the GPU examples and my own; if I add GPU attributes to the CUD, I always seem to get this error.

andre-merzky commented 6 years ago

Hey John - I had to open a Titan ticket on a pip problem which stopped me from reproducing your issue. Alas, that ticket is still open. I pinged them again and will report back as soon as I hear something.

jro1234 commented 6 years ago

@vivek-bala suggested that I use the python_anaconda module on Titan to resolve the SSL version issue with pip; if this is the problem, it might work out with a direct swap in the config files or wherever is relevant. When I asked the Titan folks, they didn't make this suggestion, which would have been super helpful.

andre-merzky commented 6 years ago

Hi @jrossyra,

the python stack on Titan is now functional again. I can run the code below against this stack:

  python               : 2.7.9
  pythonpath           : /sw/titan/.swci/0-login/opt/spack/20180315/linux-suse_linux11-x86_64/gcc-4.3.4/python-2.7.9-v6ctjewwdx6k2qs7ublexz7gnx457jo5/lib/python2.7/site-packages:/sw/xk6/xalt/0.7.5/site:/sw/xk6/xalt/0.7.5/libexec
  virtualenv           : /autofs/nccs-svm1_home1/merzky1/radical/ve.jo

  radical.pilot        : 0.47.13-v0.47.13-196-gdde57892@project-adaptivemd_gpu_am
  radical.utils        : 0.47.5-v0.47.5-26-gb254b8b@project-adaptivemd
  saga                 : 0.47.6-v0.47.6-37-g7e6a1411@project-adaptivemd

This stack is slightly different from yours for all three layers, as I merged a number of fixes over the last days. The test code is:

#!/usr/bin/env python

import radical.pilot as rp

if __name__ == '__main__':

    resource = 'ornl.titan_aprun'
    session  = rp.Session()

    try:

        pmgr    = rp.PilotManager(session=session)
        pd_init = {'resource'      : resource,
                   'runtime'       : 15,  # pilot runtime (min)
                   'exit_on_error' : True,
                   'project'       : 'BIP149',
                   'queue'         : 'debug',
                   'access_schema' : 'local',
                   'cores'         : 192,
                   'gpus'          : 11
                  }
        pdesc = rp.ComputePilotDescription(pd_init)
        pilot = pmgr.submit_pilots(pdesc)

        umgr = rp.UnitManager(session=session)
        umgr.add_pilots(pilot)

        cuds = list()
        for i in range(256):

            cud = rp.ComputeUnitDescription(from_dict={
                'cpu_processes'   : 1,
                'cpu_thread_type' : 'POSIX',
                'cpu_threads'     : 1,
                'environment'     : {'OPENMM_CPU_THREADS': '1',
                                     'OPENMM_CUDA_COMPILER': '`which nvcc`'},
                'executable'      : 'python',
                'arguments'       : ['-V'],
                'gpu_process_type': 'POSIX',
                'gpu_processes'   : 1,
                'gpu_thread_type' : 'CUDA',
                'gpu_threads'     : 1})
            cuds.append(cud)

        umgr.submit_units(cuds)
        umgr.wait_units()

    finally:
        session.close(download=True)

which should be fairly close to the pilot and CU description you are using.

Can you please try to reproduce this run? For me the original problem (mapping the CUs to GPUs / nodes) seems resolved.

andre-merzky commented 6 years ago

FWIW, the above is for ornl.titan_aprun - but the resource label ornl.titan (which uses the ORTE execution layer) should work as well.

jro1234 commented 6 years ago

Hi Andre, it took a while to get my setup reconfigured for the new environment; I now have a different split between the application and task environments than before. It looks like the mapping issue is resolved, so I'll go ahead and test for GPU functionality on my end.

For my setup, I plan to do a module load cudatoolkit in the pretask to read and store the nvcc path in the OPENMM_CUDA_COMPILER environment variable. I have this in both the environment and pretask definitions, since it seems nvcc will not be visible in the environment part, as cudatoolkit isn't loaded yet. I'm assuming that if it isn't, I can overwrite the variable in the pretask and the new, non-empty value is passed on the orterun line.

Is there anything in the orterun line I should see to indicate that the gpu will be in the compute environment when the main execution starts?

Here's my current CUD dict:

{'arguments': ['openmmrun.py', ... more args ...],
 'cleanup': False,
 'cpu_process_type': 'POSIX',
 'cpu_processes': 1,
 'cpu_thread_type': 'POSIX',
 'cpu_threads': 1,
 'environment': {'OPENMM_CPU_THREADS': '1',
                 'OPENMM_CUDA_COMPILER': '`which nvcc`'},
 'executable': 'python',
 'gpu_process_type': 'POSIX',
 'gpu_processes': 1,
 'gpu_thread_type': 'CUDA',
 'gpu_threads': 1,
 'input_staging': [ ... staging actions ... ],
 'kernel': None,
 'name': '338c21b0-6830-11e8-bdba-0000000001a0',
 'output_staging': [ ... staging actions ... ],
 'pilot': None,
 'post_exec': [ ... ... ],
 'pre_exec': ['mkdir -p traj',
              'mkdir -p extension',
              'source /lustre/atlas/proj-shared/bip149/jrossyra/taskenv/miniconda2/bin/activate admdenv',
              'echo "CPU THREADS: ${OPENMM_CPU_THREADS}"',
              'module load cudatoolkit',
              'export OPENMM_CUDA_COMPILER=`which nvcc`',
              'echo $OPENMM_CUDA_COMPILER',
              'module unload python',
              'module unload python_anaconda',
              'echo "   >>>  TIMER Task start "`date +%s.%3N`'],
 'restartable': False,
 'stderr': None,
 'stdout': None}
jro1234 commented 6 years ago

I'm going to close the issue since the GPUs are working :) I've run a number of test workflows and everything seems to function as I expect. I suppose most of my questions are a bit moot at this point. Thanks again for all of your help!