radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Support for flux exec scripts #3146

Closed andre-merzky closed 6 months ago

andre-merzky commented 6 months ago

This fixes #3114

mtitov commented 6 months ago

@andre-merzky I've tested it on Frontier and got the following errors (the one in flux.dump is probably not crucial, but the one in agent_executing is):

$ cat flux.0039.log 
1710449568.901 : flux.0039            : 19460 : 140737179921280 : DEBUG    : flux output: URI:local:///tmp/flux-I1uzqO/local-0
1710449568.903 : flux.0039            : 19460 : 140731208460032 : INFO     : starting flux watcher
1710449568.903 : flux.0039            : 19460 : 140737179921280 : INFO     : flux startup successful: [ssh://frontier07386.frontier.olcf.ornl.gov/tmp/flux-I1uzqO/local-0]
$ cat flux.0040.dump 
connect flux 19424: ssh://frontier07386.frontier.olcf.ornl.gov/tmp/flux-I1uzqO/local-0
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/bin/radical-pilot-agent_0", line 17, in <module>
    agent = rp.Agent_0()
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/agent_0.py", line 47, in __init__
    self._session = Session(uid=cfg.sid, cfg=cfg, _role=Session._AGENT_0)
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/session.py", line 192, in __init__
    elif self._role == self._AGENT_0: self._init_agent_0()
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/session.py", line 288, in _init_agent_0
    self._init_rm()
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/session.py", line 812, in _init_rm
    self._rm = ResourceManager.create(name=rname,
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 382, in create
    return rm(cfg, rcfg, log, prof)
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 161, in __init__
    self._prepare_launch_methods()
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 339, in _prepare_launch_methods
    self._launchers[lm_name] = rpa.LaunchMethod.create(
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/launch_method/base.py", line 148, in create
    return impl[name](name, lm_cfg, rm_info, log, prof)
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/launch_method/flux.py", line 18, in __init__
    LaunchMethod.__init__(self, name, lm_cfg, rm_info, session, prof)
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/launch_method/base.py", line 91, in __init__
    self._init_from_info(lm_info)
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/launch_method/flux.py", line 62, in _init_from_info
    self._fh.connect_flux(uri=self._details['flux_uri'])
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/utils/flux.py", line 419, in connect_flux
    for l in get_stacktrace():
$ cat agent_executing.0000.log 
...
1710449602.816 : agent_executing.0000 : 20215 : 140735505663744 : DEBUG    : submit tasks: [{'tasks': [{'slot': 'task', 'count': {'per_slot': 1}, 'command': ['/bin/sh', '-c', '/lustre/orion/scratch/matitov/chm155/radical.pilot.sandbox/rp.session.frontier07386.matitov.019796.0000/pilot.0000/task.000000//task.000000.exec.sh 1>/lustre/orion/scratch/matitov/chm155/radical.pilot.sandbox/rp.session.frontier07386.matitov.019796.0000/pilot.0000/task.000000//task.000000.out 2>/lustre/orion/scratch/matitov/chm155/radical.pilot.sandbox/rp.session.frontier07386.matitov.019796.0000/pilot.0000/task.000000//task.000000.err']}], 'attributes': {'system': {'cwd': '/lustre/orion/scratch/matitov/chm155/radical.pilot.sandbox/rp.session.frontier07386.matitov.019796.0000/pilot.0000/task.000000/', 'duration': 0}}, 'version': 1, 'resources': [{'count': 8, 'type': 'slot', 'label': 'task', 'with': [{'count': 1, 'type': 'core'}, {'count': 1.0, 'type': 'gpu'}]}]}]
1710449603.099 : agent_executing.0000 : 20215 : 140735505663744 : ERROR    : work <bound method Flux.work of <radical.pilot.agent.executing.flux.Flux object at 0x7fffe4cd9d00>> failed
Traceback (most recent call last):
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/utils/component.py", line 1032, in work_cb
    self._workers[state](things)
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/executing/flux.py", line 129, in work
    jids = self._lm.fh.submit_jobs([jd for jd in jds])
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/utils/flux.py", line 477, in submit_jobs
    flux_id = fut.get_id()
  File "/usr/lib64/python3.6/site-packages/flux/job/submit.py", line 26, in get_id
    return submit_get_id(self)
  File "/usr/lib64/python3.6/site-packages/flux/util.py", line 70, in func_wrapper
    raise EnvironmentError(error.errno, errmsg.decode("utf-8")) from None
PermissionError: [Errno 1] count must be an int or mapping
1710449603.100 : agent_executing.0000 : 20215 : 140735505663744 : DEBUG    : advance bulk: 1 [False, True, FAILED]

(*) single task with 8 ranks, 1 core per rank and 1 gpu per rank, and it has its own conda env (which is activated in its pre-exec section)

andre-merzky commented 6 months ago

> (*) single task with 8 ranks, 1 core per rank and 1 gpu per rank, and it has its own conda env (which is activated in its pre-exec section)

Thanks @mtitov - this was caused by our float type for gpus_per_rank and should be fixed now - please do give it another try.
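
For reference, the jobspec in the `agent_executing` log above carries `'count': 1.0` for the `gpu` resource, while Flux rejects it with `count must be an int or mapping`. A minimal sketch of the kind of normalization involved, assuming the `resources` section is a plain list of dicts shaped like the one in the log. `normalize_counts` is a hypothetical helper for illustration, not the actual change made in this PR:

```python
def normalize_counts(resources):
    """Return a copy of a jobspec 'resources' list in which integral float
    counts are converted to int (hypothetical helper, illustration only)."""
    fixed = []
    for res in resources:
        res = dict(res)                      # shallow copy, input stays untouched
        count = res.get('count')
        if isinstance(count, float):
            if not count.is_integer():
                # a fractional count cannot be expressed as an int
                raise ValueError('non-integral count for %s: %r'
                                 % (res.get('type'), count))
            res['count'] = int(count)
        if 'with' in res:
            # nested resources (cores, gpus) live under 'with'
            res['with'] = normalize_counts(res['with'])
        fixed.append(res)
    return fixed


# 'resources' section as logged by agent_executing above
resources = [{'count': 8, 'type': 'slot', 'label': 'task',
              'with': [{'count': 1,   'type': 'core'},
                       {'count': 1.0, 'type': 'gpu'}]}]

fixed = normalize_counts(resources)
```

Counts that are already ints or mappings pass through unchanged, and the sketch raises on fractional values instead of silently truncating them.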

andre-merzky commented 6 months ago

> Is this related to flux-framework/flux-core#3372?

I don't think so - we don't use sched-simple. Maybe the remaining float type was causing this - how does it look now? Anything in the flux*.log files?

codecov[bot] commented 6 months ago

Codecov Report

Attention: Patch coverage is 83.39350%, with 46 lines in your changes missing coverage. Please review.

Project coverage is 44.25%. Comparing base (c2489e0) to head (5a664f5).

| Files | Patch % | Lines |
|---|---|---|
| src/radical/pilot/agent/executing/flux.py | 5.88% | 32 Missing :warning: |
| src/radical/pilot/agent/executing/base.py | 96.47% | 8 Missing :warning: |
| src/radical/pilot/agent/scheduler/flux.py | 0.00% | 2 Missing :warning: |
| src/radical/pilot/agent/scheduler/noop.py | 0.00% | 2 Missing :warning: |
| src/radical/pilot/agent/executing/popen.py | 80.00% | 1 Missing :warning: |
| src/radical/pilot/agent/launch_method/flux.py | 0.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##            devel    #3146      +/-   ##
==========================================
- Coverage   44.54%   44.25%   -0.29%
==========================================
  Files          97       97
  Lines       10634    10520     -114
==========================================
- Hits         4737     4656      -81
+ Misses       5897     5864      -33
```


mtitov commented 6 months ago

seems that Flux doesn't see GPUs

flux resources [ 0 ]:
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       56        0 frontier00178
 allocated      0        0        0 
      down      0        0        0 

(*) attached flux-logs flux.logs.tar.gz

P.S. I also ran this using the flux update in RU (radical.utils) - it works!
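
A check like the one read off the `flux resources` listing above could be scripted roughly as follows. This is only a sketch: it assumes the column layout (`STATE NNODES NCORES NGPUS NODELIST`) seen in the log, and `parse_flux_resources` is a hypothetical helper, not part of RADICAL-Pilot or Flux:

```python
def parse_flux_resources(text):
    """Parse a resource listing with the columns
    STATE NNODES NCORES NGPUS NODELIST into a list of dicts
    (hypothetical helper, layout assumed from the log above)."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:                   # skip the header line
        fields = line.split()
        rows.append({'STATE'   : fields[0],
                     'NNODES'  : int(fields[1]),
                     'NCORES'  : int(fields[2]),
                     'NGPUS'   : int(fields[3]),
                     # NODELIST is empty when no nodes are in this state
                     'NODELIST': fields[4] if len(fields) > 4 else ''})
    return rows


# listing as reported in this comment
listing = """
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       56        0 frontier00178
 allocated      0        0        0
      down      0        0        0
"""

rows = parse_flux_resources(listing)
gpus_visible = any(row['NGPUS'] > 0 for row in rows)
if not gpus_visible:
    print('flux reports no GPUs in any state')
```

Such a check could run right after the Flux instance starts, so that a GPU task is not submitted to an instance that cannot place it.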

andre-merzky commented 6 months ago

I guess we first need to fix the flux startup then - the missing GPUs might be caused by not using srun to start flux? dunno...