Closed: andre-merzky closed this 6 months ago
@andre-merzky I've tested it on Frontier and got the following errors (the one in flux.0040.dump is probably not crucial, but the one in agent_executing.0000.log is):
$ cat flux.0039.log
1710449568.901 : flux.0039 : 19460 : 140737179921280 : DEBUG : flux output: URI:local:///tmp/flux-I1uzqO/local-0
1710449568.903 : flux.0039 : 19460 : 140731208460032 : INFO : starting flux watcher
1710449568.903 : flux.0039 : 19460 : 140737179921280 : INFO : flux startup successful: [ssh://frontier07386.frontier.olcf.ornl.gov/tmp/flux-I1uzqO/local-0]
$ cat flux.0040.dump
connect flux 19424: ssh://frontier07386.frontier.olcf.ornl.gov/tmp/flux-I1uzqO/local-0
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/bin/radical-pilot-agent_0", line 17, in <module>
agent = rp.Agent_0()
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/agent_0.py", line 47, in __init__
self._session = Session(uid=cfg.sid, cfg=cfg, _role=Session._AGENT_0)
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/session.py", line 192, in __init__
elif self._role == self._AGENT_0: self._init_agent_0()
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/session.py", line 288, in _init_agent_0
self._init_rm()
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/session.py", line 812, in _init_rm
self._rm = ResourceManager.create(name=rname,
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 382, in create
return rm(cfg, rcfg, log, prof)
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 161, in __init__
self._prepare_launch_methods()
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 339, in _prepare_launch_methods
self._launchers[lm_name] = rpa.LaunchMethod.create(
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/launch_method/base.py", line 148, in create
return impl[name](name, lm_cfg, rm_info, log, prof)
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/launch_method/flux.py", line 18, in __init__
LaunchMethod.__init__(self, name, lm_cfg, rm_info, session, prof)
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/launch_method/base.py", line 91, in __init__
self._init_from_info(lm_info)
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/launch_method/flux.py", line 62, in _init_from_info
self._fh.connect_flux(uri=self._details['flux_uri'])
File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/utils/flux.py", line 419, in connect_flux
for l in get_stacktrace():
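(For reference, a minimal sketch of what the failing connect step attempts, assuming the flux-core Python bindings are importable; the URI is copied verbatim from the log above, and the exact internals of radical.utils' connect_flux may differ:)

```python
# Minimal sketch, assuming the flux-core Python bindings are installed.
# connect_flux() in radical.utils presumably opens a broker handle by URI,
# roughly like this; the URI is the one printed in flux.0039.log above.
import flux

uri = 'ssh://frontier07386.frontier.olcf.ornl.gov/tmp/flux-I1uzqO/local-0'
handle = flux.Flux(uri)          # raises if the broker is unreachable
print(handle.attr_get('size'))   # query a broker attribute to prove the handle works
```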
$ cat agent_executing.0000.log
...
1710449602.816 : agent_executing.0000 : 20215 : 140735505663744 : DEBUG : submit tasks: [{'tasks': [{'slot': 'task', 'count': {'per_slot': 1}, 'command': ['/bin/sh', '-c', '/lustre/orion/scratch/matitov/chm155/radical.pilot.sandbox/rp.session.frontier07386.matitov.019796.0000/pilot.0000/task.000000//task.000000.exec.sh 1>/lustre/orion/scratch/matitov/chm155/radical.pilot.sandbox/rp.session.frontier07386.matitov.019796.0000/pilot.0000/task.000000//task.000000.out 2>/lustre/orion/scratch/matitov/chm155/radical.pilot.sandbox/rp.session.frontier07386.matitov.019796.0000/pilot.0000/task.000000//task.000000.err']}], 'attributes': {'system': {'cwd': '/lustre/orion/scratch/matitov/chm155/radical.pilot.sandbox/rp.session.frontier07386.matitov.019796.0000/pilot.0000/task.000000/', 'duration': 0}}, 'version': 1, 'resources': [{'count': 8, 'type': 'slot', 'label': 'task', 'with': [{'count': 1, 'type': 'core'}, {'count': 1.0, 'type': 'gpu'}]}]}]
1710449603.099 : agent_executing.0000 : 20215 : 140735505663744 : ERROR : work <bound method Flux.work of <radical.pilot.agent.executing.flux.Flux object at 0x7fffe4cd9d00>> failed
Traceback (most recent call last):
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/utils/component.py", line 1032, in work_cb
    self._workers[state](things)
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/pilot/agent/executing/flux.py", line 129, in work
    jids = self._lm.fh.submit_jobs([jd for jd in jds])
  File "/lustre/orion/chm155/scratch/matitov/flux/ve.rp/lib/python3.9/site-packages/radical/utils/flux.py", line 477, in submit_jobs
    flux_id = fut.get_id()
  File "/usr/lib64/python3.6/site-packages/flux/job/submit.py", line 26, in get_id
    return submit_get_id(self)
  File "/usr/lib64/python3.6/site-packages/flux/util.py", line 70, in func_wrapper
    raise EnvironmentError(error.errno, errmsg.decode("utf-8")) from None
PermissionError: [Errno 1] count must be an int or mapping
1710449603.100 : agent_executing.0000 : 20215 : 140735505663744 : DEBUG : advance bulk: 1 [False, True, FAILED]
(*) single task with 8 ranks, 1 core per rank and 1 gpu per rank, and it has its own conda env (which is activated in its pre-exec section)
Thanks @mtitov - this was caused by our float type for gpus_per_rank and should be fixed now - please do give it another try.
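(For illustration, a minimal sketch of why the float matters, not RADICAL-Pilot's actual fix: the jobspec in the log above carries {'count': 1.0, 'type': 'gpu'}, and Flux's jobspec validation only accepts integer or mapping counts, hence the "count must be an int or mapping" error. Coercing the per-rank GPU count to int before building the jobspec avoids it:)

```python
# Minimal sketch, not RADICAL-Pilot's actual code: coerce float resource
# counts to int so the jobspec passes Flux's validation.

def sanitize_counts(resources):
    # 'resources' mirrors the jobspec fragment from the log above;
    # nested 'with' lists are handled recursively.
    for res in resources:
        count = res.get('count')
        if isinstance(count, float) and count.is_integer():
            res['count'] = int(count)
        sanitize_counts(res.get('with', []))
    return resources

resources = [{'count': 8, 'type': 'slot', 'label': 'task',
              'with': [{'count': 1,   'type': 'core'},
                       {'count': 1.0, 'type': 'gpu'}]}]   # 1.0 triggered the error

print(sanitize_counts(resources))
```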
Is this related to flux-framework/flux-core#3372?
I don't think so - we don't use sched-simple. Maybe the remaining float type was causing this - how does it look now? Anything in the flux*.log files?
Attention: Patch coverage is 83.39350% with 46 lines in your changes missing coverage. Please review.
Project coverage is 44.25%. Comparing base (c2489e0) to head (5a664f5).
Seems that Flux doesn't see GPUs (a quick check is sketched at the end of this comment):
flux resources [ 0 ]:
        STATE NNODES NCORES NGPUS NODELIST
         free      1     56     0 frontier00178
    allocated      0      0     0
         down      0      0     0
(*) attached flux logs: flux.logs.tar.gz
p.s. I've also made this run using the flux update in RU (radical.utils) - it works!
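(The check mentioned above, as a hedged sketch: parse the default `flux resource list` output, whose column layout matches the snippet above, and look at the NGPUS column of the free row. The helper name is ours, not part of any library:)

```python
# Hedged sketch: check whether the Flux instance exposes any GPUs by
# parsing the default `flux resource list` columns shown above
# (STATE NNODES NCORES NGPUS NODELIST).
import subprocess

def free_ngpus():
    out = subprocess.run(['flux', 'resource', 'list'],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        cols = line.split()
        if cols and cols[0] == 'free':
            return int(cols[3])   # NGPUS column
    return 0

if free_ngpus() == 0:
    print('Flux sees no GPUs - check how the broker was started')
```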
I guess we first need to fix the flux startup then - the missing GPUs might be caused by not using srun to start flux? dunno...
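(One possible shape of that, as a sketch only - the srun flags are illustrative assumptions for a Frontier-like node, not RADICAL-Pilot's actual launcher command: start the broker under srun so Slurm's GPU allocation is in place when Flux probes the hardware:)

```python
# Sketch only: launch the Flux broker via srun so that the Slurm GPU
# allocation is visible when Flux discovers resources, then list them.
import shlex, subprocess

cmd = ('srun --nodes=1 --ntasks=1 --gpus-per-node=8 '
       'flux start flux resource list')
subprocess.run(shlex.split(cmd), check=True)
```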
This fixes #3114