radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html
Other
54 stars 23 forks source link

Error in Flux LM initialization #2837

Closed mtitov closed 10 months ago

mtitov commented 1 year ago
# cat /root/radical.pilot.sandbox/rp.session.1ed8357e-b63c-11ed-b875-0242ac110003/pilot.0000/agent.0.log
...
1677460516.783 : agent.0              : 4349  : 140182984030016 : DEBUG    : prepare lm FLUX
1677460516.879 : agent.0              : 4370  : 140182984030016 : DEBUG    : starting flux
1677460517.454 : agent.0              : 4349  : 140182984030016 : ERROR    : unusable lm FLUX
Traceback (most recent call last):
  File "/ve_exaworks/lib/python3.7/site-packages/radical/pilot/agent/launch_method/base.py", line 146, in create
    return impl[name](name, lm_cfg, rm_info, log, prof)
  File "/ve_exaworks/lib/python3.7/site-packages/radical/pilot/agent/launch_method/flux.py", line 18, in __init__
    LaunchMethod.__init__(self, name, lm_cfg, rm_info, session, prof)
  File "/ve_exaworks/lib/python3.7/site-packages/radical/pilot/agent/launch_method/base.py", line 82, in __init__
    lm_info = self._envp.get()
  File "/ve_exaworks/lib/python3.7/site-packages/radical/utils/env.py", line 554, in get
    raise exc                         # pylint: disable=raising-bad-type
RuntimeError
1677460517.455 : agent.0              : 4349  : 140182984030016 : ERROR    : skip lm FLUX
Traceback (most recent call last):
  File "/ve_exaworks/lib/python3.7/site-packages/radical/pilot/agent/resource_manager/base.py", line 336, in _prepare_launch_methods
    lm_name, lm_cfg, rm_info, self._log, self._prof)
  File "/ve_exaworks/lib/python3.7/site-packages/radical/pilot/agent/launch_method/base.py", line 146, in create
    return impl[name](name, lm_cfg, rm_info, log, prof)
  File "/ve_exaworks/lib/python3.7/site-packages/radical/pilot/agent/launch_method/flux.py", line 18, in __init__
    LaunchMethod.__init__(self, name, lm_cfg, rm_info, session, prof)
  File "/ve_exaworks/lib/python3.7/site-packages/radical/pilot/agent/launch_method/base.py", line 82, in __init__
    lm_info = self._envp.get()
  File "/ve_exaworks/lib/python3.7/site-packages/radical/utils/env.py", line 554, in get
    raise exc                         # pylint: disable=raising-bad-type
RuntimeError
``` # cat /root/radical.pilot.sandbox/rp.session.1ed8357e-b63c-11ed-b875-0242ac110003/pilot.0000/flux.0003.log 1677460516.902 : flux.0003 : 4370 : 140182984030016 : DEBUG : flux-broker: boot: rank=0 size=1 time 0.006s 1677460516.903 : flux.0003 : 4370 : 140182984030016 : DEBUG : flux-broker: parent: none 1677460516.903 : flux.0003 : 4370 : 140182984030016 : DEBUG : flux-broker: child: none 2023-02-27T01:15:16.914293Z broker.debug[0]: insmod connector-local 2023-02-27T01:15:16.914511Z broker.info[0]: start: none->join 0.0120357s 2023-02-27T01:15:16.914528Z broker.info[0]: parent-none: join->init 5.552e-06s 2023-02-27T01:15:16.916052Z connector-local.debug[0]: allow-guest-user=false 2023-02-27T01:15:16.916089Z connector-local.debug[0]: allow-root-owner=false 2023-02-27T01:15:16.945486Z broker.debug[0]: insmod barrier 2023-02-27T01:15:16.962794Z broker.debug[0]: insmod content-sqlite 2023-02-27T01:15:16.965071Z broker.debug[0]: content backing store: enabled content-sqlite 2023-02-27T01:15:16.979430Z broker.debug[0]: insmod kvs 2023-02-27T01:15:16.994321Z broker.debug[0]: insmod kvs-watch 2023-02-27T01:15:17.009824Z broker.debug[0]: insmod resource 2023-02-27T01:15:17.029726Z resource.debug[0]: xml 1 ranks in 1 objects 2023-02-27T01:15:17.030778Z resource.debug[0]: reslog_cb: resource-init event posted 2023-02-27T01:15:17.031742Z resource.debug[0]: reslog_cb: resource-define event posted 2023-02-27T01:15:17.044788Z broker.debug[0]: insmod cron 2023-02-27T01:15:17.045552Z cron.info[0]: synchronizing cron tasks to event heartbeat.pulse 2023-02-27T01:15:17.062150Z broker.debug[0]: insmod job-manager 2023-02-27T01:15:17.062857Z job-manager.info[0]: restart: 0 jobs 2023-02-27T01:15:17.062950Z job-manager.info[0]: restart: 0 running jobs 2023-02-27T01:15:17.063221Z job-manager.info[0]: restart: no checkpoint object 2023-02-27T01:15:17.063254Z job-manager.debug[0]: restart: max_jobid=0 2023-02-27T01:15:17.078024Z broker.debug[0]: insmod job-info 2023-02-27T01:15:17.103645Z broker.debug[0]: insmod job-list 2023-02-27T01:15:17.104532Z job-list.debug[0]: job_state_init_from_kvs: read 0 jobs 2023-02-27T01:15:17.120594Z broker.debug[0]: insmod job-ingest 2023-02-27T01:15:17.121322Z job-ingest.debug[0]: fluid ts=1ms 2023-02-27T01:15:17.134844Z broker.debug[0]: insmod job-exec 2023-02-27T01:15:17.135356Z job-exec.debug[0]: using default shell path /usr/libexec/flux/flux-shell 2023-02-27T01:15:17.149671Z broker.debug[0]: insmod heartbeat 2023-02-27T01:15:17.152399Z broker.info[0]: rc1.0: running /usr/etc/flux/rc1.d/01-enclosing-instance 2023-02-27T01:15:17.167246Z broker.info[0]: rc1.0: running /usr/etc/flux/rc1.d/01-sched-fluxion-resource-start 2023-02-27T01:15:17.277384Z broker.debug[0]: insmod sched-fluxion-resource 2023-02-27T01:15:17.277878Z sched-fluxion-resource.debug[0]: mod_main: resource module starting 2023-02-27T01:15:17.279521Z sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from core's resource.acquire 2023-02-27T01:15:17.280088Z sched-fluxion-resource.debug[0]: resource status changed (rankset=[all] status=DOWN) 2023-02-27T01:15:17.280133Z sched-fluxion-resource.debug[0]: resource status changed (rankset=[0] status=UP) 2023-02-27T01:15:17.280254Z broker.info[0]: rc1.0: running /usr/etc/flux/rc1.d/02-sched-fluxion-qmanager-start 2023-02-27T01:15:17.280144Z sched-fluxion-resource.debug[0]: mod_main: resource graph database loaded 2023-02-27T01:15:17.319530Z broker.debug[0]: insmod sched-fluxion-qmanager 2023-02-27T01:15:17.320399Z sched-fluxion-qmanager.debug[0]: service_register 2023-02-27T01:15:17.320478Z sched-fluxion-qmanager.debug[0]: enforced policy (queue=default): fcfs 2023-02-27T01:15:17.320495Z sched-fluxion-qmanager.debug[0]: effective queue params (queue=default): default 2023-02-27T01:15:17.320502Z sched-fluxion-qmanager.debug[0]: effective policy params (queue=default): default 2023-02-27T01:15:17.321060Z sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed 2023-02-27T01:15:17.321262Z job-manager.debug[0]: scheduler: hello 2023-02-27T01:15:17.321550Z job-manager.debug[0]: scheduler: ready unlimited 2023-02-27T01:15:17.321711Z sched-fluxion-qmanager.debug[0]: handshaking with job-manager completed 2023-02-27T01:15:17.426546Z broker.info[0]: rc1.0: /bin/bash -c /usr/etc/flux/rc1 Exited (rc=0) 0.5s 2023-02-27T01:15:17.426638Z broker.info[0]: rc1-success: init->quorum 0.512101s 2023-02-27T01:15:17.426668Z broker.info[0]: quorum-full: quorum->run 1.7744e-05s 1677460517.435 : flux.0003 : 4370 : 140182984030016 : DEBUG : export FLUX_PMI_LIBRARY_PATH="/usr/lib/flux/libpmi.so" 1677460517.435 : flux.0003 : 4370 : 140182984030016 : DEBUG : FLUX_PMI_LIBRARY_PATH = /usr/lib/flux/libpmi.so 1677460517.435 : flux.0003 : 4370 : 140182984030016 : DEBUG : export LUA_PATH="/usr/share/lua/5.1/?.lua;;;" 1677460517.435 : flux.0003 : 4370 : 140182984030016 : DEBUG : LUA_PATH = /usr/share/lua/5.1/?.lua;;; 1677460517.435 : flux.0003 : 4370 : 140182984030016 : DEBUG : export FLUX_EXEC_PATH="/usr/libexec/flux/cmd" 1677460517.435 : flux.0003 : 4370 : 140182984030016 : DEBUG : FLUX_EXEC_PATH = /usr/libexec/flux/cmd 1677460517.435 : flux.0003 : 4370 : 140182984030016 : DEBUG : export PYTHONPATH="/usr/lib/flux/python3.7:/ve_exaworks/rp_install/lib/python3.7/site-packages:/ve_exaworks/lib/python3.7/site-packages:/usr/lib64/flux/python3.6:/usr/lib/flux/python3.6" 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : PYTHONPATH = /usr/lib/flux/python3.7:/ve_exaworks/rp_install/lib/python3.7/site-packages:/ve_exaworks/lib/python3.7/site-packages:/usr/lib64/flux/python3.6:/usr/lib/flux/python3.6 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : export LUA_CPATH="/usr/lib64/lua/5.1/?.so;;;" 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : LUA_CPATH = /usr/lib64/lua/5.1/?.so;;; 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : export FLUX_CONNECTOR_PATH="/usr/lib/flux/connectors" 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : FLUX_CONNECTOR_PATH = /usr/lib/flux/connectors 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : export FLUX_URI="local:///tmp/flux-BaZIM2/local-0" 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : FLUX_URI = local:///tmp/flux-BaZIM2/local-0 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : export MANPATH="/usr/share/man" 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : MANPATH = /usr/share/man 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : export FLUX_MODULE_PATH="/usr/lib/flux/modules" 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : FLUX_MODULE_PATH = /usr/lib/flux/modules 1677460517.436 : flux.0003 : 4370 : 140182984030016 : DEBUG : OK 1677460517.437 : flux.0003 : 4370 : 140182649018112 : INFO : starting flux watcher 1677460517.437 : flux.0003 : 4370 : 140182984030016 : INFO : flux startup successful: [ssh://4502cd6d07c3/tmp/flux-BaZIM2/local-0] 1677460517.437 : flux.0003 : 4370 : 140182657410816 : DEBUG : listen for events 2023-02-27T01:15:18.437598Z broker.err[0]: rc2.0: /bin/bash Broken pipe (rc=141) 1.0s 2023-02-27T01:15:18.437671Z broker.info[0]: rc2-fail: run->cleanup 1.01108s 2023-02-27T01:15:18.449111Z broker.info[0]: cleanup.0: /bin/bash -c flux queue stop --quiet Exited (rc=0) 0.0s 2023-02-27T01:15:18.461058Z broker.info[0]: cleanup.1: /bin/bash -c flux job cancelall --user=all --quiet -f --states RUN Exited (rc=0) 0.0s 2023-02-27T01:15:18.482926Z broker.info[0]: cleanup.2: /bin/bash -c flux queue idle --quiet Exited (rc=0) 0.0s 2023-02-27T01:15:18.483002Z broker.info[0]: cleanup-success: cleanup->shutdown 0.0453192s 2023-02-27T01:15:18.483077Z broker.info[0]: children-none: shutdown->finalize 6.4164e-05s 2023-02-27T01:15:18.493852Z broker.info[0]: rc3.0: running /usr/etc/flux/rc3.d/01-sched-fluxion-qmanager-stop 2023-02-27T01:15:18.508185Z broker.debug[0]: rmmod sched-fluxion-qmanager 2023-02-27T01:15:18.508397Z sched-fluxion-qmanager.debug[0]: service_unregister 2023-02-27T01:15:18.508686Z broker.debug[0]: module sched-fluxion-qmanager exited 2023-02-27T01:15:18.509110Z sched-fluxion-resource.debug[0]: disconnect_request_cb: a notify request aborted 2023-02-27T01:15:18.509164Z job-manager.debug[0]: alloc: stop due to disconnect: Success 2023-02-27T01:15:18.510235Z broker.info[0]: rc3.0: running /usr/etc/flux/rc3.d/02-sched-fluxion-resource-stop 2023-02-27T01:15:18.525590Z broker.debug[0]: rmmod sched-fluxion-resource 2023-02-27T01:15:18.525916Z broker.debug[0]: module sched-fluxion-resource exited 2023-02-27T01:15:18.526073Z resource.debug[0]: aborted 1 resource.acquire(s) 2023-02-27T01:15:18.534952Z broker.debug[0]: rmmod heartbeat 2023-02-27T01:15:18.535127Z broker.debug[0]: module heartbeat exited 2023-02-27T01:15:18.552869Z broker.debug[0]: rmmod resource 2023-02-27T01:15:18.553063Z broker.debug[0]: module resource exited 2023-02-27T01:15:18.561726Z broker.debug[0]: rmmod job-exec 2023-02-27T01:15:18.562333Z broker.debug[0]: module job-exec exited 2023-02-27T01:15:18.572686Z broker.debug[0]: rmmod job-list 2023-02-27T01:15:18.573087Z broker.debug[0]: module job-list exited 2023-02-27T01:15:18.582384Z broker.debug[0]: rmmod job-info 2023-02-27T01:15:18.582575Z broker.debug[0]: module job-info exited 2023-02-27T01:15:18.593011Z broker.debug[0]: rmmod job-manager 2023-02-27T01:15:18.594041Z broker.debug[0]: module job-manager exited 2023-02-27T01:15:18.607542Z broker.debug[0]: rmmod job-ingest 2023-02-27T01:15:18.608019Z broker.debug[0]: module job-ingest exited 2023-02-27T01:15:18.620162Z broker.debug[0]: rmmod cron 2023-02-27T01:15:18.620568Z broker.debug[0]: module cron exited 2023-02-27T01:15:18.631832Z broker.debug[0]: rmmod barrier 2023-02-27T01:15:18.632024Z broker.debug[0]: module barrier exited 2023-02-27T01:15:18.644344Z broker.debug[0]: rmmod kvs-watch 2023-02-27T01:15:18.644600Z broker.debug[0]: module kvs-watch exited 2023-02-27T01:15:18.657270Z broker.debug[0]: rmmod kvs 2023-02-27T01:15:18.658247Z broker.debug[0]: module kvs exited 2023-02-27T01:15:18.688846Z broker.debug[0]: rmmod content-sqlite 2023-02-27T01:15:18.689029Z broker.debug[0]: content backing store: disabled 2023-02-27T01:15:18.689616Z broker.debug[0]: module content-sqlite exited 2023-02-27T01:15:18.692428Z broker.info[0]: rc3.0: /bin/bash -c /usr/etc/flux/rc3 Exited (rc=0) 0.2s 2023-02-27T01:15:18.692505Z broker.info[0]: rc3-success: finalize->exit 0.209416s 2023-02-27T01:15:18.692604Z broker.debug[0]: rmmod connector-local 2023-02-27T01:15:18.693095Z broker.debug[0]: module connector-local exited ```
andre-merzky commented 1 year ago

We didn't look at flux for quite a time, might be worthwhile to add to the tests, even though the deployment will eat up quite some time.

this sticks out:

2023-02-27T01:15:18.437598Z broker.err[0]: rc2.0: /bin/bash Broken pipe (rc=141) 1.0s Hard to tell why the pipe fails - we need to check our plumbing for the flux stdio grab I guess...

What is the priority on this? And what Flux version did you happen to use?

mtitov commented 1 year ago

What is the priority on this?

~I've prepared a PR within ExaWorks to run RP-Flux as part of the testing, thus would be good to keep it open not for a long time - 1-2 sprints (?)~

And what Flux version did you happen to use?

that came out while running SDK docker container which has flux-core=0.28.0, flux-sched=0.17.0. Flux spack package has the latest versions, but our exaworks spack package failing due to swift/t dependency error.

mtitov commented 1 year ago

For test purposes we can use a corresponding testing environment - https://github.com/ExaWorks/containerized-testing-environment/tree/main/psij/flux

mtitov commented 1 year ago

Tried to change the command to start FLUX, to imitate tty with no luck Original command we have in RU

        check = 'flux env; echo "OK"; while true; do echo "ok"; sleep 1; done'
        start = 'flux start -o,-v,-S,log-filename=%s.log' % self._uid
        cmd   = '/bin/bash -c "echo \\\"%s\\\" | %s"' % (check, start)

(*) with the [relatively] new update in Flux, flux start should be launched either within tty or with executable (e.g., flux start <executable>)

Work in progress..

andre-merzky commented 1 year ago

Flux team responded with a recipe to run Flux w/o pty - @andre-merzky to apply patch

andre-merzky commented 1 year ago

https://github.com/flux-framework/flux-core/issues/5427

mturilli commented 1 year ago

@andre-merzky could you open a PR draft while you work on this? The idea is to start to review early to speed up the process as much as possible (i.e., supporting Frontier).