Closed: @AymenFJA closed this issue 7 months ago.
The main issue is that if a missing module or another error occurs during the initialization of the master or worker, RAPTOR does not send a termination signal to the pilot, and as a consequence everything hangs.
This is intentional, really: from the perspective of RP and the pilot, Raptor masters and workers are just tasks, and the pilot should indeed not terminate if those tasks die. It is up to the application to watch the state of submitted Raptor entities and take action (such as pilot termination) if a FAILED state is detected.
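As a rough illustration of that pattern (a sketch, not RP's actual API: the state names follow RP's final task states, but the `pilot` object, its `cancel()` call, and the termination policy here are stand-ins), a client-side callback that reacts to a failed Raptor entity might look like:

```python
# Sketch: the application watches task state updates and terminates the
# pilot itself when a Raptor master or worker fails, instead of waiting
# for RAPTOR to signal the pilot (which it intentionally does not do).

FINAL_STATES = {'DONE', 'FAILED', 'CANCELED'}


def should_terminate(task_uid: str, state: str) -> bool:
    '''True if this state update warrants pilot termination: a Raptor
       entity (uid like "raptor.0000" or "raptor.0000.0001") FAILED.'''
    return task_uid.startswith('raptor.') and state == 'FAILED'


def state_cb(task_uid, state, pilot=None):
    # in a real run this would be registered with the task manager,
    # e.g. via `tmgr.register_callback(...)`
    print('task %s : %s' % (task_uid, state))
    if should_terminate(task_uid, state) and pilot is not None:
        pilot.cancel()   # tear the pilot down instead of hanging
```

The policy in `should_terminate` is deliberately simple; an application could equally decide to resubmit the worker or only terminate after repeated failures.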
@AymenFJA: please have a look at #3121. It changes the Raptor example code to demonstrate how worker failures can be caught by the master, which then terminates, and how the client side reacts to master termination. Is that approach usable for RPEX?
@andre-merzky, it seems like something is wrong, or I am missing something. This is a small test I did to check the state change of the MPI worker; on purpose, mpi4py is missing from the RAPTOR env:
In [17]: worker = raptor.submit_workers(rp.TaskDescription(
    ...:     {'mode'        : rp.RAPTOR_WORKER,
    ...:      'raptor_class': 'MPIWorker'}))[0]
task raptor.0000.0001 : TMGR_SCHEDULING_PENDING
task raptor.0000.0001 : TMGR_SCHEDULING
task raptor.0000.0001 : TMGR_STAGING_INPUT_PENDING
task raptor.0000.0001 : TMGR_STAGING_INPUT
task raptor.0000.0001 : AGENT_STAGING_INPUT_PENDING
task raptor.0000.0001 : AGENT_STAGING_INPUT
task raptor.0000.0001 : AGENT_SCHEDULING_PENDING
task raptor.0000.0001 : AGENT_SCHEDULING
task raptor.0000.0001 : AGENT_EXECUTING_PENDING
task raptor.0000.0001 : AGENT_EXECUTING
task raptor.0000.0001 : AGENT_STAGING_OUTPUT_PENDING
task raptor.0000.0001 : AGENT_STAGING_OUTPUT
task raptor.0000.0001 : TMGR_STAGING_OUTPUT_PENDING
task raptor.0000.0001 : TMGR_STAGING_OUTPUT
task raptor.0000.0001 : DONE
In [18]: worker.state
Out[18]: 'DONE'
In [19]: worker.exception
In [20]: worker.exit_code
Out[20]: 0
In [21]: worker.exception_detail
In the case above, the worker should and must fail because there is no mpi4py, and worker.err indeed shows that an exception was raised:
cat raptor.0000.0001.err
Traceback (most recent call last):
File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 52, in <module>
run(sys.argv[1], sys.argv[2], sys.argv[3])
File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 30, in run
worker = cls(raptor_id)
File "/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/raptor/worker_mpi.py", line 592, in __init__
from mpi4py import MPI # noqa
ModuleNotFoundError: No module named 'mpi4py'
Instead, I am getting a DONE state. Any ideas? I am happy to open a corresponding ticket regarding why the RAPTOR worker (which is a task) has:
> @AymenFJA: please have a look at #3121. It changes the Raptor example code to demonstrate how worker failures can be caught by the master, which then terminates, and how the client side reacts to master termination. Is that approach usable for RPEX?
@andre-merzky: while this is a valid approach to terminating on the worker's failure, it is unfortunately not sufficient for RPEX, at least from my understanding. I have my main state_cb, which checks for failures and so on, in the main Parsl executor, where it should trigger the shutdown from Parsl. Doing that at the master level, which lives in a separate file and namespace, gives me no way to tell Parsl that we failed.
The point is that the master can react to the worker's demise by terminating itself (self.stop()), which then triggers the respective callback on the client side, i.e., in your state_cb in the Parsl executor.
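A minimal sketch of that master-side reaction, with the caveat that the hook name `worker_state_cb` is illustrative and the actual override point in `rp.raptor.Master` may differ between RP versions:

```python
# Sketch: a Raptor master that stops itself as soon as one of its
# workers fails.  Master.stop() ends the master task, which surfaces
# as a state change on the client side, where the Parsl executor's
# state_cb can then trigger the Parsl shutdown.

class WatchfulMaster:                  # stands in for rp.raptor.Master
    def __init__(self):
        self.stopped = False

    def stop(self):
        # the real Master.stop() terminates the master task
        self.stopped = True

    def worker_state_cb(self, worker_uid, state):
        # terminate the master as soon as any worker reaches FAILED
        if state == 'FAILED':
            self.stop()
```

This keeps the failure-handling policy in one place (the master), while the client-side callback only needs to watch the master task's state.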
> Wrong state

This is addressed in #3123.

> No Exception or Exception details.

Those are only available in the FAILED state (so this should work with the above patch).
Hotfix release 1.46.2 was pushed to PyPI, which resolves the invalid state transition: the worker now ends up in FAILED on missing module dependencies.
> The point is that the master can react to the worker's demise by terminating itself (self.stop()), which then triggers the respective callback on the client side, i.e., in your state_cb in the Parsl executor.
This makes sense now. Thanks, Andre.
I can confirm this is working now, and the state is reported correctly. Thanks @andre-merzky, @mtitov:
In [14]: tmgr.submit_raptors(rp.TaskDescription({'mode': rp.RAPTOR_MASTER}))
Out[14]: [<Raptor object, uid raptor.0000>]
task raptor.0000 : TMGR_SCHEDULING_PENDING
task raptor.0000 : TMGR_SCHEDULING
task raptor.0000 : TMGR_STAGING_INPUT_PENDING
task raptor.0000 : TMGR_STAGING_INPUT
task raptor.0000 : AGENT_STAGING_INPUT_PENDING
task raptor.0000 : AGENT_STAGING_INPUT
task raptor.0000 : AGENT_SCHEDULING_PENDING
task raptor.0000 : AGENT_SCHEDULING
task raptor.0000 : AGENT_EXECUTING_PENDING
task raptor.0000 : AGENT_EXECUTING
In [15]: tmgr.submit_workers(rp.TaskDescription({'mode': rp.RAPTOR_WORKER, 'raptor_class': 'MPIWorker', 'raptor_id': 'raptor.0000'}))
Out[15]: [<RaptorWorker object, uid raptor.0000.0000>]
task raptor.0000.0000 : TMGR_SCHEDULING_PENDING
task raptor.0000.0000 : TMGR_SCHEDULING
task raptor.0000.0000 : TMGR_STAGING_INPUT_PENDING
task raptor.0000.0000 : TMGR_STAGING_INPUT
task raptor.0000.0000 : AGENT_STAGING_INPUT_PENDING
task raptor.0000.0000 : AGENT_STAGING_INPUT
task raptor.0000.0000 : AGENT_SCHEDULING_PENDING
task raptor.0000.0000 : AGENT_SCHEDULING
task raptor.0000.0000 : AGENT_EXECUTING_PENDING
task raptor.0000.0000 : AGENT_EXECUTING
task raptor.0000.0000 : AGENT_STAGING_OUTPUT_PENDING
task raptor.0000.0000 : AGENT_STAGING_OUTPUT
task raptor.0000.0000 : TMGR_STAGING_OUTPUT_PENDING
task raptor.0000.0000 : TMGR_STAGING_OUTPUT
task raptor.0000.0000 : FAILED
This is related to https://github.com/radical-cybertools/radical.pilot/issues/3116 and https://github.com/Parsl/parsl/issues/3013.