radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Raptor termination failure #3119

Closed: AymenFJA closed this issue 7 months ago

AymenFJA commented 8 months ago

This is related to https://github.com/radical-cybertools/radical.pilot/issues/3116 and https://github.com/Parsl/parsl/issues/3013.

The main issue is that if a module is missing or an error occurs during the initialization of the master or the worker, RAPTOR does not send a termination signal to the pilot, and as a consequence everything hangs.

andre-merzky commented 8 months ago

The main issue is that if a module is missing or an error occurs during the initialization of the master or the worker, RAPTOR does not send a termination signal to the pilot, and as a consequence everything hangs.

This is really intentional: from the perspective of RP and the pilot, Raptor masters and workers are just tasks, and the pilot should indeed not terminate if those tasks die. It would be up to the application to watch the state of the submitted raptor entities and take action (like pilot termination) if a FAILED state is detected.
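For illustration, a minimal sketch of such an application-side check, assuming the standard rp.TaskManager.register_callback() API and the 'raptor.*' uid naming seen in the state logs below; the helper itself is hypothetical and not part of RP:

import threading
import radical.pilot as rp

def watch_raptor(tmgr: rp.TaskManager) -> threading.Event:
    '''Hypothetical helper: returns an event that is set as soon as a
       raptor master or worker task ends up in FAILED state.'''

    failed = threading.Event()

    def state_cb(task, state):
        # raptor masters and workers are plain tasks; their uids start with
        # 'raptor.' (compare the state logs further down in this thread)
        if task.uid.startswith('raptor.') and state == rp.FAILED:
            print('%s failed: %s' % (task.uid, task.exception))
            failed.set()

    tmgr.register_callback(state_cb)
    return failed

# in the application, instead of blocking indefinitely on raptor results:
#     if watch_raptor(tmgr).wait(timeout=600):
#         pilot.cancel()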

andre-merzky commented 8 months ago

@AymenFJA : please have a look at #3121. It changes the raptor example code to demonstrate how worker failures can be caught by the master, which then terminates, and how the client side reacts to master termination. Is that approach usable for RPEX?

AymenFJA commented 8 months ago

@andre-merzky, it seems like something is wrong, or I am missing something.

This is a small test I ran to check the state change of the MPIWorker; on purpose, there is no mpi4py in the RAPTOR env:

In [17]: worker = raptor.submit_workers(rp.TaskDescription(
    ...:             {'mode': rp.RAPTOR_WORKER,
    ...:              'raptor_class': 'MPIWorker'}))[0]

  task raptor.0000.0001              : TMGR_SCHEDULING_PENDING
  task raptor.0000.0001              : TMGR_SCHEDULING
  task raptor.0000.0001              : TMGR_STAGING_INPUT_PENDING
  task raptor.0000.0001              : TMGR_STAGING_INPUT
  task raptor.0000.0001              : AGENT_STAGING_INPUT_PENDING
  task raptor.0000.0001              : AGENT_STAGING_INPUT
  task raptor.0000.0001              : AGENT_SCHEDULING_PENDING
  task raptor.0000.0001              : AGENT_SCHEDULING
  task raptor.0000.0001              : AGENT_EXECUTING_PENDING
  task raptor.0000.0001              : AGENT_EXECUTING
  task raptor.0000.0001              : AGENT_STAGING_OUTPUT_PENDING
  task raptor.0000.0001              : AGENT_STAGING_OUTPUT
  task raptor.0000.0001              : TMGR_STAGING_OUTPUT_PENDING
  task raptor.0000.0001              : TMGR_STAGING_OUTPUT
  task raptor.0000.0001              : DONE
In [18]: worker.state
Out[18]: 'DONE'

In [19]: worker.exception

In [20]: worker.exit_code
Out[20]: 0

In [21]: worker.exception_detail

In the case above the worker should and must fail, because there is no mpi4py, and the worker's .err file does show a raised exception:

cat raptor.0000.0001.err
Traceback (most recent call last):
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 52, in <module>
    run(sys.argv[1], sys.argv[2], sys.argv[3])
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 30, in run
    worker = cls(raptor_id)
  File "/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/raptor/worker_mpi.py", line 592, in __init__
    from mpi4py import MPI                                            # noqa
ModuleNotFoundError: No module named 'mpi4py'

Instead, I am getting a DONE state. Any ideas? I am happy to open a corresponding ticket regarding why the RAPTOR worker (which is a task) has:

  1. Wrong state
  2. No Exception or Exception details.
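For reference, a rough sketch of how an application could at least detect this silent failure at the time, by waiting with a timeout and inspecting the worker's captured stderr. Whether worker.stderr is populated for raptor workers can depend on the RP version and setup, so treat this as an assumption rather than a guaranteed workaround:

import radical.pilot as rp

def check_worker(tmgr: rp.TaskManager, worker, timeout: float = 300.0):
    '''Sketch: flag a worker that reached a final state but left a traceback.'''

    # waiting with a timeout avoids hanging if the worker never becomes final
    tmgr.wait_tasks(uids=[worker.uid], timeout=timeout)

    if worker.state in rp.FINAL and 'Traceback' in (worker.stderr or ''):
        # e.g. the ModuleNotFoundError for mpi4py shown above, even though
        # the reported state was DONE at the time
        raise RuntimeError('worker %s failed:\n%s' % (worker.uid, worker.stderr))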
AymenFJA commented 8 months ago

@AymenFJA : please have a look at #3121. It changes the raptor example code to demonstrate how worker failures can be caught by the master, which then terminates, and how the client side reacts to master termination. Is that approach usable for RPEX?

@andre-merzky While this is a valid approach to terminating on the worker's failure, it is unfortunately not sufficient for RPEX, at least from my understanding. My main state_cb, which checks for failures and so on, lives in the main Parsl executor and is what should trigger the shutdown from Parsl. Doing that at the master level, which is in a separate file and namespace, gives me no way to tell Parsl that we failed.

andre-merzky commented 7 months ago

The point is that the master can react to the worker's demise by terminating itself (self.stop()), which then triggers the respective callback on the client side, i.e., in your state_cb in the Parsl executor.
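A minimal sketch of that pattern, for illustration: a master subclass that stops itself when a worker dies. Only self.stop() is taken from the comment above; the worker_state_cb hook name and signature are assumptions modeled on the example changed in #3121 and may differ:

import radical.pilot as rp

class WatchfulMaster(rp.raptor.Master):
    '''Sketch: a master that terminates itself when one of its workers dies.'''

    # NOTE: the hook name and signature are assumed here; see #3121 for the
    #       actual example code this is modeled on
    def worker_state_cb(self, worker_dict, state):
        if state in (rp.FAILED, rp.CANCELED):
            print('worker %s ended in %s - stopping master'
                  % (worker_dict.get('uid'), state))
            # stopping the master drives its own task to a final state, which
            # fires the client-side callback (e.g. the state_cb in the Parsl
            # executor) so the application can shut down instead of hanging
            self.stop()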

andre-merzky commented 7 months ago
  • Wrong state

This is addressed in #3123

  • No Exception or Exception details.

That is only available in the FAILED state (so it should work with the above patch).
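In other words, continuing the IPython session from above once the patch is in place (assumed behavior: the broken worker now ends up FAILED and only then carries error information):

# assumed behavior after the patch
worker.wait()                       # block until the worker reaches a final state

if worker.state == rp.FAILED:
    print(worker.exit_code)         # expected to be non-zero now
    print(worker.exception)         # short exception representation
    print(worker.exception_detail)  # longer, traceback-style detail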

andre-merzky commented 7 months ago
  • Wrong state

    This is addressed in #3123

Hotfix release 1.46.2 was pushed to PyPI; it resolves the invalid state transition, and the worker now ends up FAILED on missing module dependencies.
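A quick, illustrative check that the installed release carries the fix:

import radical.pilot as rp

# 1.46.2 or later includes the corrected worker state transition
print(rp.version_detail)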

AymenFJA commented 7 months ago

The point is that the master can react to the worker's demise by terminating itself (self.stop()), which then triggers the respective callback on the client side, i.e., in your state_cb in the Parsl executor.

This makes sense now. Thanks, Andre.

AymenFJA commented 7 months ago

I can confirm this is working now and the state is reported correctly. Thanks @andre-merzky, @mtitov:

In [14]: tmgr.submit_raptors(rp.TaskDescription({'mode': rp.RAPTOR_MASTER}))
Out[14]: [<Raptor object, uid raptor.0000>]

  task raptor.0000                   : TMGR_SCHEDULING_PENDING
  task raptor.0000                   : TMGR_SCHEDULING
  task raptor.0000                   : TMGR_STAGING_INPUT_PENDING
  task raptor.0000                   : TMGR_STAGING_INPUT
  task raptor.0000                   : AGENT_STAGING_INPUT_PENDING
  task raptor.0000                   : AGENT_STAGING_INPUT
  task raptor.0000                   : AGENT_SCHEDULING_PENDING
  task raptor.0000                   : AGENT_SCHEDULING
  task raptor.0000                   : AGENT_EXECUTING_PENDING
  task raptor.0000                   : AGENT_EXECUTING
In [15]: tmgr.submit_workers(rp.TaskDescription({'mode': rp.RAPTOR_WORKER, 'raptor_class': 'MPIWorker', 'raptor_id': 'raptor.0000'}))
Out[15]: [<RaptorWorker object, uid raptor.0000.0000>]
  task raptor.0000.0000              : TMGR_SCHEDULING_PENDING
  task raptor.0000.0000              : TMGR_SCHEDULING
  task raptor.0000.0000              : TMGR_STAGING_INPUT_PENDING
  task raptor.0000.0000              : TMGR_STAGING_INPUT
  task raptor.0000.0000              : AGENT_STAGING_INPUT_PENDING
  task raptor.0000.0000              : AGENT_STAGING_INPUT
  task raptor.0000.0000              : AGENT_SCHEDULING_PENDING
  task raptor.0000.0000              : AGENT_SCHEDULING
  task raptor.0000.0000              : AGENT_EXECUTING_PENDING
  task raptor.0000.0000              : AGENT_EXECUTING
  task raptor.0000.0000              : AGENT_STAGING_OUTPUT_PENDING
  task raptor.0000.0000              : AGENT_STAGING_OUTPUT
  task raptor.0000.0000              : TMGR_STAGING_OUTPUT_PENDING
  task raptor.0000.0000              : TMGR_STAGING_OUTPUT
  task raptor.0000.0000              : FAILED