radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

`session.close()` generates an error in a notebook #2901

Closed mturilli closed 1 year ago

mturilli commented 1 year ago

This has been replicated on two Linux hosts, including `three`.

Notebook error:
CellExecutionError in tutorials/configuration.ipynb:
------------------
session.close()
------------------

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[12], line 1
----> 1 session.close()

File ~/ve/docs/lib/python3.10/site-packages/radical/pilot/session.py:284, in Session.close(self, **kwargs)
    282 for pmgr_uid, pmgr in self._pmgrs.items():
    283     self._log.debug("session %s closes pmgr   %s", self._uid, pmgr_uid)
--> 284     pmgr.close(terminate=options.terminate)
    285     self._log.debug("session %s closed pmgr   %s", self._uid, pmgr_uid)
    287 if self._cmgr:

File ~/ve/docs/lib/python3.10/site-packages/radical/pilot/pilot_manager.py:230, in PilotManager.close(self, terminate)
    228 # If terminate is set, we cancel all pilots.
    229 if terminate:
--> 230     self.cancel_pilots(_timeout=20)
    231     # if this cancel op fails and the pilots are s till alive after
    232     # timeout, the pmgr.launcher termination will kill them
    234 self._terminate.set()

File ~/ve/docs/lib/python3.10/site-packages/radical/pilot/pilot_manager.py:869, in PilotManager.cancel_pilots(self, uids, _timeout)
    864 # inform pmgr.launcher - it will force-kill the pilot after some delay
    865 self.publish(rpc.CONTROL_PUBSUB, {'cmd' : 'kill_pilots',
    866                                   'arg' : {'pmgr' : self.uid,
    867                                            'uids' : uids}})
--> 869 self.wait_pilots(uids=uids, timeout=_timeout)

File ~/ve/docs/lib/python3.10/site-packages/radical/pilot/pilot_manager.py:800, in PilotManager.wait_pilots(self, uids, state, timeout)
    797             self._log.debug ("wait timed out")
    798             break
--> 800     time.sleep (0.1)
    802 self._rep.idle(mode='stop')
    804 if to_check: self._rep.warn('>>timeout\n')

KeyboardInterrupt:
KeyboardInterrupt:
eirrgang commented 1 year ago

If this behavior changes, please let me know. Ref: https://github.com/SCALE-MS/scale-ms/blob/d745bf6acb593f9a0f7bec6227a2d044ed829cb2/src/scalems/radical/runtime.py#L992

mturilli commented 1 year ago

This might explain why the compilation of the notebooks stalls when compiling the documentation. We end up with multiple concurrent instances of RP components. I see the list of processes from at least two notebooks alive while running sphinx.

andre-merzky commented 1 year ago

Can you please provide the radical-stack for this? Thanks!

mturilli commented 1 year ago
$ radical-stack

  python               : /home/mturilli/ve/docs/bin/python3
  pythonpath           :
  version              : 3.10.6
  virtualenv           : /home/mturilli/ve/docs

  radical.analytics    : 1.20.1
  radical.entk         : 1.30.0
  radical.gtod         : 1.20.1
  radical.pilot        : 1.21.0
  radical.saga         : 1.21.0
  radical.utils        : 1.21.0

and on `three`:

$ radical-stack

  python               : /home/mturilli/ve-notebooks/bin/python3
  pythonpath           :
  version              : 3.10.6
  virtualenv           : /home/mturilli/ve-notebooks

  radical.analytics    : 1.20.1
  radical.entk         : 1.30.0
  radical.gtod         : 1.20.1
  radical.pilot        : 1.21.0
  radical.saga         : 1.21.0
  radical.utils        : 1.21.0
andre-merzky commented 1 year ago

This is the same thing we always have on KeyboardInterrupts: the pilot description has exit_on_error set (which is the default) and the pilot fails. Setting that flag to False removes the exception.
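
For reference, a minimal sketch of what setting that flag looks like. The resource label, runtime, and core count are illustrative assumptions, not the values used in the configuration notebook; only `exit_on_error` is the setting discussed here.

```python
# Pilot description settings as a plain dict; all values except
# 'exit_on_error' are placeholders chosen for illustration.
pd_init = {
    'resource'     : 'local.localhost',
    'runtime'      : 30,      # minutes
    'cores'        : 4,
    'exit_on_error': False,   # do not raise in the application if the pilot fails
}

# With radical.pilot installed, the dict would be handed over like this
# (assuming `pmgr` is an existing PilotManager):
#   import radical.pilot as rp
#   pdesc = rp.PilotDescription(pd_init)
#   pilot = pmgr.submit_pilots(pdesc)
```

With the flag set to False, a pilot that ends up FAILED during `session.close()` is still reported as failed, but the failure should no longer surface as an exception in the notebook cell.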

The pilot fails because it is submitted, but a cancellation request is issued right afterwards during session termination, before the bootstrapper has a chance to complete. The pilot therefore does not react to the termination request in time, and the session falls back to a hard kill, which fails the pilot. That results in the corresponding state update, and the exception is raised because the pilot description asks for it. So this all works as intended. Whether it should work like this is a different question, but right now it is a policy question, not an implementation question.
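
The sequence described above can be modelled with a small toy, shown here only to make the timing argument concrete. It is not RP code: `ToyPilot`, `PilotFailed`, `close_session`, the state strings, and the timings are all hypothetical stand-ins.

```python
import time

FINAL = ('DONE', 'CANCELED', 'FAILED')


class PilotFailed(RuntimeError):
    """Hypothetical stand-in for the error raised when a pilot fails."""


class ToyPilot:
    """Toy pilot that can honor a cancel request only after bootstrapping."""

    def __init__(self, exit_on_error=True, bootstrap_time=1.0):
        self.exit_on_error = exit_on_error
        self.state = 'PMGR_LAUNCHING'
        self._cancel_requested = False
        self._ready_at = time.monotonic() + bootstrap_time

    def request_cancel(self):
        # The request is only recorded; it is acted upon in poll().
        self._cancel_requested = True

    def poll(self):
        # A graceful cancel is possible once the bootstrapper has finished.
        if self._cancel_requested and time.monotonic() >= self._ready_at:
            self.state = 'CANCELED'

    def hard_kill(self):
        # The fallback: the pilot is killed and ends up FAILED.
        self.state = 'FAILED'
        if self.exit_on_error:
            raise PilotFailed('pilot failed during termination')


def close_session(pilot, timeout=0.2):
    """Request cancellation, wait up to `timeout` seconds, then hard-kill."""
    pilot.request_cancel()
    deadline = time.monotonic() + timeout
    while True:
        pilot.poll()
        if pilot.state in FINAL:
            break
        if time.monotonic() >= deadline:
            pilot.hard_kill()   # may raise, depending on exit_on_error
            break
        time.sleep(0.01)
    return pilot.state
```

In this model, a pilot whose bootstrap takes longer than the timeout always ends up FAILED; whether the caller sees an exception depends only on `exit_on_error`, which is the policy question raised above.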

andre-merzky commented 1 year ago

I pushed an example fix for the configuration notebook to the `nb3` branch.