radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html
Other
54 stars 23 forks source link

move timeout into `kill_pilots` method to delay forced termination #2962

Closed mtitov closed 1 year ago

mtitov commented 1 year ago

@andre-merzky would it be ok to move that timeout into kill_pilots method? thus we wouldn't wait for [fixed] _timeout seconds, if pilot will reach the final state before that

andre-merzky commented 1 year ago

@andre-merzky would it be ok to move that timeout into kill_pilots method? thus we wouldn't wait for [fixed] _timeout seconds, if pilot will reach the final state before that

It's just moving the problem around: with that approach any attempt to directly call kill_pilots would wait for n seconds before doing the kill which is not really intuitive either...

mtitov commented 1 year ago

@andre-merzky would it be ok to move that timeout into kill_pilots method? thus we wouldn't wait for [fixed] _timeout seconds, if pilot will reach the final state before that

It's just moving the problem around: with that approach any attempt to directly call kill_pilots would wait for n seconds before doing the kill which is not really intuitive either...

with delay inside cancel_pilots it will always take at least that _timeout seconds (e.g., >=30s, with _timeout=30), but moving that delay into kill_pilots lets cancel_pilots to finish as soon as pilot reaches the FINAL state and if it will not reach it during _timeout then it will be "killed". Thus, you will let pilot to terminate on its own without enforcing minimal exec time for cancel_pilot method

mtitov commented 1 year ago

with that approach any attempt to directly call kill_pilots would wait for n seconds before doing the kill which is not really intuitive either

no, without providing delay it will not wait.

BTW maybe then we can add an additional input parameter?

def cancel_pilots(self, uids=None, _timeout=None, _kill_delay=None):
    ...
    self.publish(rpc.CONTROL_PUBSUB, {'cmd' : 'kill_pilots',
                                      'arg' : {'pmgr' : self.uid,
                                               'uids' : uids,
                                               'delay': _kill_delay}})
    ...
andre-merzky commented 1 year ago

I think we are looking sideways at this. The main problem seems to be that we fall back to kill on cancel. Maybe we should leave that to the application? So, if RE wants to cancel it should call cancel and then wait for final state with timeout. If that fails RE can call kill. RP's session close really wants kill semantics, so it should call kill.

mtitov commented 1 year ago

Maybe we should leave that to the application?

but we don't have "normal" termination then: Agent terminates only after receiving command cancel_pilot (that is what triggers Agent to terminate gracefully and pilot finishes with the state DONE), but this command cancel_pilot is pushed from PilotManager.cancel_pilots method, which also pushes command kill_pilots to kill the pilot job. Thus it is always a race condition which operation will finish first.

andre-merzky commented 1 year ago

I think we are looking sideways at this. The main problem seems to be that we fall back to kill on cancel. Maybe we should leave that to the application? So, if RE wants to cancel it should call cancel and then wait for final state with timeout. If that fails RE can call kill. RP's session close really wants kill semantics, so it should call kill.

codecov[bot] commented 1 year ago

Codecov Report

Merging #2962 (fac6032) into devel (6e25b61) will increase coverage by 0.19%. The diff coverage is 85.71%.

@@            Coverage Diff             @@
##            devel    #2962      +/-   ##
==========================================
+ Coverage   41.45%   41.64%   +0.19%     
==========================================
  Files          95       95              
  Lines       10482    10494      +12     
==========================================
+ Hits         4345     4370      +25     
+ Misses       6137     6124      -13     
Impacted Files Coverage Δ
src/radical/pilot/pilot_manager.py 18.55% <85.71%> (+6.05%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

andre-merzky commented 1 year ago

@mtitov : can you have a look at the last commit, please - something like that ok?

andre-merzky commented 1 year ago

Sorry, I mixed up fix/pilot_termination and this branch :-/

mtitov commented 1 year ago

Summary from the discussion during devel call: this PR is a workaround of the current implementation, but to have a better control over termination mechanism it was agreed to split cancel_pilots and kill_pilots commands into separate methods.

mtitov commented 1 year ago

@mtitov : can you have a look at the last commit, please - something like that ok?

@andre-merzky maybe to have both in PilotManager.close under if terminate:? (whichever will start first) and to have a separate cancel_pilots?

mtitov commented 1 year ago

@andre-merzky can you please approve it and merge?