Closed mtitov closed 1 year ago
@andre-merzky would it be ok to move that
timeout
intokill_pilots
method? thus we wouldn't wait for [fixed]_timeout
seconds, if pilot will reach the final state before that
It's just moving the problem around: with that approach any attempt to directly call kill_pilots
would wait for n
seconds before doing the kill which is not really intuitive either...
@andre-merzky would it be ok to move that
timeout
intokill_pilots
method? thus we wouldn't wait for [fixed]_timeout
seconds, if pilot will reach the final state before thatIt's just moving the problem around: with that approach any attempt to directly call
kill_pilots
would wait forn
seconds before doing the kill which is not really intuitive either...
with delay inside cancel_pilots
it will always take at least that _timeout
seconds (e.g., >=30s, with _timeout=30
),
but moving that delay into kill_pilots
lets cancel_pilots
to finish as soon as pilot reaches the FINAL state and if it will not reach it during _timeout
then it will be "killed". Thus, you will let pilot to terminate on its own without enforcing minimal exec time for cancel_pilot
method
with that approach any attempt to directly call kill_pilots would wait for n seconds before doing the kill which is not really intuitive either
no, without providing delay it will not wait.
BTW maybe then we can add an additional input parameter?
def cancel_pilots(self, uids=None, _timeout=None, _kill_delay=None):
...
self.publish(rpc.CONTROL_PUBSUB, {'cmd' : 'kill_pilots',
'arg' : {'pmgr' : self.uid,
'uids' : uids,
'delay': _kill_delay}})
...
I think we are looking sideways at this. The main problem seems to be that we fall back to kill
on cancel
. Maybe we should leave that to the application? So, if RE wants to cancel it should call cancel
and then wait for final state with timeout. If that fails RE can call kill
. RP's session close really wants kill
semantics, so it should call kill
.
Maybe we should leave that to the application?
but we don't have "normal" termination then: Agent terminates only after receiving command cancel_pilot
(that is what triggers Agent to terminate gracefully and pilot finishes with the state DONE), but this command cancel_pilot
is pushed from PilotManager.cancel_pilots
method, which also pushes command kill_pilots
to kill the pilot job. Thus it is always a race condition which operation will finish first.
I think we are looking sideways at this. The main problem seems to be that we fall back to kill
on cancel
. Maybe we should leave that to the application? So, if RE wants to cancel it should call cancel
and then wait for final state with timeout. If that fails RE can call kill
. RP's session close really wants kill
semantics, so it should call kill
.
Merging #2962 (fac6032) into devel (6e25b61) will increase coverage by
0.19%
. The diff coverage is85.71%
.
@@ Coverage Diff @@
## devel #2962 +/- ##
==========================================
+ Coverage 41.45% 41.64% +0.19%
==========================================
Files 95 95
Lines 10482 10494 +12
==========================================
+ Hits 4345 4370 +25
+ Misses 6137 6124 -13
Impacted Files | Coverage Δ | |
---|---|---|
src/radical/pilot/pilot_manager.py | 18.55% <85.71%> (+6.05%) |
:arrow_up: |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
@mtitov : can you have a look at the last commit, please - something like that ok?
Sorry, I mixed up fix/pilot_termination
and this branch :-/
Summary from the discussion during devel call: this PR is a workaround of the current implementation, but to have a better control over termination mechanism it was agreed to split cancel_pilots
and kill_pilots
commands into separate methods.
@mtitov : can you have a look at the last commit, please - something like that ok?
@andre-merzky maybe to have both in PilotManager.close
under if terminate:
? (whichever will start first) and to have a separate cancel_pilots
?
@andre-merzky can you please approve it and merge?
@andre-merzky would it be ok to move that
timeout
intokill_pilots
method? thus we wouldn't wait for [fixed]_timeout
seconds, if pilot will reach the final state before that