radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Problem stopping Raptor master in 1.36 #3001

Open eirrgang opened 1 year ago

eirrgang commented 1 year ago

Automated test jobs for scalems started failing recently for the devel branch of RP. See, for instance, https://github.com/SCALE-MS/scale-ms/actions/runs/5663049971/job/15569404817

It seems that the Raptor Master task scalems-rp-raptor.846fc04c-2b48-11ee-b1b6-8daf5ac26a8e should have received a message telling it to call self.stop() on itself from within a result_cb(). The Task carrying that message was marked DONE, but the Master task kept running in state AGENT_EXECUTING for at least 20 seconds. It was later canceled successfully with Task.cancel(). https://github.com/SCALE-MS/scale-ms/suites/14565990020/artifacts/824790031

It looks like master.stop() was called without an error, and there is a log entry for the term getting set. The log statement on the line after master.stop() in the callback also emits its message. But the script never records the log message from the line after Master.join() (_raptor.join()), so join() apparently does not return.
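To make the sequence above easier to discuss, here is a minimal, self-contained sketch of the stop-from-callback-then-join pattern. It is not the radical.pilot API: `ToyMaster`, `run_demo`, the `"stop"` message, and the timeout on `join()` are hypothetical stand-ins built on plain `threading`, and the log messages only mirror the three points described above (term set, line after `stop()`, line after `join()`). In this toy version `join()` returns promptly; the failing job behaves as if the last step never completes.

```python
import logging
import threading
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("toy_master")


class ToyMaster:
    """Hypothetical stand-in for a Raptor master; NOT the radical.pilot API."""

    def __init__(self):
        self._term = threading.Event()  # termination flag
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Work loop: exits once the termination flag is set.
        while not self._term.is_set():
            time.sleep(0.01)

    def stop(self):
        log.info("term set")
        self._term.set()

    def join(self, timeout=None):
        # Returns True if the work loop finished, False if the wait timed out.
        self._thread.join(timeout)
        return not self._thread.is_alive()

    def result_cb(self, msg):
        # The "stop" message asks the master to stop itself from the callback.
        if msg == "stop":
            self.stop()
            log.info("after stop()")


def run_demo(timeout=2.0):
    master = ToyMaster()
    master.start()
    master.result_cb("stop")
    joined = master.join(timeout=timeout)
    if joined:
        log.info("after join()")
    else:
        log.warning("join() did not return within %.1fs", timeout)
    return joined


if __name__ == "__main__":
    run_demo()
```

Bounding the wait in a sketch like this is only a diagnostic choice: it turns a silent hang into an explicit log line, which is the piece of information missing from the failing run.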

I'm going to leave that branch undisturbed for a while to give you a chance to look at it. I'm making some adjustments to the Master script in a different branch to move the Worker management out of the main script body. I'll let you know if I encounter something similar with a different script structure, but I'll also be interested to hear whatever you deduce.

eirrgang commented 1 year ago

Update: Since yesterday's release of 1.36, the scalems tests also fail against the official RP release.

andre-merzky commented 1 year ago

Thanks Eric, I'll check it out!