radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

RAPTOR: raptor_master.py question #3003

Closed AymenFJA closed 1 year ago

AymenFJA commented 1 year ago

In raptor_master.py, specifically at line https://github.com/radical-cybertools/radical.pilot/blob/10d25ebaf603f4a57ed08a129c310a1ebfb213d5/examples/misc/raptor_master.py#L301, there is a comment, but it is still not clear:

  1. Why do we sleep for 60s?
  2. Would that time also need to be increased when we have longer-running tasks?
  3. Is there a better approach than exposing this sleep directly?

This issue came up while @mtitov and I were going through the examples.

andre-merzky commented 1 year ago

That sleep is likely a leftover from a version where the submit call did not block and wait for task completion, but just submitted the tasks. So yes, the sleep can indeed be removed now.
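
To illustrate the difference between the two submit styles described above, here is a minimal sketch in plain Python. It does not use the RADICAL-Pilot API: `task`, `submit_nonblocking`, `submit_blocking` and `count_done` are hypothetical names built on `concurrent.futures`, and the durations are arbitrary. With a non-blocking submit the caller has to wait somehow (a fixed sleep would need to grow with task runtime), whereas a blocking submit returns only once the tasks are done.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def task(duration):
    """Stand-in for a task: sleeps to simulate work (hypothetical)."""
    time.sleep(duration)
    return duration


def submit_nonblocking(pool, durations):
    """Hand the tasks to the pool and return immediately with the futures."""
    return [pool.submit(task, d) for d in durations]


def submit_blocking(pool, durations):
    """Hand the tasks to the pool and return only once all have completed."""
    futures = submit_nonblocking(pool, durations)
    wait(futures)  # waits on completion, independent of how long tasks run
    return futures


def count_done(futures):
    """Number of futures that have finished."""
    return sum(1 for f in futures if f.done())


with ThreadPoolExecutor(max_workers=2) as pool:
    # Non-blocking: right after the call the tasks are still running, so the
    # caller would need a sleep or an explicit wait before shutting down.
    pending = submit_nonblocking(pool, [0.2, 0.2])
    done_right_after_nonblocking = count_done(pending)
    wait(pending)

    # Blocking: when the call returns, all tasks are done and no sleep is needed.
    finished = submit_blocking(pool, [0.2, 0.2])
    done_right_after_blocking = count_done(finished)
```

In this toy model, waiting on the futures replaces the fixed delay, so nothing has to be tuned when tasks run longer.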

AymenFJA commented 1 year ago

Oh great, happy to do so. I will open a PR with that. Thanks.


AymenFJA commented 1 year ago

@andre-merzky I see a different behavior, which was originally pointed out by @mtitov. Removing the time.sleep(60) led to an instant termination of the master:

    master.wait_workers(count=1)
    master.start()
    master.submit()

    # TODO: can be run from thread?
    master.stop()

    # TODO: worker state callback
    master.join()

    1691163340.425 : master.000000.worker.0000 : 5158  : 139873672492864 : DEBUG    : wait for registration to complete
    1691163340.435 : master.000000.worker.0000 : 5112  : 140625709971264 : DEBUG    : register: master.000000.worker.0000 / master.000000
    1691163340.436 : master.000000.worker.0000 : 5112  : 140625709971264 : DEBUG    : wait for registration to complete
    1691163340.438 : master.000000.worker.0000 : 5112  : 140625709971264 : DEBUG    : registration with master ok
    1691163340.438 : master.000000.worker.0000 : 5158  : 139873672492864 : DEBUG    : registration with master ok
    1691163340.487 : master.000000.worker.0000 : 5176  : 140253084510016 : DEBUG    : wait for registration to complete
    1691163340.487 : master.000000.worker.0000 : 5176  : 140253084510016 : DEBUG    : registration with master ok
    1691163340.515 : master.000000.worker.0000 : 5120  : 140496532444992 : DEBUG    : wait for registration to complete
    1691163340.515 : master.000000.worker.0000 : 5120  : 140496532444992 : DEBUG    : registration with master ok
    1691163340.819 : master.000000.worker.0000 : 5158  : 139873593792256 : DEBUG    : worker_terminate signal
    1691163340.820 : master.000000.worker.0000 : 5158  : 139873593792256 : ERROR    : callback error

Can you please clarify? Thanks.
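
For reference, the symptom can be modeled with a small toy example. This is only a sketch under the assumption that submit() returns before the tasks have completed. It is not a diagnosis of the actual cause, and `ToyMaster` and all of its methods (including `wait_tasks`) are hypothetical, not RADICAL-Pilot code. Calling stop() right after submit() ends the worker before the queue is drained, while waiting on the queue first lets every task finish.

```python
import queue
import threading
import time


class ToyMaster:
    """Hypothetical master with one worker thread (not the RADICAL-Pilot class)."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.completed = []
        self._stop_requested = threading.Event()
        self._worker = threading.Thread(target=self._work)

    def _work(self):
        # The worker exits as soon as a stop is requested, even if tasks remain.
        while not self._stop_requested.is_set():
            try:
                duration = self.tasks.get(timeout=0.05)
            except queue.Empty:
                continue
            time.sleep(duration)            # simulate task execution
            self.completed.append(duration)
            self.tasks.task_done()

    def start(self):
        self._worker.start()

    def submit(self, durations):
        # Assumption: submit only enqueues and returns immediately.
        for d in durations:
            self.tasks.put(d)

    def wait_tasks(self):
        # Block until every enqueued task has been marked done.
        self.tasks.join()

    def stop(self):
        self._stop_requested.set()

    def join(self):
        self._worker.join()


def run(wait_before_stop):
    """Run three short tasks and return how many completed before shutdown."""
    master = ToyMaster()
    master.start()
    master.submit([0.1, 0.1, 0.1])
    if wait_before_stop:
        master.wait_tasks()
    master.stop()
    master.join()
    return len(master.completed)


completed_with_early_stop = run(wait_before_stop=False)
completed_with_wait = run(wait_before_stop=True)
```

In this model an explicit wait on task completion plays the role the sleep played in the example, without depending on how long the tasks run.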

andre-merzky commented 1 year ago

Hmm, that's unexpected. Let me step through that code to see what's going on...