soravux / scoop

SCOOP (Scalable COncurrent Operations in Python)
https://github.com/soravux/scoop
GNU Lesser General Public License v3.0
625 stars 88 forks source link

MULTIPLE CHANGES to BUGFIX Dropped Futures #67

Open maharjun opened 7 years ago

maharjun commented 7 years ago

This is a surgical change in SCOOP that fixes many of its previous issues and finally makes it useful along with the communication error robustness.

I have slightly changed the run control loop, sendResult, made the deletes and appends more explicit, and fixed #65 fixed #66 (which was already fixed by #63) and fixed #64.

All tests seem to be passing except the SLURM tests.

Also added some documentation to the code.

maharjun commented 7 years ago

This still does not completely fix the problem as i have later found out. I will fix this over the weekend and add a description of the new logic in a file in the scoop root directory.

soravux commented 7 years ago

Thanks for those, I'll review them this week and merge them as soon as possible.

maharjun commented 7 years ago

The following is a summary of changes made to scoop

  1. Made deletes from futureDict and execQueue more explicit (i.e. they thow an error if the object to delete doesn't exist.)

  2. Added an isReady Flag which represents the state of having been processed collected by the source worker. A ready future has all references cleared except the one in execQueue.ready. Also, the ready queue now only contains futures for which isReady is set.

  3. Made the append functions more explicit to convey exactly what's going on

  4. Made execQueue.inprogress actually track futures that are in progress

  5. Made control loop clearer and added comments explaining some of the underlying mechanisms

  6. Made the fault tolerance scheme a heartbeat scheme.

    1. Each worker has a parallel process (not thread, see point 7 & 8) that sends the heartbeat every TIME_BETWEEN_HEARTBEATS.

    2. The broker checks every TASK_CHECK_INTERVAL seconds for a worker with assigned jobs that has not sent a heartbeat in the last TIME_BEFORE_LOSING_WORKER seconds. It then considers this worker dead and performs the following:

      1. Cleans up all futures created by that worker from self.assigned_tasks and self.unassigned_tasks.

      2. Requests a resend of all futures that were assigned to be executed on that worker (This could potentially fail if there is no connection to the host that spawned these processes. A warning is issued in this case)

      3. Deletes entry of this future from the self.heartbeat_times dict and the self.assigned_times.

  7. Made the local broker a python forked process instead of thread because running long C programs from pypthon causes python to be unable to switch threads

  8. Made the sending of heartbeat a forked process due to point 7. skipped heartbeats cause unintentionally dropped processes

  9. Changed the scheduling to be FCFS:

    1. Changed the updateQueue to remove the dependence on lowwatermark as that didnt make sense when we only fetch a job once one is completed (i.e. local queue is empty and greenlet has returned something). Also the time length only changes after having fetched a job.

    2. Changed the container for self.available_workers to a set to prevent spamming of futureRequests which may occur if the UpdateQueue receives replies instead of futures after sending the future request.

maharjun commented 6 years ago

Yeah, I had made a few more changes but not pushed it here. Will do today.

On Thu, Jul 27, 2017 at 3:24 AM, Néstor F. Aguirre <notifications@github.com

wrote:

@nfaguirrec requested changes on this pull request.

In the line 329 of scoop/_types.py the function _delete is still used "sending_future._delete()", however this function was removed in this same commit. Program fails with the error: AttributeError: 'Future' object has no attribute "_delete"

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/soravux/scoop/pull/67#pullrequestreview-52538761, or mute the thread https://github.com/notifications/unsubscribe-auth/AEXsfveb7my9RaZRSF0jaI1xYSwgw21sks5sR-bWgaJpZM4NzNlZ .

maharjun commented 6 years ago

Hmm.. Its strange, I can't seem to find he part of code that you're talking about. Are you sure you're not using an older version of this?

BTW, If you're running this on linux, you may be interested in trying out the IGITUGraz fork (you should use the branch TUGmaster). It has some system-dependent fixes that improve process cleanup during interruption and exceptions.

nfaguirrec commented 6 years ago

I found the error in this pull request. Not in a specific branch. I was interested in testing your modifications: "Fixing case where the resend request could fail if source worker is lost" and "Made the local broker a python forked process instead of thread". Thanks!

maharjun commented 6 years ago

The thing is there are some commits after that in this pull request. titled as follows

Changed Variable name status_times to heartbeat_times      (528081c)
Simplified FutureQueue.pop                                 (5c3e055)
MULTIPLE CHANGES Fixed Inefficient Scheduling (Issue #68)  (8de3f0c)

search for the commit ID's on this page you should find them. These commits contain the fixes to the issues you mentioned and more.

Enjoy!

maharjun commented 6 years ago

Interestingly, when I check the diff of ea14f32, I see that the very line you mention has been changed, so right now, I really have no clue where it is that you're seeing that patch of code.

This is a link to the file at that commit (you should be able to see the commit ID in the link) https://github.com/IGITUGraz/scoop/blob/ea14f3258b11b3315dbf10cf885a721a999edad3/scoop/_types.py

If you go to the line you specified (329) you'll see that there is no usage of _delete

PS: I think it would be easier for you if you simply pulled this branch into your local repo and tried it out. PPS: I have posted a list of known issues here: (https://github.com/soravux/scoop/issues/60#issuecomment-315826483)

nfaguirrec commented 6 years ago

OK. I will try it. Thanks a lot !