soravux / scoop

SCOOP (Scalable COncurrent Operations in Python)
https://github.com/soravux/scoop
GNU Lesser General Public License v3.0

SCOOP is Losing Futures #64

Open maharjun opened 7 years ago

maharjun commented 7 years ago

Following the changes in the master branch, SCOOP appears to be losing futures even when there are no communication issues. The following two problems lead to lost futures.

  1. execQueue.inprogress does not seem to be updated when a future that is popped from the queue is run (via runFuture). This means that while a job is running, none of ready, movable, or inprogress contains the future being run. If the asynchronous thread that performs futures reporting sends an update message during this window, the executing future is not included, causing it to be erroneously deleted from assigned_tasks in the broker.

  2. The following sequence of events is an issue:

    1. A future gets completed on a remote worker thread
    2. sendResult is called on that future on the remote executing worker. This results in a STATUS_DONE being sent to the broker, which then (wrongly) deletes the future from assigned_tasks.
    3. self.askForPreviousFutures() is called on the process that spawned this future (Note that this is possible as the processes are asynchronous). This then leads to the reporting of a 'Lost future'.

    Basically, the error is caused by the future not being deleted on the originator worker before the STATUS_DONE is sent.
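The two problems above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyBroker`, `ToyQueue`, `dropped_while_running`, and `lost_after_done` are hypothetical names invented for illustration, and the bookkeeping is a simplification of what the report describes, not SCOOP's actual implementation.

```python
# Toy model only: the names mirror the discussion above, but none of this is SCOOP code.


class ToyBroker:
    """Hypothetical broker that tracks which futures each worker holds."""

    def __init__(self):
        self.assigned_tasks = {}  # worker name -> set of future ids

    def assign(self, worker, future_id):
        self.assigned_tasks.setdefault(worker, set()).add(future_id)

    def status_update(self, worker, reported_ids):
        """Forget every future the worker did not report; return what was dropped."""
        known = self.assigned_tasks.get(worker, set())
        reported = set(reported_ids)
        self.assigned_tasks[worker] = known & reported
        return known - reported


class ToyQueue:
    """Hypothetical worker-side queue with the three collections named in the report."""

    def __init__(self, track_running):
        self.ready, self.movable, self.inprogress = [], [], []
        self.track_running = track_running  # False models problem 1

    def pop_for_execution(self):
        future_id = self.ready.pop(0)
        if self.track_running:
            self.inprogress.append(future_id)
        return future_id

    def report(self):
        return self.ready + self.movable + self.inprogress


def dropped_while_running(track_running):
    """Problem 1: a status update arrives while future 4 is executing."""
    broker, queue = ToyBroker(), ToyQueue(track_running)
    broker.assign("worker-1", 4)
    queue.ready.append(4)
    queue.pop_for_execution()  # future 4 is now running
    return broker.status_update("worker-1", queue.report())


def lost_after_done(delete_on_originator_first):
    """Problem 2: the originator checks on future 4 after STATUS_DONE reached the broker."""
    broker = ToyBroker()
    broker.assign("worker-2", 4)
    originator_pending = {4}
    if delete_on_originator_first:
        originator_pending.discard(4)  # assumed ordering that avoids the warning
    broker.assigned_tasks["worker-2"].discard(4)  # broker handles STATUS_DONE
    known = set().union(*broker.assigned_tasks.values())
    return originator_pending - known  # futures the originator would report as lost
```

In this model, `dropped_while_running(False)` drops the running future, while `dropped_while_running(True)` keeps it. Likewise, `lost_after_done(False)` reports the future as lost, while `lost_after_done(True)` does not.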

soravux commented 7 years ago

Thanks for taking the time to understand this error. Sadly, I am very busy with other matters, but I will gladly take any pull request proposing a solution for this.

nfaguirrec commented 6 years ago

Hello everyone,

I am not sure whether the issue that maharjun describes is related to the one I am observing (I apologize in advance if it is not), but what I see is definitely a case of “SCOOP is Losing Futures” (see below). The odd thing is that when I use version 0.7.1.1 (from pip) instead of 0.7.2.0 (cloned from the repository), the error no longer occurs. Unfortunately, I need other upgrades that are only available in the latest version. I would greatly appreciate any help you can give me.

All the best, Nestor

[2017-07-26 16:49:16,747] scoopzmq (192.168.2.23:52308) WARNING Lost track of future ('192.168.2.23:52308', 4):KFoldCrossValidation_runWorker((<Model.Model instance at 0x2ae4d309ecf8>, <Optimizer.Optimizer object at 0x2ae4d2f73cd0>),){}=None. Resending it...
(MainThread) Lost track of future ('192.168.2.23:52308', 4):KFoldCrossValidation_runWorker((<Model.Model instance at 0x2ae4d309ecf8>, <Optimizer.Optimizer object at 0x2ae4d2f73cd0>),){}=None. Resending it...
Traceback (most recent call last):
  File "/usr/projects/hpcsoft/toss2/common/anaconda/4.1.1-python-2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "main", fname, loader, pkg_name)
  File "/usr/projects/hpcsoft/toss2/common/anaconda/4.1.1-python-2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/main.py", line 298, in
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/main.py", line 92, in main
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/main.py", line 285, in run
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/main.py", line 266, in futures_startup
  File "build/bdist.linux-x86_64/egg/scoop/futures.py", line 65, in _startup
  File "build/bdist.linux-x86_64/egg/scoop/_control.py", line 273, in runController
  File "build/bdist.linux-x86_64/egg/scoop/_types.py", line 359, in pop
  File "build/bdist.linux-x86_64/egg/scoop/_types.py", line 382, in updateQueue
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 352, in recvFuture
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 269, in _recv
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 369, in sendFuture
  File "build/bdist.linux-x86_64/egg/scoop/encapsulation.py", line 164, in pickleFileLike
IOError: File not open for reading
[2017-07-26 16:49:16,816] launcher (127.0.0.1:42167) INFO Root process is done.
[2017-07-26 16:49:16,816] workerLaunch (127.0.0.1:42167) DEBUG Closing workers on wf535 (4 workers).
[2017-07-26 16:49:16,816] brokerLaunch (127.0.0.1:42167) DEBUG Closing local broker.
[2017-07-26 16:49:16,816] launcher (127.0.0.1:42167) INFO Finished cleaning spawned subprocesses.

RuralHunter commented 10 months ago

I have the same problem for local workers with version 0.7.2.

[2023-08-27 12:32:44,600] scoopzmq  (b'127.0.0.1:59141') WARNING Lost track of future (b'127.0.0.1:59141', 9):run_test_on_date('2023-08-25-08',){}=None. Resending it...
2023-08-27 12:32:44 WARNING SCOOPLogger Lost track of future (b'127.0.0.1:59141', 9):run_test_on_date('2023-08-25-08',){}=None. Resending it...

The workers have completed (according to the worker logs), but the futures are resent again and again, and the main process never ends.