maharjun opened this issue 7 years ago
Thanks for taking the time to understand this error. I am sadly very busy with other matters, but I will gladly accept any pull request proposing a solution for this.
Hello everyone,
I am not sure whether the issue maharjun describes is related to the one I am observing (apologies in advance if it is not), but what I am seeing is definitely a case of "SCOOP is losing futures" (see below). The strange thing is that when I use version 0.7.1.1 (from pip) instead of 0.7.2.0 (cloned from the repository), I no longer observe the error. Unfortunately, I need other improvements that are only available in the latest version. I would greatly appreciate any help you can give me.
All the best, Nestor
```
[2017-07-26 16:49:16,747] scoopzmq (192.168.2.23:52308) WARNING Lost track of future ('192.168.2.23:52308', 4):KFoldCrossValidation_runWorker((<Model.Model instance at 0x2ae4d309ecf8>, <Optimizer.Optimizer object at 0x2ae4d2f73cd0>),){}=None. Resending it...
(MainThread) Lost track of future ('192.168.2.23:52308', 4):KFoldCrossValidation_runWorker((<Model.Model instance at 0x2ae4d309ecf8>, <Optimizer.Optimizer object at 0x2ae4d2f73cd0>),){}=None. Resending it...
Traceback (most recent call last):
  File "/usr/projects/hpcsoft/toss2/common/anaconda/4.1.1-python-2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/projects/hpcsoft/toss2/common/anaconda/4.1.1-python-2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 298, in
```
I have the same problem with local workers on version 0.7.2.
```
[2023-08-27 12:32:44,600] scoopzmq (b'127.0.0.1:59141') WARNING Lost track of future (b'127.0.0.1:59141', 9):run_test_on_date('2023-08-25-08',){}=None. Resending it...
2023-08-27 12:32:44 WARNING SCOOPLogger Lost track of future (b'127.0.0.1:59141', 9):run_test_on_date('2023-08-25-08',){}=None. Resending it...
```
The workers complete their tasks (according to the worker logs), but the futures are resent again and again, and the main process never ends.
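For context, here is a minimal sketch of the kind of script that triggers this for me; the body of `run_test_on_date` below is a placeholder, the real per-date work is not shown.

```python
# Minimal sketch of the setup that produces the warning; only the structure
# matters, the work done per date is a placeholder.
import time

from scoop import futures


def run_test_on_date(date_str):
    # Placeholder for the actual per-date test.
    time.sleep(5)
    return date_str


if __name__ == "__main__":
    dates = ["2023-08-25-08", "2023-08-25-09", "2023-08-25-10"]
    # Launched with:  python -m scoop -n 4 script.py
    for result in futures.map(run_test_on_date, dates):
        print(result)
```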
As a result of the changes in the master branch, it appears that SCOOP is losing futures even when there are no communication issues. The following problems lead to lost futures.
First, `execQueue.inprogress` does not seem to be updated when a future that is popped from the queue is run (via `runFuture`). This means that, for the duration that a job is running, none of `ready`, `movable`, or `inprogress` contain the future that is being executed. Now, if the asynchronous thread that performs futures reporting decides to send an update message during this phase, the executing future is omitted, causing it to be erroneously deleted from `assigned_tasks` in the broker. A sketch of this window follows below.
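The following is a simplified sketch of the gap just described; `ExecQueueSketch` and `run_future_sketch` are illustrative stand-ins, not SCOOP's actual classes or methods.

```python
# Hypothetical sketch of the tracking gap: while a future is being executed it
# sits in none of ready / movable / inprogress, so a status report built from
# those collections omits it.

class ExecQueueSketch:
    def __init__(self):
        self.ready = []       # futures ready to run locally
        self.movable = []     # futures that may be shipped to other workers
        self.inprogress = {}  # futures currently executing (not updated here)

    def pop(self):
        return self.ready.pop()

    def currently_tracked(self):
        # What an asynchronous reporting thread would see at any instant.
        return list(self.ready) + list(self.movable) + list(self.inprogress.values())


def run_future_sketch(queue, execute):
    future = queue.pop()
    # Gap described above: the future is not recorded in queue.inprogress here,
    # so between pop() and completion it is invisible to currently_tracked(),
    # and a report sent now makes the broker drop it from assigned_tasks.
    result = execute(future)
    # A possible remedy is to add the future to queue.inprogress before calling
    # execute() and remove it afterwards, so mid-execution reports include it.
    return result
```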
Second, the following sequence of events is an issue:

1. `sendResult` is called on that future on the remote executing worker. This results in a `STATUS_DONE` being sent to the broker, which then (wrongly) deletes the future from `assigned_tasks`.
2. `self.askForPreviousFutures()` is called on the process that spawned this future (note that this is possible because the processes are asynchronous). This then leads to the reporting of a 'Lost future'.

Basically, the fact that the future is not deleted on the originator worker before the `STATUS_DONE` is processed is what causes the error. A sketch of one possible mitigation follows below.
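As a sketch of one possible direction (not SCOOP's actual broker code; `BrokerSketch` and its methods are hypothetical), the broker could remember recently completed futures so that a later query about previous futures does not misreport them as lost:

```python
# Hypothetical sketch: track completed futures on the broker so that the race
# between STATUS_DONE and askForPreviousFutures does not yield "lost" futures.

STATUS_DONE = b"D"  # placeholder constant, for illustration only


class BrokerSketch:
    def __init__(self):
        self.assigned_tasks = {}         # future_id -> worker executing it
        self.recently_completed = set()  # future_ids whose STATUS_DONE was seen

    def on_status_update(self, future_id, status):
        if status == STATUS_DONE:
            # Instead of only deleting the entry (which makes the future look
            # lost to its originator), also record that it finished.
            self.assigned_tasks.pop(future_id, None)
            self.recently_completed.add(future_id)

    def on_ask_for_previous_futures(self, future_ids):
        # Only futures that are neither assigned nor known to be done should
        # be treated as lost and resent.
        return [fid for fid in future_ids
                if fid not in self.assigned_tasks
                and fid not in self.recently_completed]
```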