ddm-j opened this issue 6 years ago (Open)
I am a bit confused. Does this problem happen with `reentrant=True` or `reentrant=False`? It shouldn't happen with `reentrant=False`. If it does, could it be that nodes are being restarted or network communication failed? The client and nodes periodically (at `pulse_interval`) send heartbeat messages to make sure they are reachable. If the client doesn't receive pulse messages for 3 (or 5) interval periods, it assumes the node is dead. It is likely there is a mistake in this implementation. If you confirm that you are encountering this issue with `reentrant=False`, I will take a look at it.
Have you tried monitoring the cluster / nodes with httpd? That may help you see whether a node becomes unreachable (sort nodes by update time in the cluster view).
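A minimal sketch of the setup being discussed, assuming a placeholder `compute` function (not code from this issue): it creates the cluster with `reentrant=False` and an explicit `pulse_interval`, and attaches dispy's browser-based monitor via `dispy.httpd.DispyHTTPServer`:

```python
# Sketch only: placeholder compute function and values chosen for illustration.
import dispy, dispy.httpd

def compute(n):
    # placeholder workload; dispy serializes this function and runs it on nodes
    import time
    time.sleep(n)
    return n

if __name__ == '__main__':
    cluster = dispy.JobCluster(
        compute,
        reentrant=False,    # jobs on a failed node are abandoned, not rescheduled
        pulse_interval=10,  # heartbeat messages every 10 seconds
    )
    # browser-based monitor (by default at port 8181); sorting nodes by update
    # time shows whether a node has stopped sending pulses
    http_server = dispy.httpd.DispyHTTPServer(cluster)
    jobs = [cluster.submit(i % 5) for i in range(20)]
    cluster.wait()
    http_server.shutdown()
    cluster.close()
```

Sorting nodes by their last update time in the browser view should show quickly whether a node stops sending pulses.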
I have looked at this a bit more, and it seems what may have happened is that the client lost its connection to the node temporarily (long enough for the client to deem the node a zombie) and "finished" the job by abandoning it; later the node came back and sent the job reply, but the client ignored it because the job was already done!
You can catch such jobs either with a callback (checking for status `dispy.DispyJob.Abandoned`) or by checking for that status after a job is done. If this is indeed what happened, I think it can be fixed by not abandoning dead jobs, in the hope that nodes may come back online (but at some point the scheduler has to give up, I guess). See the "TODO" comment around line 1864 in `reschedule_jobs` in `__init__.py`.
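For reference, a sketch of the callback approach mentioned above; the `job_callback` name, placeholder compute function, and retry policy are illustrative, not dispy's own code, and older dispy releases take the callback via `callback=` while newer ones renamed the parameter to `job_status=`:

```python
# Sketch: catch abandoned jobs in a callback and resubmit them once.
import dispy

abandoned = []  # jobs whose node was presumed dead

def job_callback(job):
    # called by dispy when a job reaches Finished, Terminated, Cancelled or
    # Abandoned status (and for provisional results)
    if job.status == dispy.DispyJob.Abandoned:
        abandoned.append(job)

def compute(n):
    # placeholder workload
    return n * n

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, callback=job_callback, reentrant=False)
    for i in range(100):
        job = cluster.submit(i)
        job.id = i
    cluster.wait()

    # one retry pass: resubmit whatever was abandoned while nodes were unreachable
    retry = list(abandoned)
    abandoned.clear()
    for job in retry:
        new_job = cluster.submit(*job.args)  # DispyJob keeps the submitted args
        new_job.id = job.id
    cluster.wait()
    cluster.close()
```

With `reentrant=True`, dispy itself reschedules jobs from failed nodes, so a manual retry pass like this is mainly relevant when `reentrant=False`.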
I'm submitting around 22,400 jobs to a cluster managed by SLURM. I haven't had any problems when submitting small numbers of jobs, but with this massive number of jobs, after a while (about 10% of the way through the job list) I start getting error messages:

```
Ignoring invalid reply for job 47744128919376
```

Each job takes about 30 s to 2 min to complete, and I have used the `setup` function to load the high-memory data/code for the compute function so that it does not have to be loaded repeatedly. The parameters passed to the function are a `dict` of size 4, each entry being a `list` of `int` of size 3, plus a `list` with two timestamps. I've tried setting `dispynode.py --zombie_interval=X` and `pulse_interval=Y` for various values. I've tried `reentrant=True`. None of these seem to help. I'm not sure if this is a memory issue with the size of the `jobs` list containing the parameters. I might try removing jobs from the list (or dict, as in the examples) as they complete, but I need to be able to access the outputs of the jobs and sort their results for post-processing based on `job.id`. Can I do this using the callback function? The return value of each job can be rather large, depending on the internal mechanisms of my computation function.
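A sketch of one way this could work with dispy's callback mechanism, assuming placeholder parameter shapes and a placeholder compute function (this is not the code from the issue): the callback stores each result in a dict keyed by `job.id` as soon as the job finishes, so the submitting loop does not have to keep every `DispyJob` object (and its large result) in one list until the end.

```python
# Sketch: collect results keyed by job.id inside the callback.
import threading
import dispy

results = {}                     # job.id -> job.result, filled as jobs finish
results_lock = threading.Lock()  # the callback runs in dispy's own thread

def job_callback(job):
    if job.status == dispy.DispyJob.Finished:
        with results_lock:
            results[job.id] = job.result
    elif job.status == dispy.DispyJob.Abandoned:
        print('job %s was abandoned' % job.id)

def compute(params, window):
    # placeholder for the real compute function
    # (params: dict of 4 lists of 3 ints, window: two timestamps)
    return sum(sum(v) for v in params.values())

if __name__ == '__main__':
    # 'callback=' in older dispy releases, 'job_status=' in newer ones
    cluster = dispy.JobCluster(compute, callback=job_callback)
    for i in range(100):         # the real run submits ~22,400 jobs
        params = {k: [1, 2, 3] for k in 'abcd'}
        job = cluster.submit(params, ('2019-01-01', '2019-01-02'))
        job.id = i               # used later to sort results for post-processing
    cluster.wait()
    cluster.close()
    ordered = [results[i] for i in sorted(results)]
```

If memory is the concern, this keeps only the return values rather than every `DispyJob` object; results that are still too large could instead be written to disk from the callback and merged afterwards.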