ddm-j opened this issue 6 years ago (Open)
I am a bit confused. Does this problem happen with `reentrant=True` or `reentrant=False`? It shouldn't happen with `reentrant=False`. If it does, could it be that nodes are being restarted or network communication failed? The client and nodes periodically (at `pulse_interval`) send heartbeat messages to make sure they are reachable. If the client doesn't receive pulse messages for 3 (or 5) interval periods, it assumes the node is dead. It is likely there is a mistake in this implementation. If you confirm that you are encountering this issue with `reentrant=False`, I will take a look at it.
Have you tried monitoring the cluster / nodes with httpd? That may help you see whether a node becomes unreachable (sort nodes by update time in the cluster view).
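A minimal sketch of the setup being discussed, assuming a placeholder `compute` function (not code from this issue): it creates the cluster with `reentrant=False` and an explicit `pulse_interval`, and attaches dispy's browser-based monitor via `dispy.httpd.DispyHTTPServer`:

```python
# Sketch only: placeholder compute function and values chosen for illustration.
import dispy, dispy.httpd

def compute(n):
    # placeholder workload; dispy serializes this function and runs it on nodes
    import time
    time.sleep(n)
    return n

if __name__ == '__main__':
    cluster = dispy.JobCluster(
        compute,
        reentrant=False,    # jobs on a failed node are abandoned, not rescheduled
        pulse_interval=10,  # heartbeat messages every 10 seconds
    )
    # browser-based monitor (by default at port 8181); sorting nodes by update
    # time shows whether a node has stopped sending pulses
    http_server = dispy.httpd.DispyHTTPServer(cluster)
    jobs = [cluster.submit(i % 5) for i in range(20)]
    cluster.wait()
    http_server.shutdown()
    cluster.close()
```

Sorting nodes by their last update time in the browser view should show quickly whether a node stops sending pulses.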
I have looked at this a bit more, and it seems what may have happened is that the client lost its connection to the node temporarily (long enough for the client to deem the node a zombie) and "finished" the job by abandoning it; later the node came back and sent the job reply, but the client ignored it because the job was already done!
You can catch such jobs either with a callback (checking for status `dispy.DispyJob.Abandoned`) or by checking for that status after a job is done. If this is indeed what happened, I think it can be fixed by not abandoning dead jobs, in the hope that nodes may come back online (but at some point the scheduler has to give up, I guess). See the "TODO" comment around line 1864 in `reschedule_jobs` in `__init__.py`.
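For reference, a sketch of the callback approach mentioned above; the `job_callback` name, placeholder compute function, and retry policy are illustrative, not dispy's own code, and older dispy releases take the callback via `callback=` while newer ones renamed the parameter to `job_status=`:

```python
# Sketch: catch abandoned jobs in a callback and resubmit them once.
import dispy

abandoned = []  # jobs whose node was presumed dead

def job_callback(job):
    # called by dispy when a job reaches Finished, Terminated, Cancelled or
    # Abandoned status (and for provisional results)
    if job.status == dispy.DispyJob.Abandoned:
        abandoned.append(job)

def compute(n):
    # placeholder workload
    return n * n

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, callback=job_callback, reentrant=False)
    for i in range(100):
        job = cluster.submit(i)
        job.id = i
    cluster.wait()

    # one retry pass: resubmit whatever was abandoned while nodes were unreachable
    retry = list(abandoned)
    abandoned.clear()
    for job in retry:
        new_job = cluster.submit(*job.args)  # DispyJob keeps the submitted args
        new_job.id = job.id
    cluster.wait()
    cluster.close()
```

With `reentrant=True`, dispy itself reschedules jobs from failed nodes, so a manual retry pass like this is mainly relevant when `reentrant=False`.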
I'm submitting around 22,400 jobs to a cluster managed by SLURM. I haven't had any problems when submitting small numbers of jobs, but with this massive number of jobs, after a while (about 10% of the way through the job list) I start getting error messages:

```
Ignoring invalid reply for job 47744128919376
```

Each job takes about 30 s to 2 min to complete, and I have used the `setup` function to load the high-memory data/code for the compute function so that it does not have to be loaded repeatedly. The parameters passed to the function are a `dict` of size 4, each entry being a `list` of `int` of size 3, plus a `list` with two timestamps. I've tried setting `dispynode.py --zombie_interval=X` and `pulse_interval=Y` for various values. I've tried `reentrant=True`. None of these seem to help. I'm not sure if this is a memory issue with the size of the `jobs` list containing the parameters. I might try removing jobs from the list (or dict, as in the examples) as they complete, but I need to be able to access the outputs of the jobs and sort their results for post-processing based on `job.id`. Can I do this using the callback function? The return value of each job can be rather large, depending on the internal mechanisms of my computation function.
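A sketch of one way this could work with dispy's callback mechanism, assuming placeholder parameter shapes and a placeholder compute function (this is not the code from the issue): the callback stores each result in a dict keyed by `job.id` as soon as the job finishes, so the submitting loop does not have to keep every `DispyJob` object (and its large result) in one list until the end.

```python
# Sketch: collect results keyed by job.id inside the callback.
import threading
import dispy

results = {}                     # job.id -> job.result, filled as jobs finish
results_lock = threading.Lock()  # the callback runs in dispy's own thread

def job_callback(job):
    if job.status == dispy.DispyJob.Finished:
        with results_lock:
            results[job.id] = job.result
    elif job.status == dispy.DispyJob.Abandoned:
        print('job %s was abandoned' % job.id)

def compute(params, window):
    # placeholder for the real compute function
    # (params: dict of 4 lists of 3 ints, window: two timestamps)
    return sum(sum(v) for v in params.values())

if __name__ == '__main__':
    # 'callback=' in older dispy releases, 'job_status=' in newer ones
    cluster = dispy.JobCluster(compute, callback=job_callback)
    for i in range(100):         # the real run submits ~22,400 jobs
        params = {k: [1, 2, 3] for k in 'abcd'}
        job = cluster.submit(params, ('2019-01-01', '2019-01-02'))
        job.id = i               # used later to sort results for post-processing
    cluster.wait()
    cluster.close()
    ordered = [results[i] for i in sorted(results)]
```

If memory is the concern, this keeps only the return values rather than every `DispyJob` object; results that are still too large could instead be written to disk from the callback and merged afterwards.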