pgiri / dispy

Distributed and Parallel Computing Framework with / for Python
https://dispy.org

"Could not send reply for job" #136

Closed · gobbedy closed this issue 6 years ago

gobbedy commented 6 years ago

Hi again,

First off thanks so much for your help with the other issues I opened. So far I'm pleased with my trial of dispy.

I've tried the example code (with small changes) on several pairs of nodes, and it has worked without problem.
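
(For reference, my test is essentially the canonical example from the dispy documentation, roughly the sketch below. my_test_function in the logs further down is just my renamed version of that compute function, and the node pattern here is a placeholder rather than my actual configuration.)

    # Rough sketch of my test, adapted from the dispy documentation example.
    # The node pattern is a placeholder, not my actual cluster addresses.
    import dispy

    def my_test_function(n):
        # runs on the dispynode side
        import time, socket
        time.sleep(n)
        return (socket.gethostname(), n)

    if __name__ == '__main__':
        cluster = dispy.JobCluster(my_test_function, nodes=['172.16.142.*'])
        jobs = [cluster.submit(i) for i in range(20)]
        for job in jobs:
            host, n = job()  # waits for the job and returns its result
            print('%s executed job %s with %s' % (host, job.id, n))
        cluster.print_status()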

Then, at some point, on one pair of nodes after killing dispynode.py and relaunching it with --clean, I got a bunch of messages like this on the server node:

2018-07-18 16:37:27 dispynode - Could not send reply for job 140115792197096 to ('172.16.142.49', 51347); saving it in "/tmp/dispy/node/172.16.142.49/my_test_function_3azb4shn/_dispy_job_reply_140115792197096"
2018-07-18 16:37:27 dispynode - Could not send reply for job 140115792087104 to ('172.16.142.49', 51347); saving it in "/tmp/dispy/node/172.16.142.49/my_test_function_3azb4shn/_dispy_job_reply_140115792087104"

And on the client node, messages like this:

2018-07-18 16:38:22 dispy - Ignoring invalid reply for job 140115792085064 from 172.16.142.50
2018-07-18 16:38:22 dispy - Ignoring invalid reply for job 140115792085064 from 172.16.142.50
2018-07-18 16:38:22 dispy - Ignoring invalid reply for job 140115792085064 from 172.16.142.50
2018-07-18 16:38:22 dispy - Ignoring invalid reply for job 140115792085064 from 172.16.142.50

I have a gut feeling it somehow means the previous dispynode was not killed properly, but I have no idea how to debug.

Could you please provide guidance?

gobbedy commented 6 years ago

This has happened two more times, and I can provide a bit more information, in case it helps.

In every case, I connected to the client and server nodes (two identical nodes on a cluster) via a SLURM salloc session requesting 2 nodes. (SLURM is just a scheduler that assigns time slots for nodes on a cluster.)

My session on these nodes lasts 3 hours, at which point I'm kicked off and I must request a new session via SLURM.

When I request a new session I usually get the same 2 nodes. It's when I get the same nodes that the issue arises.

This doesn't happen otherwise; i.e., usually during one 3-hour session I can kill dispynode (using the 'kill' command) and restart it at will with no issues. I can even ssh out of the nodes and back in with no issue.

This is my theory:

  1. At the end of the 3-hour session, SLURM kills whatever processes are running on the 2 nodes. It kills dispynode in some way that doesn't allow it to exit gracefully (a kill -9, maybe).

  2. When dispynode dies ungracefully it leaves behind some child processes.

  3. When dispynode is relaunched on top of remnant child processes it behaves erratically.

Would that even make sense from what you know of dispynode?

If it does, any idea how I can properly kill the leftover child processes of dispynode when the SLURM scheduler has botched the kill?

pgiri commented 6 years ago

If you start dispynode with the --clean option, the node discards any saved pending job results, so it shouldn't cause "Could not send reply for job" errors on the node.

The scheduler ignores any reply for a job that it has already discarded (e.g., after a node is deemed a zombie), which is what you are seeing at the client. There are a few options. You can pass reentrant=True to JobCluster if your computation is reentrant; in that case, the scheduler will resubmit any failed jobs to newly found nodes and will finish all jobs. Another option is to use the recover_jobs function to get results for any jobs that finished at the node but were not received by the client (although from your description it looks like this case doesn't apply).

If the nodes can be killed before jobs are done, there is not much that can be done to finish them all cleanly, unless the computation is reentrant. In that case, as suggested, start dispynode with --clean each time and pass reentrant=True to JobCluster. See the documentation for more details.
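
To illustrate both options, here is a minimal sketch, assuming a computation that can safely be re-run; the compute function, node pattern, and recovery file name below are placeholders rather than anything from your setup:

    import dispy

    def compute(n):
        # trivial computation that is safe to run more than once (reentrant)
        import time
        time.sleep(n)
        return n * n

    if __name__ == '__main__':
        # reentrant=True lets the scheduler resubmit jobs that were running on a
        # node that disappeared (e.g., killed by SLURM) instead of discarding them.
        cluster = dispy.JobCluster(compute, nodes=['172.16.142.*'], reentrant=True)
        jobs = [cluster.submit(i) for i in range(20)]
        for job in jobs:
            print(job())  # waits for each job and prints its result
        cluster.print_status()

        # If the client itself dies while jobs are running, results saved at the
        # nodes can be retrieved later with recover_jobs, using the recovery file
        # JobCluster wrote when it started (the file name here is hypothetical):
        # jobs = dispy.recover_jobs('_dispy_20180718163727')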

gobbedy commented 6 years ago

@pgiri thanks as always for your fast reply.

Unfortunately I have been using the --clean option, but the "Could not send reply" messages appear in spite of it.

This is the exact command I've been using: dispynode.py --clean --daemon &

If I'm not mistaken, the rest of your message pertains to the client-side messages ("Ignoring invalid reply"), which I think will be fixed once the server works properly. I do appreciate the explanation, and it prompted me to google reentrancy, so I learned something. I don't believe my computation is reentrant.

Lastly, to clarify, there were no jobs running when the nodes were killed. The only thing running was the dispynode script itself.

At this point I'll speak to the sysadmin to find out if there's anything I can do (some kind of reboot, or simulated reboot, before every session perhaps).

gobbedy commented 6 years ago

My workaround has been to avoid backgrounding the script.

In other words, instead of dispynode.py --clean --daemon &

I do dispynode.py --clean --daemon

To effectively background the script I launch it inside tmux.

For reasons I don't understand, the issue I was seeing does not occur when the script is run in the foreground. I can use the same node at will and there seems to be no recovery problem.

Closing as I have a workaround.