gregsexton / ob-ipython

org-babel integration with Jupyter for evaluation of (Python by default) code blocks
739 stars 109 forks

client.py deadlocks on BlockingKernelClient.get_iopub_msg() #164

Open ghost opened 6 years ago

ghost commented 6 years ago

It seems client.py sometimes deadlocks waiting for messages on the IOPub channel that have been lost.

The IPython doc says that IOPub is a "broadcast channel", which I think implies that there is no guarantee the messages actually get delivered to client.py, and indeed they quite often get lost in my setup ([Emacs on a MacBook] <=[ssh tunnel]=> [IPython kernel on a VM instance in the GCP cloud])

The quickest hack to work around the deadlocks would be to change the first "while True:" to "while False:" in the function msg_router, at the cost of losing all stdout/stderr output. But at least this lets the code execute without blocking indefinitely, and historical stdout/stderr data is available at any time from the shell channel.

I guess the correct solution would be to regard any message from IOPub as a bonus that may get lost at any time, and to wait only for history data from the shell channel.
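To illustrate the idea, here is a minimal sketch of my own (not ob-ipython code): IOPub output is drained opportunistically and its loss is tolerated, while the loop only ever waits on the shell channel. The message getters are passed in as callables so the logic can be exercised without a live kernel; they mimic the shape of jupyter_client's blocking `get_shell_msg`/`get_iopub_msg`, which raise `queue.Empty` on timeout.

```python
import queue

def wait_for_reply(get_shell_msg, get_iopub_msg, msg_id, poll_timeout=1.0):
    """Collect best-effort IOPub output while waiting for the authoritative
    shell-channel reply to `msg_id`. The callables stand in for the
    jupyter_client blocking API and raise queue.Empty on timeout."""
    outputs = []
    reply = None
    while reply is None:
        # Drain whatever IOPub happened to deliver; dropped messages are
        # simply never seen, and that must not block us.
        try:
            while True:
                msg = get_iopub_msg(timeout=0)
                if msg["msg_type"] == "stream":
                    outputs.append(msg["content"]["text"])
        except queue.Empty:
            pass
        # The shell channel is request/reply, so a reply is expected to
        # arrive; we poll with a finite timeout rather than block forever.
        try:
            msg = get_shell_msg(timeout=poll_timeout)
            if msg.get("parent_header", {}).get("msg_id") == msg_id:
                reply = msg
        except queue.Empty:
            continue
    return reply, outputs
```

The point of the shape is that every wait has a finite timeout, so a lost IOPub message degrades output capture instead of hanging the evaluation.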

ghost commented 6 years ago

It seems that separating the IOPub listener and the shell channel listener into different threads mostly works. I'm quite busy at the moment, but maybe I'll send a pull request this weekend.
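The two-thread split can be sketched roughly as follows (my own illustration, not the actual patch): the IOPub listener runs on a daemon thread with short timeouts, so a lost IOPub message can no longer stall the thread that waits for the shell reply. As above, the message getters are injected callables mimicking the blocking jupyter_client API.

```python
import queue
import threading

def listen_iopub(get_iopub_msg, sink, stop):
    """Best-effort IOPub listener: polls with a short timeout and appends
    whatever arrives; messages that zmq drops are simply never seen."""
    while not stop.is_set():
        try:
            sink.append(get_iopub_msg(timeout=0.2))
        except queue.Empty:
            continue

def run_with_split_listeners(get_shell_msg, get_iopub_msg):
    """Wait for the shell reply on the main thread while IOPub is drained
    on a separate thread, so the two channels cannot block each other."""
    outputs, stop = [], threading.Event()
    t = threading.Thread(target=listen_iopub,
                         args=(get_iopub_msg, outputs, stop), daemon=True)
    t.start()
    try:
        reply = get_shell_msg()  # request/reply: a reply is expected here
    finally:
        stop.set()
        t.join()
    return reply, outputs
```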

ghost commented 6 years ago

I'm still busy at the moment, but let me share some findings on using ob-ipython for cloud-based remote computation:

  1. zmq cannot guarantee message delivery. Any message may get lost, particularly in the case of inter-continental communication (such as Japan <-> US).
  2. zmq does guarantee atomic message delivery: either the whole message is delivered or nothing is.

Consequently, using client.py through an inter-continental ssh tunnel is very prone to deadlocks, because it has to wait on zmq messages. My current workaround is to use zmq only for local IPC, and to use a synchronous protocol such as ssh over the network. What I do is:

  1. Install client.py on the remote server
  2. Instead of python client.py --conn-file kernel.json --execute, use ssh remote-server python client.py --conn-file kernel.json --execute
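The substitution above amounts to a one-line change in how the client command is assembled; a small sketch, where `remote-server` and `kernel.json` are placeholders for your own setup:

```python
def remote_client_argv(host, conn_file, local=False):
    """Build the argv for running client.py either locally over the zmq
    tunnel (deadlock-prone across continents) or remotely over ssh, which
    carries the request/reply synchronously while zmq stays host-local."""
    base = ["python", "client.py", "--conn-file", conn_file, "--execute"]
    if local:
        return base
    return ["ssh", host] + base
```

The resulting argv can be handed to subprocess.run or its Emacs equivalent; the important property is that nothing in the Emacs-facing path waits on a zmq socket across the WAN.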

I believe this is mostly sufficient for a deadlock-free, minimal-frustration cloud ML development environment on ob-ipython. C-c C-v C-b now works like a charm on the cloud. Thank you so much for a great piece of software!

gregsexton commented 6 years ago

This is an interesting analysis.

I think what you're saying is that zmq here is 'at most once'. That could be the case; I'm not much of a zmq expert. I thought it could be configured to work in many ways and is more of a protocol toolbox than a protocol in its own right. Not sure.

The jupyter application protocol that sits on top certainly seems to rely on reliable messaging.

How did you arrive at this conclusion? Is this based on observing behaviour or did you create a specific test to prove/disprove this?

gregsexton commented 6 years ago

Oh, maybe I misunderstood. Are you saying that IO only is best effort? This might be the case and something I didn't consider.

ghost commented 6 years ago

Unfortunately, I too have virtually no experience with zmq. What I have gathered while dealing with this issue comes from publicly available sources, such as this one apparently from the developers, and also this zmq API doc, which indeed says that messages are either fully delivered or not at all. On the other hand, their RFC includes a section on the Request-Reply pattern, where a message SHALL get delivered exactly once. So apparently the problem is not in zmq itself, but in how it is used in Jupyter/IPython. As I said, the output from the IPython REPL is sent through the IOPub channel, which is a publishing channel that doesn't guarantee message delivery; it's just atomic.

ghost commented 6 years ago

I've come up with a relatively reasonable approach to avoid the above-mentioned deadlock: use the BlockingKernelClient.history() method, which starts communication on the shell channel. Because the shell channel follows the request/reply pattern, it can serve as a somewhat more reliable progress-checking API.
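A rough sketch of that progress check, assuming jupyter_client's BlockingKernelClient shape (`history()` sends a history_request on the shell channel and returns its msg_id; `get_shell_msg(timeout=...)` raises `queue.Empty` on timeout). Unlike an IOPub wait, a timeout here is a meaningful liveness signal, because a shell reply is expected to arrive.

```python
import queue

def kernel_alive_via_shell(client, timeout=5.0):
    """Liveness/progress check over the request/reply shell channel:
    send a history_request and wait for the matching reply. `client`
    is anything with BlockingKernelClient-shaped history() and
    get_shell_msg() methods."""
    msg_id = client.history(hist_access_type="tail", n=1)
    try:
        while True:
            reply = client.get_shell_msg(timeout=timeout)
            if reply.get("parent_header", {}).get("msg_id") == msg_id:
                return True  # shell channel answered: kernel is responsive
    except queue.Empty:
        return False  # no reply within the deadline: treat as stuck/lost
```

Because this never depends on IOPub, it stays usable even when the publish channel is dropping everything.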

ghost commented 6 years ago

I didn't seem to answer your question. Yes, IOPub is basically a best-effort publishing channel. Worse, it doesn't really try its best to deliver IO messages; it only tries its best to minimize latency. Indeed, it drops a lot of IO messages in the case of inter-continental networking. I've managed to implement my own ob-ipython--run-async based on comint-mode + ssh. Could you please take a look at my fork? I think it's much faster and more reliable than the other proposals seen here. If you like it, I will prepare a PR.