jupyter / notebook

Jupyter Interactive Notebook
https://jupyter-notebook.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Websocket ping timeout #1474

Open Jeffalltogether opened 8 years ago

Jeffalltogether commented 8 years ago

I followed these instructions to set up a Jupyter Notebook server on an Amazon EC2 instance. Everything works well, except when I run a block of code that takes a long time to execute (more than 2 or 3 minutes). While the kernel is busy running this code (I can see it executing thanks to a simple progress bar), it suddenly stops and a WebSocket ping timeout error is displayed. These are the messages I receive:

[I 17:22:19.083 NotebookApp] Serving notebooks from local directory: /home/ubuntu/Notebooks
[I 17:22:19.084 NotebookApp] 0 active kernels
[I 17:22:19.084 NotebookApp] The IPython Notebook is running at: https://[all ip addresses on your system]:8888/
[I 17:22:19.084 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 17:22:32.645 NotebookApp] 302 GET / (174.47.174.222) 0.72ms
[I 17:22:32.735 NotebookApp] 302 GET /tree (174.47.174.222) 0.93ms
[I 17:22:37.245 NotebookApp] 302 POST /login?next=%2Ftree (174.47.174.222) 0.92ms
[I 17:22:42.437 NotebookApp] Kernel started: 7b436e11-118d-4c28-9777-ec63baec0b5f
[W 17:24:13.097 NotebookApp] WebSocket ping timeout after 90000 ms.
[E 01:12:29.968 NotebookApp] Uncaught exception GET /api/kernels/7ca196a9-e64b-40dd-bd12-d8bc1a323686/channels?session_id=E2631BE0F605403986ED7D8387A07E99 (174.47.174.222)
    HTTPServerRequest(protocol='https', host='ec2-52-9-221-109.us-west-1.compute.amazonaws.com:8888', method='GET', uri='/api/kernels/7ca196a9-e64b-40dd-bd12-d8bc1a323686/channels?session_id=E2631BE0F605403986ED7D8387A07E99', version='HTTP/1.1', remote_ip='174.47.174.222', headers={'Origin': 'https://ec2-52-9-221-109.us-west-1.compute.amazonaws.com:8888', 'Upgrade': 'Websocket', 'Sec-Websocket-Version': '13', 'Connection': 'Upgrade', 'Sec-Websocket-Key': 'KMt575Kx/659v0lUZdkytA==', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; LCTE; rv:11.0) like Gecko', 'Host': 'ec2-52-9-221-109.us-west-1.compute.amazonaws.com:8888', 'Cookie': 'username-ec2-52-9-221-109-us-west-1-compute-amazonaws-com-8888="2|1:0|10:1463773579|62:username-ec2-52-9-221-109-us-west-1-compute-amazonaws-com-8888|48:ZThmOTlkNWItZGQ1Yy00YjlmLWExNGEtMmEyYzJkODNiMjU2|f323ab003cf23bfc7dd105e73f48ee970b4b26807e5ddaaafe221f9123d1ea65"', 'Cache-Control': 'no-cache'})
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/web.py", line 1401, in _stack_context_handle_exception
        raise_exc_info((type, value, traceback))
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 314, in wrapped
        ret = fn(*args, **kwargs)
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 184, in <lambda>
        self.on_recv(lambda msg: callback(self, msg), copy=copy)
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/notebook/base/zmqhandlers.py", line 188, in _on_zmq_reply
        self.write_message(msg, binary=isinstance(msg, bytes))
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/websocket.py", line 215, in write_message
        raise WebSocketClosedError()
    WebSocketClosedError
[I 17:24:42.542 NotebookApp] Saving file at /Untitled.ipynb

When accessing the server in Chrome or Internet Explorer I get the same messages.

Additionally (albeit very strange), when code is executing on the server my local laptop's CPU utilization goes up about 50%.

Any thoughts on this websocket timeout?

takluyver commented 8 years ago

Websocket pinging is used because certain proxies close a websocket if there are no messages over it for 60 seconds: we send a ping message every 30 seconds, and the browser sends a pong back. This is part of the websocket protocol, so the browser should do it automatically. If we don't get the pong back after 90 seconds, we assume that the connection is lost and kill it.
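The timing described above can be sketched as follows. This is only an illustration of the logic, not the notebook server's actual code: the class name `PingWatchdog` and its methods are hypothetical, and only the 30-second ping interval and the 90-second timeout are taken from the comment.

```python
# Illustrative sketch only; PingWatchdog is a hypothetical name, not notebook's API.
PING_INTERVAL = 30.0   # seconds between pings (from the comment above)
PING_TIMEOUT = 90.0    # seconds without a pong before the socket is closed


class PingWatchdog:
    """Tracks when the last pong arrived and decides what to do next."""

    def __init__(self, now):
        self.last_ping = now
        self.last_pong = now

    def on_pong(self, now):
        # The browser answers each ping with a pong automatically.
        self.last_pong = now

    def tick(self, now):
        """Return 'close', 'ping', or 'wait' for the given time."""
        if now - self.last_pong > PING_TIMEOUT:
            return "close"
        if now - self.last_ping >= PING_INTERVAL:
            self.last_ping = now
            return "ping"
        return "wait"


# A healthy client: pongs keep arriving, so the connection stays open.
healthy = PingWatchdog(now=0.0)
healthy_actions = []
for t in (30.0, 60.0, 90.0, 120.0):
    healthy_actions.append(healthy.tick(t))
    healthy.on_pong(t)

# A silent client: no pong ever arrives, so the socket is closed once
# more than 90 seconds have passed since the last pong.
silent = PingWatchdog(now=0.0)
silent_actions = [silent.tick(t) for t in (30.0, 60.0, 90.0, 120.0)]
```

In this sketch the silent client is pinged three times and then closed, which matches the shape of the "WebSocket ping timeout after 90000 ms" warning in the log above.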

I can't think why code executing would affect that, but it's suspicious that it causes high CPU usage on the client. What progress bar library is the code using? Can you disable the progress bar and see if the behaviour still occurs?

Jeffalltogether commented 8 years ago

The progress bar is just part of the library's function. It's a neural network training routine in Keras (from keras.models import Sequential). As the network trains, it shows progress through the training data with a bar in the Jupyter Notebook cell that looks like this:

Epoch 1/150 768/768 [========================> ] - 2s - loss: 0.6826 - acc: 0.6328

I don't think it has anything to do with this issue in particular as I have seen a number of other people with a similar issue.

It seems that when the notebook is in the (busy) state, the ping/pong messaging does not continue, so code will only run for as long as the websocket does not time out, which is fine for short blocks of code.

I see that in the Jupyter Notebook code on GitHub, in the file zmqhandlers.py, there is a timeout if a message is not received after sending a ping; I believe this is what you mentioned in your reply. I am not familiar with this type of code, but is it possible to override this timeout when the notebook is in the "(busy)" state?

takluyver commented 8 years ago

is it possible to override this timeout when the notebook is in the "(busy)" state?

Not easily, and I'm pretty sure it's the wrong fix, in any case. The fact that the kernel is executing something shouldn't stop the browser from responding to websocket pings.

I have seen a number of other people with a similar issue.

That does look like the same issue that you're seeing, and @minrk is the person most likely to be able to work it out.

Jeffalltogether commented 8 years ago

Thanks I really appreciate your time in looking into this! Let's see if @minrk has any comments.

minrk commented 8 years ago

@Jeffalltogether what version of tornado do you have (might start with pip list or conda list to be safe)? It's odd that this timeout is raising an error in .get(), which should have returned before the timer started. That suggests that something isn't waiting the way we expect it to - perhaps because your tornado is too old, or perhaps it's a new version that changed something out from under us.
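As a complement to `pip list` or `conda list`, the installed versions can also be read from inside a notebook cell. This is a small sketch that assumes Python 3.8 or newer (for `importlib.metadata`), so it would not run on the Python 2.7 environment in this thread; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages that are relevant to the websocket handling discussed here.
for name in ("tornado", "notebook", "pyzmq"):
    print(name, installed_version(name))
```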

Jeffalltogether commented 8 years ago

Thanks for the prompt responses @minrk and @takluyver !!!

As it turns out, @takluyver was on the right track all along in asking about disabling the progress bar: there is an error in the function I was using to train the model, https://github.com/fchollet/keras/issues/2110

I was not receiving the error message in the notebook configuration described in the original post. In an attempt to get it working, I changed the configuration to run behind an Nginx web server on the same EC2 instance, and got the I/O error that was discussed in the link above.

Once I disabled the progress bar, both notebook configurations work.

My tornado version is from an anaconda 2.7 python build showing: tornado 4.3 py27_0 defaults

Regarding the local CPU issue: when the progress bar is disabled on that 'model.fit' function, the local CPU is not affected. However, in the Nginx + Jupyter configuration, the local CPU is not affected whether or not the progress bar is disabled.

If anyone is interested in how I set up the Nginx + Jupyter configuration on the EC2 instance, let me know.

Thanks again for addressing my issue!

takluyver commented 8 years ago

I think that I/O operation on closed file error is one we've seen before, though I forget if it got resolved. I'm still puzzled as to how an error in the kernel could cause the symptoms you describe, and why putting it behind nginx helped.

Jeffalltogether commented 8 years ago

If it helps, I have provided the code I was running in the .ipynb. It's a relatively small standard data set available for anyone to use. Source of the code is: Deep Learning with Python

from keras.models import Sequential
from keras.layers import Dense
import numpy
import pandas

seed = 7
numpy.random.seed(seed)

# Load Pima Indians dataset with Pandas from url
url = "https://goo.gl/vhm1eU"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names = names)
dataset = dataframe.values
print(dataset)

X = dataset[:,0:8]
Y = dataset[:,8]
model = Sequential()
model.add(Dense(12, input_dim=8, init='uniform', activation='relu'))
model.add(Dense(8, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model.
# Adding more epochs increases the number of times the training data is fed into the model
# and increases the time it takes to train.
model.fit(X, Y, nb_epoch=200, batch_size=10, verbose=0)

scores = model.evaluate(X, Y)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Sandy4321 commented 8 years ago

I have the same problem on a Mac, but even without running code the session is closed after a very short time. I use very simple code, and I edit and run it continuously. I use a c3.8xlarge AWS instance. Output from my screen:

[I 14:31:04.007 NotebookApp] Saving file at /sandercode/S_may31_try1.ipynb
[I 14:44:26.956 NotebookApp] Saving file at /sandercode/S_may31_try1.ipynb
sandercode/.Timeout, server 54.82.67.170 not responding.
[W 15:05:27.008 NotebookApp] WebSocket ping timeout after 119965 ms.

Maybe it is a Mac problem, per this comment:

Interesting comment: "Well I just tried connection to my notebook server on AWS from a desktop browser and it works perfectly. Looks like it's some issue with the iOS browsers not allowing a connection with a self-signed certificate (https). I also tested that it works fine with just http." From https://github.com/jupyter/help/issues/23

zweicoder commented 7 years ago

Similarly when running a remote notebook, accessed via an ssh tunnel, closing the ssh tunnel will cause a timeout and stop all computation. (Related Stackoverflow post)

I was hoping to run computationally intensive training operations remotely and reconnect after a while but it seems that this still doesn't work?

ashishsingal1 commented 7 years ago

I was getting this error repeatedly when running a long script that iterated through about 30k loops, each time printing out a completed message. When I commented out the print, I did not get the timeout error -- potential temporary solution.
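If removing the print entirely loses too much information, a middle ground is to report progress only every so often. The sketch below is an assumption-laden illustration: the function name `run_with_throttled_output` and the `report_every` parameter are hypothetical, and the thread does not confirm that output volume alone causes the timeout.

```python
def run_with_throttled_output(n_iterations, report_every=1000, emit=print):
    """Run a loop but report progress only every `report_every` iterations."""
    messages = 0
    for i in range(1, n_iterations + 1):
        # ... the real work for iteration i would go here ...
        if i % report_every == 0 or i == n_iterations:
            emit("completed %d/%d" % (i, n_iterations))
            messages += 1
    return messages


# 30,000 iterations now produce 30 lines of output instead of 30,000.
sent = run_with_throttled_output(30000, report_every=1000)
```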

nateGeorge commented 7 years ago

I'm having the same problem with running a Keras model over ssh, and it's super annoying. Could we change the timeout for cells as in here: http://nbconvert.readthedocs.io/en/stable/execute_api.html ?

You can also send the output to a file like so:

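import sys
original_stdout = sys.stdout  # in a notebook, sys.__stdout__ is not the cell output
sys.stdout = open('keras_output.txt', 'w')
history = model.fit(X, y_cat, batch_size=128, nb_epoch=200, verbose=1)
sys.stdout.close()
sys.stdout = original_stdout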

That worked for me. http://stackoverflow.com/questions/4675728/redirect-stdout-to-a-file-in-python

Or you could turn the verbosity option to 0
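A variant of the redirect idea, sketched here under the assumption of Python 3.4 or newer, uses `contextlib.redirect_stdout` so that stdout is restored automatically even if training raises. The function `noisy_training_step` is a hypothetical stand-in for a call such as `model.fit(..., verbose=1)`, and the file is written to the system temp directory purely for the example.

```python
import contextlib
import os
import tempfile


def noisy_training_step():
    # Stand-in for a call such as model.fit(..., verbose=1) that prints a lot.
    for epoch in range(3):
        print("Epoch %d/3" % (epoch + 1))
    return "done"


log_path = os.path.join(tempfile.gettempdir(), "keras_output.txt")

# stdout is redirected only inside the with-block and is restored afterwards,
# even if the wrapped call raises an exception.
with open(log_path, "w") as log_file, contextlib.redirect_stdout(log_file):
    result = noisy_training_step()

print("training finished:", result)
```

The progress lines end up in the file, and only the final message reaches the notebook cell.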

brianlan commented 7 years ago

@nateGeorge Thanks, the redirect solution worked for me.

simonm3 commented 7 years ago

A lot of people run Jupyter with Keras and this bug is over a year old now. Is there any way of getting this fixed, as it is a big pain running Keras models with no progress bars? I note it says "needs info" but I'm not sure what that means.

nateGeorge commented 7 years ago

Could also use keras-tqdm maybe https://github.com/bstriner/keras-tqdm

simonm3 commented 7 years ago

I do use tqdm. But it is an extra step to specify tqdm progress bars on every call to fit functions; and another step to save the notebook with the bars. Does not feel like a proper solution.

Is it hard to fix?

simonm3 commented 7 years ago

This bug is not just an issue with fit but also with downloading applications, where the progress bar cannot be turned off, e.g.:

from keras.applications.vgg16 import VGG16

jlintusaari commented 7 years ago

This WebSocket ping timeout also occurs frequently if you output a considerable amount of logging messages to the cell in a long running job (not with Keras).
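For jobs that log heavily, one possible workaround (not confirmed in this thread as a fix) is to route log records to a file rather than to the cell output. The sketch below uses only the standard `logging` module; the logger name `long_job` and the log file location are assumptions made for the example.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.gettempdir(), "long_job.log")

logger = logging.getLogger("long_job")   # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False                 # keep records away from root handlers

# Write records to a file instead of the notebook cell.
file_handler = logging.FileHandler(log_path, mode="w")
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(file_handler)

for step in range(5):
    logger.info("finished step %d", step)

file_handler.flush()
```

The file can then be inspected from a terminal (for example with `tail -f`) while the job runs.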

saksham789 commented 6 years ago

Hey, I am using Jupyter Notebook on a local server, but it crashes very frequently with this error:

Websocket ping timeout after 1290004 ms

This happens with any random piece of code in the script. A potential solution I found was starting the browser again, but it's very annoying going through the process time and again. Please help!

kaiaeberli commented 6 years ago

Try disabling Windows Firewall for your private network (if you are running the notebook on localhost:8888); that solved it for me.

CodeOfStoyan commented 5 years ago

I am pasting what worked for me (using the keras_tqdm library):

from keras_tqdm import TQDMNotebookCallback

# keras imports, model definition...

model.fit(X_train, Y_train, verbose=0, callbacks=[TQDMNotebookCallback()])

dkyol commented 5 years ago

I received the following. Any thoughts?

[W 18:03:00.631 NotebookApp] WebSocket ping timeout after 116328 ms.

^C[E 18:04:00.540 NotebookApp] Exception in callback (<socket._socketobject object at 0x7fefd0187440>, <function null_wrapper at 0x7fefd0123938>)
    Traceback (most recent call last):
      File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/tornado/ioloop.py", line 1073, in start
        handler_func(fd_obj, events)
      File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
        return fn(*args, **kwargs)
      File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/tornado/netutil.py", line 249, in accept_handler
        connection, address = sock.accept()
      File "/home/ec2-user/anaconda2/lib/python2.7/socket.py", line 207, in accept
        return _socketobject(_sock=sock), addr
      File "/home/ec2-user/anaconda2/lib/python2.7/socket.py", line 194, in __init__
        setattr(self, method, getattr(_sock, method))
      File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/notebook/notebookapp.py", line 1483, in _handle_sigint
        thread.start()
      File "/home/ec2-user/anaconda2/lib/python2.7/threading.py", line 736, in start
        _start_new_thread(self.__bootstrap, ())
    error: can't start new thread

I removed a for loop as discussed above and loaded a pickle file, and stopped receiving a timeout. Thanks

dshakey commented 5 years ago

Any update on this issue? I'm getting it in a k8s JupyterLab pod.

maya-harel commented 4 years ago

Hi, I am using jupyter/datascience-notebook:1386e2046833 on JupyterHub (with EKS on AWS) and am still having this issue.

I'm getting kernel restarts after a timeout from the websocket:

SingleUserLabApp zmqhandlers:182] WebSocket ping timeout after 90002 ms
SingleUserLabApp kernelmanager:217] Starting buffering for 533a90f9-e00f-4019-8044-59727faba7a5:de0312ca-a1f7-478b-a24a-1fe22593ec5f
kernelmanager:172] Kernel started: 1d7deac9-4b49-49c4-913c-490b6cb1d754

Sandy4321 commented 4 years ago

Good question

jhgoebbert commented 4 years ago

We came across the issue with SingleUserLabApp zmqhandlers:182] WebSocket ping timeout after 90002 ms, too, and found that the cause was the location of the notebook file: a directory with thousands of files and ~80GB in size.

While JupyterLab (not the notebook code) analyses all the content of the directory, other important functionality is simply blocked and gets no compute time.
It is currently unclear to me what part/extension of JupyterLab is responsible for this.
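The general effect described here, where one slow synchronous task starves everything else on the same event loop, can be reproduced in miniature. This is an analogy only, assuming Python 3.7 or newer: it uses plain `asyncio` rather than Jupyter's server code, `heartbeat` and `blocking_scan` are hypothetical names, and it does not identify which JupyterLab component is involved.

```python
import asyncio
import time


async def heartbeat(gaps, interval=0.05, beats=4):
    """Record the real time between consecutive heartbeats."""
    last = time.monotonic()
    for _ in range(beats):
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def blocking_scan(duration=0.3):
    # Stand-in for slow synchronous work, e.g. listing a huge directory.
    await asyncio.sleep(0.06)   # let the first heartbeat through
    time.sleep(duration)        # blocks the whole event loop


async def main():
    gaps = []
    await asyncio.gather(heartbeat(gaps), blocking_scan())
    return gaps


gaps = asyncio.run(main())
print(["%.2f" % g for g in gaps])
```

One of the recorded gaps is roughly as long as the blocking call instead of the intended 0.05 seconds. A ping handler delayed in the same way for more than 90 seconds would produce the timeout seen in the logs.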

SovereignRemedy commented 4 years ago

How can this be solved? I often encounter this problem: WebSocket ping timeout and the Lab terminal not responding.

My environment: JupyterHub 0.81, JupyterLab 1.2.9, Notebook 5.7.8

rapidnitin commented 1 year ago

Any update on this issue?