jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Exception writing to websocket #516

Closed amangarg96 closed 5 years ago

amangarg96 commented 5 years ago

I am using Jupyter Enterprise Gateway in YARN Cluster mode, with a slight modification.

An NGINX proxy sits in front of multiple Jupyter Enterprise Gateway servers and routes users to different gateway servers by hashing the client machine's hostname. (This was done as a quick fix for #86.)
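Roughly, the proxy setup looks like the sketch below. This is a simplified illustration, not our actual config: the upstream names, ports, and hash key are placeholders. The `Upgrade`/`Connection` headers are what NGINX needs in order to proxy the kernel websocket channels alongside the REST calls.

```nginx
# Hypothetical sketch of the affinity setup: pin each client to one EG
# instance by hashing a stable key. Hostnames and ports are placeholders.
upstream enterprise_gateway {
    hash $remote_addr consistent;   # or a custom header carrying the client hostname
    server eg-node-1:8888;
    server eg-node-2:8888;
}

server {
    listen 9090;
    location / {
        proxy_pass http://enterprise_gateway;
        # Required for websocket upgrades to pass through the proxy:
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400;   # keep idle WS connections open
    }
}
```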

When the connection to the Gateway server is lost, the Notebook server tries to reconnect to the previous kernel. The Notebook server logs are as follows:

[I 19:11:55.965 LabApp] Request kernel at: /api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3
[I 19:11:56.058 LabApp] Kernel retrieved: {u'connections': 1, u'last_activity': u'2018-12-04T13:26:03.683070Z', u'execution_state': u'idle', u'id': u'fd5ac87c-378a-4531-a25d-8b0e858077a3', u'name': u'mlp.mlc.linux_debian_8_11.py27.mlc-base-xgboost'}
[I 19:11:58.964 LabApp] Request list kernel specs at: /api/kernelspecs
[I 19:17:41.815 LabApp] Request kernel at: /api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3
[I 19:17:41.898 LabApp] Kernel retrieved: {u'connections': 0, u'last_activity': u'2018-12-04T13:26:03.683070Z', u'execution_state': u'idle', u'id': u'fd5ac87c-378a-4531-a25d-8b0e858077a3', u'name': u'mlp.mlc.linux_debian_8_11.py27.mlc-base-xgboost'}
[I 19:18:11.819 LabApp] Request kernel at: /api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3
[I 19:18:11.901 LabApp] Kernel retrieved: {u'connections': 0, u'last_activity': u'2018-12-04T13:26:03.683070Z', u'execution_state': u'idle', u'id': u'fd5ac87c-378a-4531-a25d-8b0e858077a3', u'name': u'mlp.mlc.linux_debian_8_11.py27.mlc-base-xgboost'}
[I 19:18:22.711 LabApp] Request kernel at: /api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3
[I 19:18:22.798 LabApp] Kernel retrieved: {u'connections': 0, u'last_activity': u'2018-12-04T13:26:03.683070Z', u'execution_state': u'idle', u'id': u'fd5ac87c-378a-4531-a25d-8b0e858077a3', u'name': u'mlp.mlc.linux_debian_8_11.py27.mlc-base-xgboost'}
[E 19:19:16.312 LabApp] Exception writing message to websocket:
[E 19:19:17.704 LabApp] Exception writing message to websocket:
[E 19:19:18.872 LabApp] Exception writing message to websocket:
[I 19:19:28.890 LabApp] Saving file at .ipynb
[W 19:19:28.891 LabApp] Notebook .ipynb is not trusted
[I 19:19:36.813 LabApp] Request list kernel specs at: /api/kernelspecs
[I 19:20:37.810 LabApp] Request list kernel specs at: /api/kernelspecs
[E 19:20:39.373 LabApp] Exception writing message to websocket:

The Notebook server seems to be able to contact the kernel through REST calls, but it's not able to connect to the websocket.

On the notebook, it shows that the kernel is active (in the idle state), but it doesn't execute cells.

The following step is what is missing from the reconnection attempt, right?

Connecting to ws://10.33.11.68:9090/api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3/channels

Does it have something to do with the NGINX proxy?

Also, which messages are sent through websockets? Is it possible to switch the Notebook server and Enterprise Gateway to plain HTTP (since the REST calls are working just fine)? If so, which notebook functionality would be affected? Pointers to the documentation would also help.

kevin-bates commented 5 years ago

@amangarg96 - thanks for the issue - another interesting issue from you.

Use of a reverse proxy is something we recommend, so this shouldn't be a problem. In addition, your use of the client machine as the affinity key seems fine as well.

I'm curious what your EG log indicates during this period. Perhaps there's something there, or, for that matter, in the kernel-specific logs maintained in YARN. Please check those for any clues. Also, have you tried a forced reconnect operation from the notebook?

Regarding the separation of duties between HTTP and WS: the HTTP requests essentially invoke the various manager classes to get, start, interrupt, etc. a specific kernel instance (or all of them). These requests do not (necessarily) go directly to the kernel process itself. (Of course, things like interrupt or restart will implicitly trigger interaction with the kernel process.)

The WS request handler communicates directly with the kernel. The channel embedded in the JSON body of the message indicates which ZMQ port the request should be posted to. Taking a look at our gateway_client.py file might help shed some light here. Unfortunately, I'm not enough of a web developer to answer your question about switching, although I suspect there's a reason it wasn't done that way in the first place. A quick google of their differences gives good reasons why WS is used - in particular, it's bi-directional, full-duplex, and has far less overhead.
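To illustrate the channel routing, here's a minimal sketch. The message layout follows the standard Jupyter messaging format; `route` is a hypothetical stand-in for what the WS handler does when it forwards a message to the kernel's ZMQ sockets (the port numbers echo the connection info from your stdout log, purely as an example).

```python
import uuid

# Sketch of a kernel message as carried over the notebook websocket.
# The "channel" field tells the gateway which ZMQ socket (shell, control,
# stdin, iopub) the message should be forwarded to.
def make_execute_request(code):
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "msg_type": "execute_request",
            "version": "5.2",
        },
        "parent_header": {},
        "metadata": {},
        "content": {"code": code, "silent": False},
        "channel": "shell",  # execute requests go to the shell channel
    }

# Hypothetical stand-in for the WS handler's routing decision.
def route(msg, zmq_ports):
    return zmq_ports[msg["channel"]]

msg = make_execute_request("1 + 1")
ports = {"shell": 62515, "control": 9208, "stdin": 16842, "iopub": 13844}
print(route(msg, ports))  # → 62515
```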

One area that will make things a little difficult, should you find you need to make changes, is that EG doesn't define any handlers - all are inherited from the Kernel Gateway and Notebook projects, so this may open a can of worms for you. That said, it would be fine to define a subclass in EG that derives from the class you need to change, assuming a change of that magnitude is warranted and can be done in a relatively clean way.
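For illustration, the subclass-and-override pattern would look something like this. A stand-in base class is used here since the real handler (the channels handler servicing `/api/kernels/<id>/channels`) lives in the Notebook project, not in EG; the class and method names below are purely illustrative.

```python
# Stand-in for the inherited handler class from the Notebook project.
class ChannelsHandlerStub:
    def on_message(self, msg):
        # In the real handler this forwards the message to the kernel's
        # ZMQ sockets; here we just echo for illustration.
        return ("forwarded", msg)

# EG-side subclass overriding just the behavior you need to change,
# then delegating back to the inherited implementation.
class PatchedChannelsHandler(ChannelsHandlerStub):
    def on_message(self, msg):
        # custom logic here (e.g. extra logging, reconnect bookkeeping)
        return super().on_message(msg)
```

EG would then register the subclass in place of the inherited handler, leaving the upstream projects untouched.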

amangarg96 commented 5 years ago

I reproduced the above error and checked the kernel-specific logs in YARN. The stdout and stderr look fine to me; I'm putting them here for your reference.

stdout:

Using connection file '/tmp/kernel-0bc9ede3-229d-48b2-8025-5aaac8607002_IPOgwT.json' instead of '/root/.local/share/jupyter/runtime/kernel-0bc9ede3-229d-48b2-8025-5aaac8607002.json'
Signal socket bound to host: 0.0.0.0, port: 41173
JSON Payload '{"stdin_port": 16842, "pgid": "2541", "ip": "0.0.0.0", "pid": "2593", "control_port": 9208, "hb_port": 46206, "signature_scheme": "hmac-sha256", "key": "11002c92-6daa-47f8-908f-d26d611eb300", "comm_port": 41173, "kernel_name": "", "shell_port": 62515, "transport": "tcp", "iopub_port": 13844}
Encrypted Payload '1jWOPQH9wzE/6bjr7pJAv9bAC9MXhkvEMx0iDcZlniOyHkrRWl4e7avhMfD15yectln5b6uwhi3HGzukyqe+85BjboBlkzKNpVGRY56Y7Qf6i6k253NV2aWOi3V/9ry//bHXXR7pg5XIqxVzyQgzFl5xH+Edam8n9irNS6a1tnjtYcBQ/eH52LYiH2gtWe60JCcj2xAFNIteypVgZCrVgJYufow2RYLnlsCQAK1WLNaLPf02DehBmjtw/PfDEi0zHl4RDRPbLaG2lCzTnx3VHyADezyK3zXAhmblt55QA9tvmfMsCiB3HUaRWnfOlMTSfkSSdbXFdn0oQE0o9jLDCLw2ppAD9Cw+BQ8KNrXoH0DHIPwSFOPCj2TcQY3nZ1VV++Z9vUV92MADyTkKuBXxWA==
/grid/1/yarn/local/usercache/fk-mlp-user/appcache/application_1543928474139_24402/container_e238_1543928474139_24402_01_000001/mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz/lib/python2.7/site-packages/IPython/paths.py:69: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
  " using a temp directory.".format(parent))
NOTE: When using the ipython kernel entry point, Ctrl-C will not work.

To exit, you will have to explicitly quit this process, by either sending "quit" from a client, or using Ctrl-\ in UNIX-like environments.

To read more about this, see https://github.com/ipython/ipython/issues/2049

To connect another client to this kernel, use: --existing /tmp/kernel-0bc9ede3-229d-48b2-8025-5aaac8607002_IPOgwT.json

stderr:

YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}{{PWD}}/spark_conf{{PWD}}/spark_libs/$HADOOP_CONF_DIR/usr/hdp/current/hadoop-client//usr/hdp/current/hadoop-client/lib//usr/hdp/current/hadoop-hdfs-client//usr/hdp/current/hadoop-hdfs-client/lib//usr/hdp/current/hadoop-yarn-client//usr/hdp/current/hadoop-yarn-client/lib/$PWD/mr-framework/hadoop/share/hadoop/mapreduce/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/:$PWD/mr-framework/hadoop/share/hadoop/common/:$PWD/mr-framework/hadoop/share/hadoop/common/lib/:$PWD/mr-framework/hadoop/share/hadoop/yarn/:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.4.0.0-169.jar:/etc/hadoop/conf/secure
    SPARK_YARN_STAGING_DIR -> (redacted)
    SPARK_USER -> (redacted)
    SPARK_YARN_MODE -> true
    PYTHONPATH -> {{PWD}}/pyspark.zip{{PWD}}/py4j-0.10.4-src.zip{{PWD}}/mlsdk.zip

  command:
    {{JAVA_HOME}}/bin/java \
      -server \
      -Xmx10240m \
      -Djava.io.tmpdir={{PWD}}/tmp \
      -Dspark.yarn.app.container.log.dir= \
      -XX:OnOutOfMemoryError='kill %p' \
      org.apache.spark.executor.CoarseGrainedExecutorBackend \
      --driver-url \
      spark://CoarseGrainedScheduler@10.32.194.211:11004 \
      --executor-id \
      \
      --hostname \
      \
      --cores \
      1 \
      --app-id \
      application_1543928474139_24402 \
      --user-class-path \
      file:$PWD/__app__.jar \
      1>/stdout \
      2>/stderr
  resources:
    py4j-0.10.4-src.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/py4j-0.10.4-src.zip" } size: 74096 timestamp: 1544082647339 type: FILE visibility: PRIVATE
    mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz" } size: 1683630994 timestamp: 1544082645796 type: ARCHIVE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/__spark_conf__.zip" } size: 102062 timestamp: 1544082647415 type: ARCHIVE visibility: PRIVATE
    pyspark.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/pyspark.zip" } size: 482687 timestamp: 1544082645943 type: FILE visibility: PRIVATE
    __spark_libs__ -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/__spark_libs__8138412549744268625.zip" } size: 203821116 timestamp: 1544082638970 type: ARCHIVE visibility: PRIVATE
    mlsdk.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/notebooks/mlsdk/mlsdk.zip" } size: 113787 timestamp: 1543586447744 type: FILE visibility: PUBLIC
===============================================================================
18/12/06 13:23:35 INFO yarn.YarnRMClient: Registering the ApplicationMaster
18/12/06 13:23:35 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
18/12/06 13:23:35 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals

amangarg96 commented 5 years ago

I hadn't tried the reconnect option because I couldn't find it in JupyterLab. So I launched the same notebook (kernel) in the classic Notebook interface (through JupyterLab). Then I tried reconnect, and it worked!

I got the "connecting to websocket" message in the Notebook server log too:

[I 14:43:05.903 LabApp] Connecting to ws://10.34.162.133:9090/api/kernels/0bc9ede3-229d-48b2-8025-5aaac8607002/channels

It's surprising that the reconnect option has been left out of the JupyterLab menus.

This seems to have solved my use case. Is there anything else we should troubleshoot?

Update: The 'Reconnect to kernel' option is available in the command palette of JupyterLab.

kevin-bates commented 5 years ago

Thanks for that update. I was just about to post a question on the jupyterlab gitter forum. They sure don't make that easy to find!

Are you satisfied with this behavior? We'll likely revisit this area when we go to implement a robust HA solution.

amangarg96 commented 5 years ago

When the Notebook server polls the state of the kernel (through REST calls), it should ideally attempt the reconnect to the kernel (over websockets) too. A reconnect attempt is (I think) a harmless activity that does not interfere with the state of the kernel, so it should be invoked automatically.
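Sketching the idea: whenever the REST poll shows the kernel is alive but the websocket is down, retry the WS connection with backoff. Here `poll_kernel` and `open_websocket` are hypothetical stand-ins for the client's REST poll and WS connect calls, not real Notebook APIs.

```python
import time

def reconnect_with_backoff(poll_kernel, open_websocket, retries=5, delay=1.0):
    """Retry the websocket while the kernel still answers REST polls.

    poll_kernel() -> bool: True if the kernel is still alive (REST).
    open_websocket() -> connection or None: attempt the WS connect.
    """
    for _ in range(retries):
        if not poll_kernel():
            return None              # kernel gone: nothing to reconnect to
        ws = open_websocket()        # harmless to retry; no kernel state change
        if ws is not None:
            return ws
        time.sleep(delay)
        delay = min(delay * 2, 30)   # exponential backoff, capped
    return None
```

The key property is the one argued above: the retry only opens a connection, so repeating it never changes kernel state.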

kevin-bates commented 5 years ago

Sounds like a good suggestion/contribution to the Notebook server. 😃

Since this is sounding more like a client-side issue, I'm inclined to close this issue for now. Should any activity occur in Notebook/Lab, we can post a reference here.

Are you okay with closure?

amangarg96 commented 5 years ago

Yes! I'm happy with the resolution. Thanks for the help :)