jupyter / nb2kg

Timeout during request in several conditions #34

Closed Litchilitchy closed 5 years ago

Litchilitchy commented 5 years ago

I am using nb2kg (installed via pip install "git+https://github.com/jupyter-incubator/nb2kg.git#egg=nb2kg") in a single-user JupyterHub server, so I assume the behavior is the same as with a standalone Jupyter Notebook.

I have found several conditions under which I get a Timeout error, and I suspect they are caused by nb2kg:

1) Opening multiple notebooks at the same time. For example, the first notebook is created via the "New" button, but before its kernel is ready I open a second one. In this case, at most one kernel becomes ready and the others get the error below.

2) Using --jars in the Spark config to upload a large file (about 350 MB in my case). This time I created the notebooks one by one, not creating the next until the previous one was ready or had failed. I created 5 notebooks; only the second one worked, and the others got the error.

The error seems to be the same in both cases:

Traceback (most recent call last):
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/web.py", line 1699, in _execute
    result = await result
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/work/conda/lib/python3.6/site-packages/notebook/services/sessions/handlers.py", line 73, in post
    type=mtype))
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/work/conda/lib/python3.6/site-packages/nb2kg/managers.py", line 370, in create_session
    session_id, path, name, type, kernel_name,
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/work/conda/lib/python3.6/site-packages/notebook/services/sessions/sessionmanager.py", line 92, in start_kernel_for_session
    self.kernel_manager.start_kernel(path=kernel_path, kernel_name=kernel_name)
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/work/conda/lib/python3.6/site-packages/nb2kg/managers.py", line 150, in start_kernel
    response = yield fetch_kg(self.kernels_endpoint, method='POST', body=json_body)
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/work/conda/lib/python3.6/site-packages/nb2kg/managers.py", line 65, in fetch_kg
    response = yield client.fetch(url, **kwargs)
  File "/opt/work/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.simple_httpclient.HTTPTimeoutError: Timeout during request

Any ideas?

By the way, I forked the latest nb2kg and found that client.fetch does not actually call future.result(). Is the version of nb2kg installed via pip obsolete?
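(For context on why future.result() appears in the traceback even though fetch_kg never calls it: tornado's @gen.coroutine machinery drives the generator and calls result() on each yielded future on the coroutine's behalf. Below is a minimal plain-Python analogue of that runner loop; it is a sketch for illustration, not tornado's actual implementation.)

```python
import concurrent.futures

def make_request():
    # Stand-in for client.fetch: returns an already-resolved future.
    f = concurrent.futures.Future()
    f.set_result('200 OK')
    return f

def fetch_kg():
    # Stand-in for nb2kg's fetch_kg coroutine: yields a future, never
    # touches future.result() itself.
    response = yield make_request()
    return response

def run(gen):
    # Minimal analogue of tornado's gen.Runner for a single-yield
    # coroutine: the runner, not user code, calls future.result(),
    # which is why result() shows up in the traceback.
    future = next(gen)
    try:
        gen.send(future.result())
    except StopIteration as stop:
        return stop.value

print(run(fetch_kg()))  # prints: 200 OK
```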

Litchilitchy commented 5 years ago

I guess issue 2) may time out because uploading a big file takes some time. However, I set the environment variable KG_REQUEST_TIMEOUT to a very large value (300000), and it does not seem to work.

kevin-bates commented 5 years ago

There are a number of reasons you can experience timeouts during kernel startup, especially when remote kernels are in play via Enterprise Gateway. To better diagnose the nature of these issues, please provide the complete log file (with DEBUG enabled) from the Enterprise Gateway instance. We'd like the entire file so we can see the complete request, from start through the error.

Here are a few items worth noting...

  1. Kernel startup is currently synchronous! As a result, if two kernels are started at roughly the same time, the second request will not be processed until the first has completed. This issue is systemic to the Jupyter stack. I've submitted PRs to jupyter_client, notebook, and Enterprise Gateway to introduce a means of switching over to async methods, but each of the lower-level PRs must first appear in a release before the next level can take advantage of it. In addition, the base (jupyter_client) PR is not getting traction right now, and it needs to land first. This set of PRs is critical to Enterprise Gateway.
  2. Although KG_REQUEST_TIMEOUT should be increased in these kinds of scenarios and is used by the underlying HTTP classes, Enterprise Gateway uses KERNEL_LAUNCH_TIMEOUT to determine when to "give up" on monitoring whether the remote kernel has started. It is recommended that KERNEL_LAUNCH_TIMEOUT be set to KG_REQUEST_TIMEOUT, although, technically, it should be slightly less, because you don't want the request to time out prior to the kernel launch. The default for KERNEL_LAUNCH_TIMEOUT is 30 seconds, so if you're seeing your Timeout error after roughly 30 seconds, that's why. (A valid addition for NB2KG would be to set KERNEL_LAUNCH_TIMEOUT to KG_REQUEST_TIMEOUT iff the former is not already set in the env and the latter is.) Just remember that very large timeouts delay the startup of subsequent kernels during that time, until we can get the async PRs merged. 🙁
  3. Resources may be limited. We see this in YARN, where after a number of kernels are running, attempts to start a subsequent kernel can time out if there are no resources available for YARN to start the application (kernel). This may require increasing resources or tuning your kernelspec files, for example adjusting SPARK_OPTS to use fewer resources.
  4. Configuration issues on the EG side. For example, using the DistributedProcessProxy without password-less ssh configured, or missing IPython or R kernel installations on the worker nodes, etc. That said, these will typically result in 500 error codes.
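The addition suggested in item 2 could be sketched as follows. This is an illustrative sketch only, not nb2kg code; the function name and the 2-second safety margin are hypothetical.

```python
import os

def reconcile_kernel_launch_timeout(env):
    """Sketch: if KERNEL_LAUNCH_TIMEOUT is not already set in the
    environment but KG_REQUEST_TIMEOUT is, derive the former from the
    latter, keeping it slightly below so the HTTP request does not time
    out before the kernel-launch monitor gives up."""
    if 'KERNEL_LAUNCH_TIMEOUT' not in env and 'KG_REQUEST_TIMEOUT' in env:
        request_timeout = float(env['KG_REQUEST_TIMEOUT'])
        # Hypothetical 2-second margin; never drop below 1 second.
        env['KERNEL_LAUNCH_TIMEOUT'] = str(max(request_timeout - 2, 1))
    return env

env = {'KG_REQUEST_TIMEOUT': '300'}
reconcile_kernel_launch_timeout(env)
print(env['KERNEL_LAUNCH_TIMEOUT'])  # prints: 298.0
```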

At any rate, if you can post your EG log (with DEBUG enabled) we can take a closer look, although setting KERNEL_LAUNCH_TIMEOUT may get you moving forward.
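The timeout tuning described above could look something like this before launching the nb2kg-enabled notebook server. A sketch only; the values are illustrative (seconds), with the launch timeout kept slightly below the request timeout per the advice above.

```shell
# Illustrative values, in seconds; adjust to your environment.
export KG_REQUEST_TIMEOUT=300       # used by nb2kg's HTTP client
export KERNEL_LAUNCH_TIMEOUT=295    # used by Enterprise Gateway; slightly less
echo "request=${KG_REQUEST_TIMEOUT} launch=${KERNEL_LAUNCH_TIMEOUT}"
```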

Litchilitchy commented 5 years ago

Here is the full log:

[D 2019-04-26 10:23:05.617 EnterpriseGatewayApp] Searching ['/opt/work', '/root/.jupyter', '/opt/work/conda/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'] for config files
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /etc/jupyter
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /usr/local/etc/jupyter
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /opt/work/conda/etc/jupyter
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /root/.jupyter
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /opt/work
[D 2019-04-26 10:23:05.619 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /etc/jupyter
[D 2019-04-26 10:23:05.620 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /usr/local/etc/jupyter
[D 2019-04-26 10:23:05.620 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /opt/work/conda/etc/jupyter
[D 2019-04-26 10:23:05.620 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /root/.jupyter
[D 2019-04-26 10:23:05.620 EnterpriseGatewayApp] Loaded config file: /root/.jupyter/jupyter_enterprise_gateway_config.py
[D 2019-04-26 10:23:05.621 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /opt/work
[D 190426 10:23:05 selector_events:65] Using selector: EpollSelector
[I 2019-04-26 10:23:05.632 EnterpriseGatewayApp] Jupyter Enterprise Gateway 1.2.0 is available at http://0.0.0.0:8888
[D 2019-04-26 10:23:15.643 EnterpriseGatewayApp] Found kernel python3 in /opt/work/conda/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_r_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_python_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_python_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_r_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_scala_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_python_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_scala_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel python_distributed in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_r_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_scala_yarn_cluster in /usr/local/share/jupyter/kernels
[I 190426 10:23:15 web:2246] 200 GET /api/kernelspecs (127.0.0.1) 1115.14ms
[D 2019-04-26 10:23:18.218 EnterpriseGatewayApp] Found kernel python3 in /opt/work/conda/share/jupyter/kernels
[D 2019-04-26 10:23:18.219 EnterpriseGatewayApp] Found kernel spark_r_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.219 EnterpriseGatewayApp] Found kernel spark_python_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.219 EnterpriseGatewayApp] Found kernel spark_python_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_r_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_scala_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_python_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_scala_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel python_distributed in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_r_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_scala_yarn_cluster in /usr/local/share/jupyter/kernels
[I 190426 10:23:18 web:2246] 200 GET /api/kernelspecs (127.0.0.1) 23.02ms
[D 2019-04-26 10:23:18.365 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: spark_python_yarn_cluster, kernel_username: root
[D 2019-04-26 10:23:18.392 EnterpriseGatewayApp] Instantiating kernel 'Spark - Python (YARN Cluster Mode)' with process proxy: enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy
[D 2019-04-26 10:23:18.508 EnterpriseGatewayApp] Response socket launched on 172.168.2.121, port: 53670 using 300000.0s timeout
[D 2019-04-26 10:23:18.508 EnterpriseGatewayApp] YarnClusterProcessProxy shutdown wait time adjusted to 15.0 seconds.
[D 2019-04-26 10:23:18.509 EnterpriseGatewayApp] Starting kernel: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json', '--RemoteProcessProxy.response-address', '172.168.2.121:53670', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2019-04-26 10:23:18.510 EnterpriseGatewayApp] Launching kernel: Spark - Python (YARN Cluster Mode) with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json', '--RemoteProcessProxy.response-address', '172.168.2.121:53670', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2019-04-26 10:23:18.510 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/conda/bin:/opt/work/conda/condabin:/opt/work/hadoop-2.7.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk/bin', 'KERNEL_WORKING_DIR': '/home/s', 'KERNEL_USERNAME': 'root', 'SPARK_HOME': '/opt/work/spark-2.4.0-bin-hadoop2.7', 'PYSPARK_PYTHON': '/opt/work/conda/bin/python3', 'PYTHONPATH': '/opt/work/conda/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip', 'SPARK_OPTS': '--executor-cores 1 --jars /opt/work/lib/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar --master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/${KERNEL_USERNAME}/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:$PATH', 'LAUNCH_OPTS': '', 'KERNEL_GATEWAY': '1', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', 'EG_IMPERSONATION_ENABLED': 'False'}
[D 2019-04-26 10:23:18.515 EnterpriseGatewayApp] Yarn cluster kernel launched using YARN endpoint: http://Gondolin-Node-021:8188/ws/v1/cluster, pid: 93033, Kernel ID: 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json', '--RemoteProcessProxy.response-address', '172.168.2.121:53670', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'

Starting IPython kernel for Spark in Yarn Cluster mode on behalf of user root

+ eval exec /opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit '--executor-cores 1 --jars /opt/work/lib/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar --master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/${KERNEL_USERNAME}/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:$PATH' '' /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py '' /root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json --RemoteProcessProxy.response-address 172.168.2.121:53670 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
++ exec /opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --executor-cores 1 --jars /opt/work/lib/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar --master yarn --deploy-mode cluster --name 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/root/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/conda/bin:/opt/work/conda/condabin:/opt/work/hadoop-2.7.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk/bin /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py /root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json --RemoteProcessProxy.response-address 172.168.2.121:53670 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
[I 190426 10:23:19 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:19 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:19 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:19.091 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:19 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:19 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:19 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:19.602 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:20 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[I 190426 10:23:20 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:20 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:20 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:20.109 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:20 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:20 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:20 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:20.616 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:20 INFO  RMProxy:98 - Connecting to ResourceManager at Gondolin-Node-021/172.168.2.121:8032
[I 190426 10:23:21 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:21 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:21 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:21.123 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:21 INFO  Client:54 - Requesting a new application from cluster with 2 NodeManagers
2019-04-26 10:23:21 INFO  Client:54 - Verifying our application has not requested more than the maximum memory capability of the cluster (200000 MB per container)
2019-04-26 10:23:21 INFO  Client:54 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
2019-04-26 10:23:21 INFO  Client:54 - Setting up container launch context for our AM
2019-04-26 10:23:21 INFO  Client:54 - Setting up the launch environment for our AM container
2019-04-26 10:23:21 INFO  Client:54 - Preparing resources for our AM container
2019-04-26 10:23:21 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[I 190426 10:23:21 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:21 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:21 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:21.630 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:22 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:22 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:22 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:22.137 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:22 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:22 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:22 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:22.654 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:23.162 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:23.673 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:24 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:24 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:24 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:24.181 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:24 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:24 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:24 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:24.688 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:24 INFO  Client:54 - Uploading resource file:/tmp/spark-46ee3bd0-c174-4f86-8a9b-0ea594331508/__spark_libs__18863826034816602.zip -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/__spark_libs__18863826034816602.zip
[I 190426 10:23:25 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:25 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:25 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:25.195 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:25 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:25 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:25 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:25.706 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:26 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:26 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:26 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:26.215 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:26 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:26 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:26 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:26.723 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:27 INFO  Client:54 - Uploading resource file:/opt/work/lib/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar
[I 190426 10:23:27 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:27 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:27 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:27.230 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:27 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:27 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:27 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:27.740 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:28 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:28 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:28 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:28.249 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:28 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:28 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:28 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:28.757 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:29 INFO  Client:54 - Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/launch_ipykernel.py
2019-04-26 10:23:29 INFO  Client:54 - Uploading resource file:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/pyspark.zip
2019-04-26 10:23:29 INFO  Client:54 - Uploading resource file:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/py4j-0.10.7-src.zip
[I 190426 10:23:29 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:29 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:29 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:29.264 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:29 INFO  Client:54 - Uploading resource file:/tmp/spark-46ee3bd0-c174-4f86-8a9b-0ea594331508/__spark_conf__8974819232904740162.zip -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/__spark_conf__.zip
2019-04-26 10:23:29 INFO  SecurityManager:54 - Changing view acls to: root
2019-04-26 10:23:29 INFO  SecurityManager:54 - Changing modify acls to: root
2019-04-26 10:23:29 INFO  SecurityManager:54 - Changing view acls groups to:
2019-04-26 10:23:29 INFO  SecurityManager:54 - Changing modify acls groups to:
2019-04-26 10:23:29 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
[I 190426 10:23:29 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:29 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:29 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:29.775 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:30 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:30 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:30 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:30.282 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:30 INFO  Client:54 - Submitting application application_1554949496955_0099 to ResourceManager
2019-04-26 10:23:30 INFO  YarnClientImpl:273 - Submitted application application_1554949496955_0099
2019-04-26 10:23:30 INFO  Client:54 - Application report for application_1554949496955_0099 (state: ACCEPTED)
[I 190426 10:23:30 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
2019-04-26 10:23:30 INFO  Client:54 -
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1556245410763
     final status: UNDEFINED
     tracking URL: http://Gondolin-Node-021:8188/proxy/application_1554949496955_0099/
     user: root
[D 190426 10:23:30 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:30 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[I 2019-04-26 10:23:30.790 EnterpriseGatewayApp] ApplicationID: 'application_1554949496955_0099' assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', state: ACCEPTED, 12.0 seconds after starting.
[I 190426 10:23:30 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099
2019-04-26 10:23:30 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-04-26 10:23:30 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-ddf95560-6e4f-457d-8517-b98c214b05f9
[D 190426 10:23:30 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:30 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099 HTTP/1.1" 200 None
[D 2019-04-26 10:23:30.795 EnterpriseGatewayApp] 24: State: 'ACCEPTED', Host: '', KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', ApplicationID: 'application_1554949496955_0099'
2019-04-26 10:23:30 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-46ee3bd0-c174-4f86-8a9b-0ea594331508
[I 190426 10:23:31 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099
[D 190426 10:23:31 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:31 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099 HTTP/1.1" 200 None
[D 2019-04-26 10:23:31.302 EnterpriseGatewayApp] 25: State: 'ACCEPTED', Host: 'Gondolin-Node-022', KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', ApplicationID: 'application_1554949496955_0099'
[D 2019-04-26 10:23:38.572 EnterpriseGatewayApp] Received Payload 'Xszuv3ZW7Do1ZwAoZ7Zq9j2FcDw2tGa5uwZRceqajIioAqu12Bfrs2nBq6/uHiOizN50vVyP4JaxPnUbs4dAb33RCYbJ7FA2An/UIAaMH59AAjiH77l2Dm7CZI+RcRGGyyxjapCtAW7C6wDVlEcbWyJpjpMIIZ/pcvtHKvUVMaOrdbYK9YK9py1nAlf/OLovKAmoBQVHCHUT2Tm+a6e7UITQCyLqLSESW/gZc5MyiAEJFxRiWVK5kK6lwcEnknuPIOxDf5EmPgs8afIcJbYCMLf2Mz9iyXDvReIYWWUUhU90/a2Ol2ZSK0m4YFhqagNNwC2j5rSJOwioF0DKbjkqw58AxnWQtGtvO0c7tfsDKZZK6O3bRnXCjm5oCd/wtZmKYZVCJrAtSHW18OIYXqdIKw=='
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Decrypted Payload '{"shell_port": 39267, "iopub_port": 45769, "stdin_port": 43742, "control_port": 46286, "hb_port": 49998, "ip": "0.0.0.0", "key": "b52f8dd7-bc1d-44a6-82c8-4b58c6a8a984", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": "78096", "pgid": "78016", "comm_port": 49269}'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Connect Info received from the launcher is as follows '{'shell_port': 39267, 'iopub_port': 45769, 'stdin_port': 43742, 'control_port': 46286, 'hb_port': 49998, 'ip': '0.0.0.0', 'key': 'b52f8dd7-bc1d-44a6-82c8-4b58c6a8a984', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': '78096', 'pgid': '78016', 'comm_port': 49269}'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Host assigned to the Kernel is: 'Gondolin-Node-022' '172.168.2.122'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Established gateway communication to: 172.168.2.122:49269 for KernelID '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9'
[D 2019-04-26 10:23:38.576 EnterpriseGatewayApp] Updated pid to: 78096
[D 2019-04-26 10:23:38.576 EnterpriseGatewayApp] Updated pgid to: 78016
[D 2019-04-26 10:23:38.582 EnterpriseGatewayApp] Received connection info for KernelID '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' from host 'Gondolin-Node-022': {'shell_port': 39267, 'iopub_port': 45769, 'stdin_port': 43742, 'control_port': 46286, 'hb_port': 49998, 'ip': '172.168.2.122', 'key': b'b52f8dd7-bc1d-44a6-82c8-4b58c6a8a984', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'comm_port': 49269}...
[D 2019-04-26 10:23:38.585 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.122:46286
[D 2019-04-26 10:23:38.588 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.122:45769
[I 2019-04-26 10:23:38.592 EnterpriseGatewayApp] Kernel started: 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9
[D 2019-04-26 10:23:38.592 EnterpriseGatewayApp] Kernel args: {'env': {'PATH': '/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/conda/bin:/opt/work/conda/condabin:/opt/work/hadoop-2.7.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk/bin', 'KERNEL_WORKING_DIR': '/home/s', 'KERNEL_USERNAME': 'root'}, 'kernel_name': 'spark_python_yarn_cluster'}
[I 190426 10:23:38 web:2246] 201 POST /api/kernels (127.0.0.1) 20231.43ms
[D 2019-04-26 10:23:38.785 EnterpriseGatewayApp] activity on 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9: status
[I 190426 10:23:41 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:41 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:41 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:44 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:44 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:44 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:47 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:47 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:47 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:50 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:50 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:50 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:53 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:53 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:53 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:56 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:56 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:56 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:59 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:59 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:59 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:02 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:02 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:02 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:05 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:05 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:05 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:08 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:08 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:08 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:11 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:11 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:11 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:14 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:14 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:14 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:17 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:17 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:17 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:20 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:20 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:20 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 2019-04-26 10:24:23.140 EnterpriseGatewayApp] Interrupted...
[D 2019-04-26 10:24:23.141 EnterpriseGatewayApp] Clearing buffer for 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9
[D 2019-04-26 10:24:23.143 EnterpriseGatewayApp] RemoteKernelManager.signal_kernel(9)
[D 2019-04-26 10:24:23.143 EnterpriseGatewayApp] YarnClusterProcessProxy.send_signal 9
[I 190426 10:24:23 base:50] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:23 connectionpool:393] http://gondolin-node-021:8188 "PUT /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 202 None
[W 2019-04-26 10:24:23.152 EnterpriseGatewayApp] Termination of application 'application_1554949496955_0099' failed with exception: 'Response finished with status: 202. Details: {"state":"RUNNING"}'.  Continuing...
[D 2019-04-26 10:24:23.152 EnterpriseGatewayApp] YarnClusterProcessProxy.kill: kill_app_by_id(application_1554949496955_0099) response: None, confirming app state is not RUNNING
[I 190426 10:24:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[D 2019-04-26 10:24:23.158 EnterpriseGatewayApp] Sending signal: 15 to target: -78016 on host: 172.168.2.122
/opt/work/conda/lib/python3.6/site-packages/paramiko/kex_ecdh_nist.py:39: CryptographyDeprecationWarning: encode_point has been deprecated on EllipticCurvePublicNumbers and will be removed in a future version. Please use EllipticCurvePublicKey.public_bytes to obtain both compressed and uncompressed point encoding.
  m.add_string(self.Q_C.public_numbers().encode_point())
/opt/work/conda/lib/python3.6/site-packages/paramiko/kex_ecdh_nist.py:96: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.curve, Q_S_bytes
/opt/work/conda/lib/python3.6/site-packages/paramiko/kex_ecdh_nist.py:111: CryptographyDeprecationWarning: encode_point has been deprecated on EllipticCurvePublicNumbers and will be removed in a future version. Please use EllipticCurvePublicKey.public_bytes to obtain both compressed and uncompressed point encoding.
  hm.add_string(self.Q_C.public_numbers().encode_point())
[E 2019-04-26 10:24:23.218 EnterpriseGatewayApp] Failed to authenticate SSHClient with password-less SSH
[W 2019-04-26 10:24:23.218 EnterpriseGatewayApp] Remote signal(15) to '-78016' on host '172.168.2.122' failed with exception 'HTTP 403: Failed to authenticate SSHClient with password-less SSH'.
[D 2019-04-26 10:24:23.218 EnterpriseGatewayApp] SIGTERM signal sent to pid: 78096
[I 190426 10:24:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[D 2019-04-26 10:24:23.223 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1554949496955_0099, kernel ID: 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9, state: KILLED
[I 190426 10:24:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 2019-04-26 10:24:23.229 EnterpriseGatewayApp] Kernel shutdown: 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9
Litchilitchy commented 5 years ago

By the way, I set KERNEL_LAUNCH_TIMEOUT=300000 and it still times out when I upload the big jar file. Nothing seems to have changed.

kevin-bates commented 5 years ago

Thank you for the log file - it was helpful.

First, regarding your last comment, I have some questions.

  1. Do you encounter these issues when NOT uploading large jar files?
  2. How many yarn-cluster kernels are you able to run at a given time?
  3. Is the cluster shared with other applications?

As for the log file, things are looking completely normal. The kernel was started and had entered the ACCEPTED state on a given node...

D 2019-04-26 10:23:18.510 EnterpriseGatewayApp] Launching kernel: Spark - Python (YARN Cluster Mode) with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json', '--RemoteProcessProxy.response-address', '172.168.2.121:53670', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
...
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Connect Info received from the launcher is as follows '{'shell_port':}'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Host assigned to the Kernel is: 'Gondolin-Node-022' '172.168.2.122'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Established gateway communication to: 172.168.2.122:49269 for KernelID 
[D 2019-04-26 10:23:31.302 EnterpriseGatewayApp] 25: State: 'ACCEPTED', Host: 'Gondolin-Node-022', KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', ApplicationID: 'application_1554949496955_0099'

All in all, that sequence was 20 seconds (and within the default 30 seconds since I don't see an indication that KERNEL_LAUNCH_TIMEOUT is set).

Then, at the 1 minute 5 second mark, Enterprise Gateway is interrupted...

[I 2019-04-26 10:24:23.140 EnterpriseGatewayApp] Interrupted...

which triggers the kill sequence on the kernel. What you see is all the fallback code to terminate a kernel quickly - like when shutting down the EG instance.

More Questions. :smile:

  1. Is there someone or something interrupting EG with SIGINT or ctrl-c about a minute after the kernel startup?
  2. Have you analyzed the YARN logs (stdout and stderr in particular) for application ID application_1554949496955_0099? I suspect that may indicate why the application never got started to where it could respond to the client.
  3. How long after the attempted start does it take for the Kernel Error message to appear in the notebook? I'd like to determine whether it's shortly after the 20 second mark or so.

Thank you.

Litchilitchy commented 5 years ago

@kevin-bates To answer your concerns,

  1. I do not encounter this issue when the jar file is not uploaded. (Without the upload, the timeout error only occurs when we try to start kernels at the same time, which is due to the implementation as you explained before. So overall the answer is: I do not.)

  2. I found we have about 130 vcores and each notebook takes 2 (so 10 notebooks should not be a problem, I think, and this error occurs even when we start the first notebook)

  3. There are no other applications sharing it, so I think we have enough resources

  4. I interrupted the kernel after the connection error was raised and the kernel error appeared in the notebook front end. (The gateway still showed many logs after the error, so I interrupted it; I don't think this affects the error log.)

  5. Not yet, will keep up with it later.

  6. About 20 seconds (which seems to be your default value?). I set those parameters as system environment variables, but that does not seem to work. (I found something like os.getenv(...) in the nb2kg source code.)

kevin-bates commented 5 years ago

Thank you for the responses.

So it definitely sounds like this is purely related to the large file uploads (item 1). Let's not worry about the resources at the moment (items 2 and 3) especially if you see this on the first notebook (when large jars are uploaded).

Please do not interrupt the kernel, and especially do not interrupt EG. The interrupt of EG is what terminates the kernel. Yes, it is running, but probably not entirely started.

The default launch timeout is 30 seconds, not 20 seconds, so there may be something else here, but it's not evident in the logs. Please set the env variables KG_REQUEST_TIMEOUT=130 and KERNEL_LAUNCH_TIMEOUT=120 BEFORE launching Jupyter Notebook/NB2KG. The values are in seconds, so a value of 300000 (5000 minutes!) is far too long. If set correctly and propagated to the server, you should see them in this EG log entry (BaseProcessProxy.launch_process() env)....

[D 2019-04-25 20:57:09.771 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/opt/anaconda2/envs/py3/bin:/opt/anaconda2/envs/py3/bin:/opt/anaconda2/condabin:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.8.0_201-amd64/bin:/usr/java/jdk1.8.0_201-amd64/jre/bin:/opt/anaconda2/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/gateway/.local/bin:/home/gateway/bin', 'KERNEL_LAUNCH_TIMEOUT': '40', 'KERNEL_USERNAME': 'alice', ...

It is critical that you check the YARN logs for the appropriate application ID (which you can also obtain from the EG log once the host has been assigned; from above: State: 'ACCEPTED', Host: 'Gondolin-Node-022'). They often contain useful information. You should check all four logs, especially in relation to the loading of the file, which I assume is: file:/tmp/spark-46ee3bd0-c174-4f86-8a9b-0ea594331508/__spark_conf__8974819232904740162.zip

Also, please provide the release of EG you're running. This will be posted in the startup message in the EG log - like:

[I 2019-04-23 15:00:51.828 EnterpriseGatewayApp] Jupyter Enterprise Gateway 2.1.0.dev2 is available at http://0.0.0.0:8888

Thanks.

kevin-bates commented 5 years ago

It just dawned on me what is happening. The default request timeout - enforced by tornado http requests - is 20 seconds! As a result, your KG_REQUEST_TIMEOUT value is not getting picked up in your environment.

Here's more log analysis that reminded me this must be happening. Your POST call - which starts the kernel, is taking just over 20 seconds...

[I 190426 10:23:38 web:2246] 201 POST /api/kernels (127.0.0.1) 20231.43ms

However, nothing happens after that - because the request timeout was met. What you should see, is a sequence like this, following the POST entry (this startup took about 11.6 secs)...

[I 190423 15:03:58 web:2162] 201 POST /api/kernels (9.108.92.21) 11605.17ms
[I 190423 15:03:58 web:2162] 200 GET /api/kernels/05f83018-d046-412b-95ba-9172d25645b4 (9.108.92.21) 1.68ms
[D 2019-04-23 15:03:58.245 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/05f83018-d046-412b-95ba-9172d25645b4/channels
[W 2019-04-23 15:03:58.247 EnterpriseGatewayApp] No session ID specified
[D 2019-04-23 15:03:58.248 EnterpriseGatewayApp] Requesting kernel info from 05f83018-d046-412b-95ba-9172d25645b4

Notice the subsequent GET, followed by websocket initialization and, finally, a request for the kernel info.

However, on your system, the 20 seconds for the http request had expired, so the client stopped moving forward with websocket initialization.

Since you're running in jupyter hub, setting env variables for your notebook sessions is different. See the code that I linked to in this comment. Notice down a bit further at line 48 - the extraEnv: stanza. Add both KG_REQUEST_TIMEOUT=130 and KERNEL_LAUNCH_TIMEOUT=120 into that stanza. I have no idea if this requires a hub restart or not for subsequent notebook instances to pick up the changes.

Given the kernel startup is taking 20 seconds, I suspect values around 40 seconds would be more than sufficient, but let's go with 120 or so right now.
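For the plain (non-Hub) case, where the Notebook server is launched directly from a shell, the export-then-launch sequence is a simple sketch like the following; the launch command is commented out and its flags omitted here, since they depend on your setup:

```shell
# Set in the SAME shell that will launch the Notebook/NB2KG server (values in seconds)
export KG_REQUEST_TIMEOUT=130     # client-side HTTP request timeout enforced by tornado
export KERNEL_LAUNCH_TIMEOUT=120  # how long to wait for the remote kernel to start
# jupyter notebook ...            # launch from this same shell so the variables are inherited
```

With Hub, as noted above, the spawned server never sees your shell, which is why the values must instead travel through the spawner configuration.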

Litchilitchy commented 5 years ago

@kevin-bates It seems the comment you linked is about configuration in Kubernetes? If JupyterHub is just started from a terminal, would setting system environment variables work? If not, what does os.getenv(...) in nb2kg stand for?

(Currently we are trying to use kernel connecting to yarn cluster)

kevin-bates commented 5 years ago

Yes. For Jupyter hub, the env for the spawned notebook needs to be set up via config files, since there's no way to access the shell prior to the start of the notebook server. This is irrespective of kubernetes. You'll need to filter out the k8s stuff.

os.getenv() is used in programs to "get" env variables. The issue you need to resolve is the "put" or "set" of the variable. Normally "put" is fine in the shell via 'export'.
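The "get" versus "set" distinction can be shown in a few lines of Python (a standalone sketch mirroring the os.getenv(...) pattern, not nb2kg's actual code; the 20.0 default here is illustrative):

```python
import os

# "get": a program reads an env variable with a fallback default --
# this is the os.getenv(...) pattern seen in nb2kg
timeout = float(os.getenv('KG_REQUEST_TIMEOUT', 20.0))

# "set": assigning into os.environ affects only this process and its
# children; it cannot reach back into an already-running parent (e.g. the Hub)
os.environ['KG_REQUEST_TIMEOUT'] = '130'
print(float(os.getenv('KG_REQUEST_TIMEOUT', 20.0)))  # 130.0
```

This is why an export in one terminal only helps processes launched from that same shell.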

Fyi, I'm away from my computer today.

Litchilitchy commented 5 years ago

@kevin-bates Thanks for your help, I am just back from vacation.

Yes, as you said, I just set the env variables via 'export', but I am not sure whether it works. For example, exporting HADOOP_HOME in the shell does not work for Spark config; we need to write it in spark-env.sh.

Could it be a similar issue with the Spark config? If so, how should this env variable be set?

kevin-bates commented 5 years ago

Now that you're talking about spark config, you should probably focus on the env: stanza of your kernelspec. Take a look at the kernelspecs EG provides - here's one for python yarn clusters. In particular the SPARK_OPTS entry. If the variable needs to be present in the spark driver itself, then you need to look at the spark.yarn.appMasterEnv option.

Based on your latest response, I can't determine if you're still experiencing timeouts or have moved forward since you're now talking about other variables.

I think you need to ensure you can get basic envs set via the jupyter hub configurations, then look into the kernelspec (kernel.json) files for system-specific tuning you need to do.

I'm not sure how much more I can help you without specific details or an online meeting.

cc: @lresende

Litchilitchy commented 5 years ago

@kevin-bates firstly, I still have the timeout problem; what I mentioned above is that I have not found a way to set the env variables correctly.

So, according to what you said above, the first thing I can do is set the env variables via Python code in jupyterhub_config.py (if I understand your idea correctly). I will try this first.

Litchilitchy commented 5 years ago

@kevin-bates sorry for late response, here is the latest updates.

Before that, I would explain my note above: "I just set the env variables via 'export', I am not sure if it works." This means I had set the env variables in the terminal; they could be seen via "echo" there, but I am not sure whether they still exist (or are valid) in the jupyterhub session. According to the result, this setting does not work. (The kernel error still appears about 20 seconds after the new notebook starts, which means we failed to change the timeout.)

Then, I tried, use

import os
os.environ["KG_REQUEST_TIMEOUT"]="120"
os.environ["KERNEL_LAUNCH_TIMEOUT"]="120"

in both jupyterhub_config.py and jupyter_enterprise_gateway_config.py,

As you say, the unit is seconds. But the old problem remains: the error is still raised after about 20 seconds. (I also added print(os.getenv("KG_REQUEST_TIMEOUT", "not-set")) in jupyter_enterprise_gateway_config.py to check whether the env variables are set, but there was no output while the gateway started; I am not sure whether the config file is run as a Python script when the gateway starts?)

In brief, my point is: how do I set this parameter? You have provided a way to set it through k8s, but there should be a more fundamental way (setting it without going through any containers), shouldn't there? Otherwise Enterprise Gateway would not work without k8s, which would be weird.

Litchilitchy commented 5 years ago

@kevin-bates I think we have found a temporary solution: modify the installed nb2kg code in site-packages to

KG_CONNECT_TIMEOUT = float(os.getenv('KG_CONNECT_TIMEOUT', 60.0))

And we do not encounter the timeout error now. However, this is not a proper solution, so we are still waiting for your update.

kevin-bates commented 5 years ago

First, thank you for your patience on this. Setting env variables is quite common, just that since the Notebook instance in this case is not spawned from the shell, it's slightly more complicated.

Modifying the variable's default value in the code isn't sufficient (or necessary), although that experiment confirms the hypothesis that the runtime environment of Notebook is the issue. I'm a little surprised KG_CONNECT_TIMEOUT required adjustment rather than just KG_REQUEST_TIMEOUT. That said, the two are a pair so let's focus on getting both set in your spawned Notebook instance.
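That hypothesis can be demonstrated in isolation. Below is a simplified stand-in for the nb2kg line Litchilitchy modified (nb2kg actually evaluates it once at import time, and the 20.0 default here is illustrative): when the env var is present in the process, the hard-coded default never matters, so no source edit is needed.

```python
import os

def kg_connect_timeout():
    # stand-in for: KG_CONNECT_TIMEOUT = float(os.getenv('KG_CONNECT_TIMEOUT', <default>))
    return float(os.getenv('KG_CONNECT_TIMEOUT', 20.0))

os.environ.pop('KG_CONNECT_TIMEOUT', None)
assert kg_connect_timeout() == 20.0   # no env var -> hard-coded default is used

os.environ['KG_CONNECT_TIMEOUT'] = '60'
assert kg_connect_timeout() == 60.0   # env var set -> default irrelevant, no code edit
```

The whole problem therefore reduces to getting the variable into the environment of the spawned Notebook process.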

Could you please provide the following:

  1. The jupyter configuration files you're currently using for both Hub and EG?
  2. The log files (output) from both Hub and EG? The Hub log should also cover the period in which the Notebook server was launched, as well as the attempt to use the kernel when the timeout occurs.

I can't determine what exactly you've set and, unfortunately, I don't have a Hub installation to try this with. Perhaps someone can spot what needs to be done.

Looking around, I see this issue that may provide useful information: https://github.com/jupyterhub/jupyterhub/issues/330 Looks like you could set the KG TIMEOUT values in the env of JupyterHub prior to its launch, then use:

c.Spawner.env_keep.append('KG_REQUEST_TIMEOUT')
c.Spawner.env_keep.append('KG_CONNECT_TIMEOUT')

Here's the doc for env_keep. You could also specify "fresh" values (i.e., these are set when the Notebook instance is launched and are not required to be in the Hub environment) using c.Spawner.environment.
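Putting the two approaches side by side, the relevant jupyterhub_config.py fragment might look like the following sketch (the timeout values are the ones suggested earlier in this thread, not defaults; pick one option rather than both):

```python
# jupyterhub_config.py (fragment) - a sketch, adapt values as needed

# Option A: inject "fresh" values directly into each spawned Notebook's environment
c.Spawner.environment = {
    'KG_REQUEST_TIMEOUT': '130',
    'KG_CONNECT_TIMEOUT': '60',
}

# Option B: carry variables over from the Hub's own environment
# (export them in the shell before launching jupyterhub)
c.Spawner.env_keep.append('KG_REQUEST_TIMEOUT')
c.Spawner.env_keep.append('KG_CONNECT_TIMEOUT')
```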

Another thing you can do is just post a simple question on the JupyterHub forum ... How do I set an environment variable for my spawned Notebook instance to use?

Litchilitchy commented 5 years ago

Thanks for still following this issue @kevin-bates , actually the 2 things you want:

  1. Config files: we do not set anything related to timeouts in the Hub or EG config files
  2. We are keeping this workaround as a backup, so maybe I will reproduce this issue a few days later

But after all, you are talking about c.Spawner.environment, which seems different from the system env variables (set via 'export' in the terminal). So does jupyterhub own a new session, holding its own environment, and what nb2kg gets comes from the jupyterhub session? Am I right?

kevin-bates commented 5 years ago

I believe we're getting very close.

Yes, c.Spawner.environment is different. The reason for that is because Hub needs to create the environment for Notebook prior to Notebook's launch. If you use c.Spawner.environment, Hub will add those entries into the spawned Notebook's environment - thereby making KG_*_TIMEOUT available when Notebook establishes a connection to Enterprise Gateway.

The other approach is to set the environment variables using export in the shell that invoked Jupyter Hub, then, in the hub configuration file, use c.Spawner.env_keep. In this case, the kept environment variables are transferred to the Notebook's environment. Either way, use of the jupyter hub configuration file is required.

I'm curious how you're getting the KG_URL value set for Notebook - since that's how it knows where to direct the kernel management requests to Enterprise Gateway. How is KG_URL set?

I'm adding two Hub experts - perhaps they have information I have not provided or can confirm what I've stated.

cc: @minrk @consideRatio - tl;dr this is regarding the setting of environment variables into the spawned Notebook instance from a vanilla (non-k8s) Hub. Thank you.

Litchilitchy commented 5 years ago

Thanks, I think I almost understand.

In terms of KG_URL, I found that I set it via export KG_URL=http://172.168.2.165:8888 (172.168.2.165 is the machine we are using, so I think 0.0.0.0:8888 is the same?)

But!!! I did this after I started EG, which means I started EG without setting KG_URL and the port was 8888. So currently, I do not find any evidence that we must set this value (perhaps the default value is 0.0.0.0:8888?)

kevin-bates commented 5 years ago

KG_URL is not used by EG. It is used by the Notebook server that is launched by Jupyter Hub. This is how the spawned Notebook server knows where to send its kernel management operations to. To repeat, KG_URL is purely used by the EG client - which is a Notebook server configured with the NB2KG extension.

An ip of 0.0.0.0 says: use whatever the appropriate address is to listen for requests. In this case, that address is 172.168.2.165. It's essentially a meta-value, kind of like localhost.

Port 8888 is the default port number. It can be configured differently as well.

So, if you set KG_URL only using export, then I assume you set that value PRIOR to launching Jupyter Hub, and Hub must transfer its environment to the spawned Notebook servers - although that would contradict what c.Spawner.env_keep is meant for. (Perhaps the env is not kept if env_keep is configured - I don't know.) However, given your Notebook server is able to locate EG, you should be able to set the KG_*_TIMEOUT values in the same manner.

Once you get things going that way, you should transition to using configuration files.

Litchilitchy commented 5 years ago

@kevin-bates Your concern makes sense; here is the order of configuration:

  1. start EG

  2. install nb2kg using pip and jupyter serverextension enable --py nb2kg --sys-prefix

  3. (which is worth concerning) set env variables

    export KG_URL=... (this works, according to what you said? And yes, this is set before launching jupyterhub)
    export KG_REQUEST_TIMEOUT=... (this does not work, as we discussed before)

  4. launch jupyter notebook or jupyterhub

And I do not think it makes sense that KG_URL works but KG_REQUEST_TIMEOUT does not. So I ran export KG_URL= to unset the variable, and EG and JupyterHub still work, which suggests that KG_URL may be unnecessary?

Currently, the only connection I can find between the notebook/JupyterHub and the kernel (KG or EG) is the --NotebookApp.kernel_manager... arguments, and those only connect to nb2kg. I have not yet found how nb2kg connects to EG.

Anyway, this vanilla Jupyter Notebook and JupyterHub solution finally works, congratulations!

kevin-bates commented 5 years ago

@Litchilitchy - I have some good news for you - if you're still interested. :smiley: (I apologize for the inconvenience this caused and appreciate your patience!)

I believe the reason KG_URL appears to work (even when not set) is because its default value is http://127.0.0.1:8888 and I believe your EG is located on the same server as your jupyterhub.
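A quick way to see which gateway URL a Notebook server would end up using is to resolve the environment variable with the same localhost fallback mentioned above (a sketch, not NB2KG's literal code - see nb2kg/managers.py for the actual lookup):

```python
import os

# Simulate an environment where KG_URL was never exported.
os.environ.pop('KG_URL', None)

# With KG_URL unset, the localhost default applies, which is why a
# same-host EG "just works" even without any explicit configuration.
kg_url = os.getenv('KG_URL', 'http://127.0.0.1:8888')
print(kg_url)  # -> http://127.0.0.1:8888
```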

I decided to install jupyterhub and get this working (I'm sorry I didn't do this sooner!). Here's what I did...

  1. Confirmed that Notebook w/ NB2KG worked against Enterprise Gateway with KG_URL, KG_REQUEST_TIMEOUT values set using export.
  2. Installed jupyterhub: pip install jupyterhub. Version 1.0.0 was just released so you might want to make sure you're running that version.
  3. Generate a configuration file for JupyterHub: jupyterhub --generate-config. This produces a file named jupyterhub_config.py. All options are commented out, but it's great to see what can be set.
  4. Edit jupyterhub_config.py with the following. I added mine at the top of the file.
from jupyterhub.spawner import LocalProcessSpawner

class NB2KGProcessSpawner(LocalProcessSpawner):
    # Propagate the hub user's name so EG knows who is launching kernels.
    def user_env(self, env):
        env['KERNEL_USERNAME'] = self.user.name
        return env

c.JupyterHub.spawner_class = NB2KGProcessSpawner

# Command-line options that point each spawned Notebook server at NB2KG's managers.
c.Spawner.args = [
    '--NotebookApp.session_manager_class=nb2kg.managers.SessionManager',
    '--NotebookApp.kernel_manager_class=nb2kg.managers.RemoteKernelManager',
    '--NotebookApp.kernel_spec_manager_class=nb2kg.managers.RemoteKernelSpecManager',
]
# Environment for each spawned Notebook server: where EG lives and how long to wait.
c.Spawner.environment = {
    'KG_URL': 'http://my-gateway-server:8888',
    'KG_CONNECT_TIMEOUT': '60',
    'KG_REQUEST_TIMEOUT': '60',
}

This uses a custom spawner class in order to set the KERNEL_USERNAME environment variable to the current hub user that is launching the Notebook instance.

c.Spawner.args are the command-line values that you'd normally invoke Notebook with when the NB2KG extension is used. c.Spawner.environment holds the other KG_ env values. You could also set these similarly to how KERNEL_USERNAME is set in the custom spawner, but I used both for example purposes.
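For example, if you prefer to keep everything in the custom spawner rather than in c.Spawner.environment, the same user_env override could set the KG_ values as well. This is a sketch only - the gateway URL and timeouts are placeholders to adjust for your deployment:

```python
from jupyterhub.spawner import LocalProcessSpawner

class NB2KGProcessSpawner(LocalProcessSpawner):
    def user_env(self, env):
        env['KERNEL_USERNAME'] = self.user.name
        # Placeholder values; point these at your EG server.
        env['KG_URL'] = 'http://my-gateway-server:8888'
        env['KG_CONNECT_TIMEOUT'] = '60'
        env['KG_REQUEST_TIMEOUT'] = '60'
        return env
```

With this variant, c.Spawner.environment can be dropped entirely, since every spawned Notebook server receives the values from user_env.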

Once the configuration is set, start jupyterhub. You should then be able to hit your EG server for kernels.

kevin-bates commented 5 years ago

I've also gone ahead and increased the KG timeout defaults to 60 seconds: https://github.com/jupyter/nb2kg/pull/35

Litchilitchy commented 5 years ago

@kevin-bates Thanks for your patience! Great, finally all of this makes sense: JupyterHub owns its own environment, and KG_URL has a default value.

I think this issue could be closed. Besides, may I just have another few questions?

  1. As you said, we currently put JupyterHub and EG on the same server. (The most expensive parts are the kernels and Spark jobs, which run on the YARN worker nodes in YARN cluster mode, so JupyterHub and EG should not cost much. We currently put them on the YARN master machine and hope this does not have many drawbacks?)

  2. In the current deployment, EG runs on the YARN master and submits jobs to YARN. I think it should also work if EG does not run on the YARN master?

  3. Do kernels run on the YARN worker nodes (as the driver node)? If so, which part of EG makes that happen? And the SparkContext lives in the kernel, am I right?

kevin-bates commented 5 years ago

Please feel free to ask questions any time - either here, at the Enterprise Gateway repo, gitter room, or the discourse forum!

  1. We recommend EG run on the YARN master, but it's not required. EG uses very little resources.
  2. Correct. You must have the Spark client installed where EG runs, since spark-submit is used to launch the Spark job in YARN cluster mode.
  3. When a yarn-cluster configured kernel is launched, the Yarn "application" is the kernel launcher which essentially embeds the appropriate kernel (IPython, R, Toree). As you know, spark-submit transfers the necessary files to the worker node where the launcher/kernel run. This process becomes the Spark driver. Once launched, EG uses its pluggable process proxy architecture to discover where the kernel "landed" across the worker nodes. The kernel launcher/wrapper sends the connection information back to EG and the EG process proxy then manages the lifecycle of the kernel using the Yarn REST API.

For the Python and R launchers, the launcher creates the SparkContext on behalf of the kernel. Since they're the same process, the context is available during execution of each cell. For the Scala kernels, the embedded Apache Toree kernel already has code to create the SparkContext. The other job of the launchers is to listen for interrupt/shutdown operations.

I hope that helps.

As suggested, I'm going to close this issue.