Litchilitchy closed this issue 5 years ago
I guess the 2) issue may be due to a timeout, because we need some time to upload a big file. However, I set the sys env variable `KG_REQUEST_TIMEOUT` to a very large value (300000), but it doesn't seem to work.
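For reference, the variable is picked up from the environment of the process that issues the kernel-start request (the notebook server, not the gateway), so it would be set along these lines (the value here is hypothetical):

```shell
# Export in the environment of the client process (e.g. Jupyter Notebook with NB2KG)
# before it is started; the underlying HTTP request classes read it at request time.
export KG_REQUEST_TIMEOUT=300
```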
There are a number of reasons you can experience timeouts during kernel startup, especially when remote kernels are in play via Enterprise Gateway. To better diagnose the nature of these issues, please provide the complete log file (with DEBUG enabled) from the Enterprise Gateway instance. The reason we'd like the entire file is to capture the complete request, from start through error.
Here are a few items worth noting...

- Although `KG_REQUEST_TIMEOUT` should be increased in these kinds of scenarios, it is used by the underlying HTTP classes; Enterprise Gateway uses `KERNEL_LAUNCH_TIMEOUT` as its means of determining when to "give up" on monitoring whether the remote kernel has started. It is recommended that `KERNEL_LAUNCH_TIMEOUT` be set to `KG_REQUEST_TIMEOUT`, although, technically, it should be slightly less because you don't want the request to time out prior to the kernel launch. The default for `KERNEL_LAUNCH_TIMEOUT` is 30 seconds, so if you're seeing your timeout error in roughly 30 seconds, that's why. (A valid addition for NB2KG would be to set `KERNEL_LAUNCH_TIMEOUT` to `KG_REQUEST_TIMEOUT` iff the former is not already set in the env and the latter is.) Just remember that very large timeouts take away from the startup of subsequent kernels during that time, until we can get the async PRs merged. 🙁
- If the cluster is short on resources, you can adjust `SPARK_OPTS` to use less resources.
- These request timeouts typically surface to the client as 500 error codes.

At any rate, if you can post your EG log (with DEBUG) we can take a closer look, although setting `KERNEL_LAUNCH_TIMEOUT` may get you moving forward.
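The suggested NB2KG fallback above can be sketched as follows. This is a minimal illustration of the rule, not NB2KG's actual code; the function name and the 2-second margin are assumptions:

```python
def default_launch_timeout(env: dict) -> dict:
    """Sketch: if KERNEL_LAUNCH_TIMEOUT is not set but KG_REQUEST_TIMEOUT is,
    derive the former from the latter, keeping it slightly smaller so the
    HTTP request doesn't time out before the kernel launch is given up on."""
    if "KERNEL_LAUNCH_TIMEOUT" not in env and "KG_REQUEST_TIMEOUT" in env:
        request_timeout = int(env["KG_REQUEST_TIMEOUT"])
        # Leave a 2-second margin (assumed value), never dropping below 1 second.
        env["KERNEL_LAUNCH_TIMEOUT"] = str(max(request_timeout - 2, 1))
    return env
```

An explicitly set `KERNEL_LAUNCH_TIMEOUT` is left untouched; only the missing-variable case is filled in.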
The full log:
[D 2019-04-26 10:23:05.617 EnterpriseGatewayApp] Searching ['/opt/work', '/root/.jupyter', '/opt/work/conda/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'] for config files
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /etc/jupyter
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /usr/local/etc/jupyter
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /opt/work/conda/etc/jupyter
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /root/.jupyter
[D 2019-04-26 10:23:05.618 EnterpriseGatewayApp] Looking for jupyter_config in /opt/work
[D 2019-04-26 10:23:05.619 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /etc/jupyter
[D 2019-04-26 10:23:05.620 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /usr/local/etc/jupyter
[D 2019-04-26 10:23:05.620 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /opt/work/conda/etc/jupyter
[D 2019-04-26 10:23:05.620 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /root/.jupyter
[D 2019-04-26 10:23:05.620 EnterpriseGatewayApp] Loaded config file: /root/.jupyter/jupyter_enterprise_gateway_config.py
[D 2019-04-26 10:23:05.621 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /opt/work
[D 190426 10:23:05 selector_events:65] Using selector: EpollSelector
[I 2019-04-26 10:23:05.632 EnterpriseGatewayApp] Jupyter Enterprise Gateway 1.2.0 is available at http://0.0.0.0:8888
[D 2019-04-26 10:23:15.643 EnterpriseGatewayApp] Found kernel python3 in /opt/work/conda/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_r_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_python_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_python_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_r_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_scala_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_python_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_scala_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel python_distributed in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_r_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:15.644 EnterpriseGatewayApp] Found kernel spark_scala_yarn_cluster in /usr/local/share/jupyter/kernels
[I 190426 10:23:15 web:2246] 200 GET /api/kernelspecs (127.0.0.1) 1115.14ms
[D 2019-04-26 10:23:18.218 EnterpriseGatewayApp] Found kernel python3 in /opt/work/conda/share/jupyter/kernels
[D 2019-04-26 10:23:18.219 EnterpriseGatewayApp] Found kernel spark_r_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.219 EnterpriseGatewayApp] Found kernel spark_python_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.219 EnterpriseGatewayApp] Found kernel spark_python_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_r_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_scala_conductor_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_python_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_scala_yarn_client in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel python_distributed in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_r_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2019-04-26 10:23:18.220 EnterpriseGatewayApp] Found kernel spark_scala_yarn_cluster in /usr/local/share/jupyter/kernels
[I 190426 10:23:18 web:2246] 200 GET /api/kernelspecs (127.0.0.1) 23.02ms
[D 2019-04-26 10:23:18.365 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: spark_python_yarn_cluster, kernel_username: root
[D 2019-04-26 10:23:18.392 EnterpriseGatewayApp] Instantiating kernel 'Spark - Python (YARN Cluster Mode)' with process proxy: enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy
[D 2019-04-26 10:23:18.508 EnterpriseGatewayApp] Response socket launched on 172.168.2.121, port: 53670 using 300000.0s timeout
[D 2019-04-26 10:23:18.508 EnterpriseGatewayApp] YarnClusterProcessProxy shutdown wait time adjusted to 15.0 seconds.
[D 2019-04-26 10:23:18.509 EnterpriseGatewayApp] Starting kernel: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json', '--RemoteProcessProxy.response-address', '172.168.2.121:53670', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2019-04-26 10:23:18.510 EnterpriseGatewayApp] Launching kernel: Spark - Python (YARN Cluster Mode) with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json', '--RemoteProcessProxy.response-address', '172.168.2.121:53670', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2019-04-26 10:23:18.510 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/conda/bin:/opt/work/conda/condabin:/opt/work/hadoop-2.7.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk/bin', 'KERNEL_WORKING_DIR': '/home/s', 'KERNEL_USERNAME': 'root', 'SPARK_HOME': '/opt/work/spark-2.4.0-bin-hadoop2.7', 'PYSPARK_PYTHON': '/opt/work/conda/bin/python3', 'PYTHONPATH': '/opt/work/conda/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip', 'SPARK_OPTS': '--executor-cores 1 --jars /opt/work/lib/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar --master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/${KERNEL_USERNAME}/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:$PATH', 'LAUNCH_OPTS': '', 'KERNEL_GATEWAY': '1', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', 'EG_IMPERSONATION_ENABLED': 'False'}
[D 2019-04-26 10:23:18.515 EnterpriseGatewayApp] Yarn cluster kernel launched using YARN endpoint: http://Gondolin-Node-021:8188/ws/v1/cluster, pid: 93033, Kernel ID: 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json', '--RemoteProcessProxy.response-address', '172.168.2.121:53670', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'
Starting IPython kernel for Spark in Yarn Cluster mode on behalf of user root
+ eval exec /opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit '--executor-cores 1 --jars /opt/work/lib/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar --master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/${KERNEL_USERNAME}/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:$PATH' '' /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py '' /root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json --RemoteProcessProxy.response-address 172.168.2.121:53670 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
++ exec /opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --executor-cores 1 --jars /opt/work/lib/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar --master yarn --deploy-mode cluster --name 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/root/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/conda/bin:/opt/work/conda/condabin:/opt/work/hadoop-2.7.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk/bin /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py /root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json --RemoteProcessProxy.response-address 172.168.2.121:53670 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
[I 190426 10:23:19 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:19 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:19 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:19.091 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:19 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:19 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:19 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:19.602 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:20 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[I 190426 10:23:20 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:20 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:20 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:20.109 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:20 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:20 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:20 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:20.616 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:20 INFO RMProxy:98 - Connecting to ResourceManager at Gondolin-Node-021/172.168.2.121:8032
[I 190426 10:23:21 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:21 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:21 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:21.123 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:21 INFO Client:54 - Requesting a new application from cluster with 2 NodeManagers
2019-04-26 10:23:21 INFO Client:54 - Verifying our application has not requested more than the maximum memory capability of the cluster (200000 MB per container)
2019-04-26 10:23:21 INFO Client:54 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
2019-04-26 10:23:21 INFO Client:54 - Setting up container launch context for our AM
2019-04-26 10:23:21 INFO Client:54 - Setting up the launch environment for our AM container
2019-04-26 10:23:21 INFO Client:54 - Preparing resources for our AM container
2019-04-26 10:23:21 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[I 190426 10:23:21 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:21 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:21 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:21.630 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:22 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:22 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:22 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:22.137 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:22 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:22 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:22 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:22.654 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:23.162 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:23.673 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:24 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:24 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:24 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:24.181 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:24 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:24 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:24 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:24.688 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:24 INFO Client:54 - Uploading resource file:/tmp/spark-46ee3bd0-c174-4f86-8a9b-0ea594331508/__spark_libs__18863826034816602.zip -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/__spark_libs__18863826034816602.zip
[I 190426 10:23:25 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:25 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:25 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:25.195 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:25 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:25 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:25 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:25.706 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:26 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:26 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:26 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:26.215 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:26 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:26 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:26 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:26.723 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:27 INFO Client:54 - Uploading resource file:/opt/work/lib/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/analytics-zoo-bigdl_0.7.2-spark_2.4.0-0.5.0-SNAPSHOT-jar-with-dependencies.jar
[I 190426 10:23:27 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:27 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:27 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:27.230 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:27 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:27 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:27 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:27.740 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:28 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:28 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:28 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:28.249 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:28 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:28 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:28 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:28.757 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:29 INFO Client:54 - Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/launch_ipykernel.py
2019-04-26 10:23:29 INFO Client:54 - Uploading resource file:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/pyspark.zip
2019-04-26 10:23:29 INFO Client:54 - Uploading resource file:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/py4j-0.10.7-src.zip
[I 190426 10:23:29 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:29 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:29 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:29.264 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:29 INFO Client:54 - Uploading resource file:/tmp/spark-46ee3bd0-c174-4f86-8a9b-0ea594331508/__spark_conf__8974819232904740162.zip -> hdfs://Gondolin-Node-021:9000/user/root/.sparkStaging/application_1554949496955_0099/__spark_conf__.zip
2019-04-26 10:23:29 INFO SecurityManager:54 - Changing view acls to: root
2019-04-26 10:23:29 INFO SecurityManager:54 - Changing modify acls to: root
2019-04-26 10:23:29 INFO SecurityManager:54 - Changing view acls groups to:
2019-04-26 10:23:29 INFO SecurityManager:54 - Changing modify acls groups to:
2019-04-26 10:23:29 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
[I 190426 10:23:29 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:29 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:29 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:29.775 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
[I 190426 10:23:30 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
[D 190426 10:23:30 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:30 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[D 2019-04-26 10:23:30.282 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' - retrying...
2019-04-26 10:23:30 INFO Client:54 - Submitting application application_1554949496955_0099 to ResourceManager
2019-04-26 10:23:30 INFO YarnClientImpl:273 - Submitted application application_1554949496955_0099
2019-04-26 10:23:30 INFO Client:54 - Application report for application_1554949496955_0099 (state: ACCEPTED)
[I 190426 10:23:30 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps
2019-04-26 10:23:30 INFO Client:54 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1556245410763
final status: UNDEFINED
tracking URL: http://Gondolin-Node-021:8188/proxy/application_1554949496955_0099/
user: root
[D 190426 10:23:30 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:30 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps?startedTimeBegin=1556245398000 HTTP/1.1" 200 None
[I 2019-04-26 10:23:30.790 EnterpriseGatewayApp] ApplicationID: 'application_1554949496955_0099' assigned for KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', state: ACCEPTED, 12.0 seconds after starting.
[I 190426 10:23:30 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099
2019-04-26 10:23:30 INFO ShutdownHookManager:54 - Shutdown hook called
2019-04-26 10:23:30 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-ddf95560-6e4f-457d-8517-b98c214b05f9
[D 190426 10:23:30 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:30 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099 HTTP/1.1" 200 None
[D 2019-04-26 10:23:30.795 EnterpriseGatewayApp] 24: State: 'ACCEPTED', Host: '', KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', ApplicationID: 'application_1554949496955_0099'
2019-04-26 10:23:30 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-46ee3bd0-c174-4f86-8a9b-0ea594331508
[I 190426 10:23:31 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099
[D 190426 10:23:31 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:31 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099 HTTP/1.1" 200 None
[D 2019-04-26 10:23:31.302 EnterpriseGatewayApp] 25: State: 'ACCEPTED', Host: 'Gondolin-Node-022', KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', ApplicationID: 'application_1554949496955_0099'
[D 2019-04-26 10:23:38.572 EnterpriseGatewayApp] Received Payload 'Xszuv3ZW7Do1ZwAoZ7Zq9j2FcDw2tGa5uwZRceqajIioAqu12Bfrs2nBq6/uHiOizN50vVyP4JaxPnUbs4dAb33RCYbJ7FA2An/UIAaMH59AAjiH77l2Dm7CZI+RcRGGyyxjapCtAW7C6wDVlEcbWyJpjpMIIZ/pcvtHKvUVMaOrdbYK9YK9py1nAlf/OLovKAmoBQVHCHUT2Tm+a6e7UITQCyLqLSESW/gZc5MyiAEJFxRiWVK5kK6lwcEnknuPIOxDf5EmPgs8afIcJbYCMLf2Mz9iyXDvReIYWWUUhU90/a2Ol2ZSK0m4YFhqagNNwC2j5rSJOwioF0DKbjkqw58AxnWQtGtvO0c7tfsDKZZK6O3bRnXCjm5oCd/wtZmKYZVCJrAtSHW18OIYXqdIKw=='
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Decrypted Payload '{"shell_port": 39267, "iopub_port": 45769, "stdin_port": 43742, "control_port": 46286, "hb_port": 49998, "ip": "0.0.0.0", "key": "b52f8dd7-bc1d-44a6-82c8-4b58c6a8a984", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": "78096", "pgid": "78016", "comm_port": 49269}'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Connect Info received from the launcher is as follows '{'shell_port': 39267, 'iopub_port': 45769, 'stdin_port': 43742, 'control_port': 46286, 'hb_port': 49998, 'ip': '0.0.0.0', 'key': 'b52f8dd7-bc1d-44a6-82c8-4b58c6a8a984', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': '78096', 'pgid': '78016', 'comm_port': 49269}'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Host assigned to the Kernel is: 'Gondolin-Node-022' '172.168.2.122'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Established gateway communication to: 172.168.2.122:49269 for KernelID '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9'
[D 2019-04-26 10:23:38.576 EnterpriseGatewayApp] Updated pid to: 78096
[D 2019-04-26 10:23:38.576 EnterpriseGatewayApp] Updated pgid to: 78016
[D 2019-04-26 10:23:38.582 EnterpriseGatewayApp] Received connection info for KernelID '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9' from host 'Gondolin-Node-022': {'shell_port': 39267, 'iopub_port': 45769, 'stdin_port': 43742, 'control_port': 46286, 'hb_port': 49998, 'ip': '172.168.2.122', 'key': b'b52f8dd7-bc1d-44a6-82c8-4b58c6a8a984', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'comm_port': 49269}...
[D 2019-04-26 10:23:38.585 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.122:46286
[D 2019-04-26 10:23:38.588 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.122:45769
[I 2019-04-26 10:23:38.592 EnterpriseGatewayApp] Kernel started: 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9
[D 2019-04-26 10:23:38.592 EnterpriseGatewayApp] Kernel args: {'env': {'PATH': '/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/hadoop-2.7.2/bin:/opt/work/conda/bin:/opt/jdk1.8.0_152/bin:/opt/work/conda/bin:/opt/work/conda/bin:/opt/work/conda/condabin:/opt/work/hadoop-2.7.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk/bin', 'KERNEL_WORKING_DIR': '/home/s', 'KERNEL_USERNAME': 'root'}, 'kernel_name': 'spark_python_yarn_cluster'}
[I 190426 10:23:38 web:2246] 201 POST /api/kernels (127.0.0.1) 20231.43ms
[D 2019-04-26 10:23:38.785 EnterpriseGatewayApp] activity on 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9: status
[I 190426 10:23:41 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:41 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:41 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:44 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:44 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:44 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:47 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:47 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:47 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:50 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:50 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:50 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:53 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:53 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:53 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:56 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:56 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:56 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:23:59 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:23:59 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:23:59 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:02 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:02 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:02 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:05 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:05 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:05 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:08 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:08 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:08 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:11 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:11 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:11 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:14 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:14 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:14 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:17 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:17 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:17 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 190426 10:24:20 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:20 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:20 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 2019-04-26 10:24:23.140 EnterpriseGatewayApp] Interrupted...
[D 2019-04-26 10:24:23.141 EnterpriseGatewayApp] Clearing buffer for 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9
[D 2019-04-26 10:24:23.143 EnterpriseGatewayApp] RemoteKernelManager.signal_kernel(9)
[D 2019-04-26 10:24:23.143 EnterpriseGatewayApp] YarnClusterProcessProxy.send_signal 9
[I 190426 10:24:23 base:50] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:23 connectionpool:393] http://gondolin-node-021:8188 "PUT /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 202 None
[W 2019-04-26 10:24:23.152 EnterpriseGatewayApp] Termination of application 'application_1554949496955_0099' failed with exception: 'Response finished with status: 202. Details: {"state":"RUNNING"}'. Continuing...
[D 2019-04-26 10:24:23.152 EnterpriseGatewayApp] YarnClusterProcessProxy.kill: kill_app_by_id(application_1554949496955_0099) response: None, confirming app state is not RUNNING
[I 190426 10:24:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[D 2019-04-26 10:24:23.158 EnterpriseGatewayApp] Sending signal: 15 to target: -78016 on host: 172.168.2.122
/opt/work/conda/lib/python3.6/site-packages/paramiko/kex_ecdh_nist.py:39: CryptographyDeprecationWarning: encode_point has been deprecated on EllipticCurvePublicNumbers and will be removed in a future version. Please use EllipticCurvePublicKey.public_bytes to obtain both compressed and uncompressed point encoding.
m.add_string(self.Q_C.public_numbers().encode_point())
/opt/work/conda/lib/python3.6/site-packages/paramiko/kex_ecdh_nist.py:96: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
self.curve, Q_S_bytes
/opt/work/conda/lib/python3.6/site-packages/paramiko/kex_ecdh_nist.py:111: CryptographyDeprecationWarning: encode_point has been deprecated on EllipticCurvePublicNumbers and will be removed in a future version. Please use EllipticCurvePublicKey.public_bytes to obtain both compressed and uncompressed point encoding.
hm.add_string(self.Q_C.public_numbers().encode_point())
[E 2019-04-26 10:24:23.218 EnterpriseGatewayApp] Failed to authenticate SSHClient with password-less SSH
[W 2019-04-26 10:24:23.218 EnterpriseGatewayApp] Remote signal(15) to '-78016' on host '172.168.2.122' failed with exception 'HTTP 403: Failed to authenticate SSHClient with password-less SSH'.
[D 2019-04-26 10:24:23.218 EnterpriseGatewayApp] SIGTERM signal sent to pid: 78096
[I 190426 10:24:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[D 2019-04-26 10:24:23.223 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1554949496955_0099, kernel ID: 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9, state: KILLED
[I 190426 10:24:23 base:30] API Endpoint http://gondolin-node-021:8188/ws/v1/cluster/apps/application_1554949496955_0099/state
[D 190426 10:24:23 connectionpool:205] Starting new HTTP connection (1): gondolin-node-021:8188
[D 190426 10:24:23 connectionpool:393] http://gondolin-node-021:8188 "GET /ws/v1/cluster/apps/application_1554949496955_0099/state HTTP/1.1" 200 None
[I 2019-04-26 10:24:23.229 EnterpriseGatewayApp] Kernel shutdown: 2ce0e70f-1f67-4d82-8f5c-64965d2ccce9
By the way, I set KERNEL_LAUNCH_TIMEOUT=300000
and the timeout still occurs when I upload the big jar file. Nothing seems to have changed.
Thank you for the log file - it was helpful.
First, regarding your last comment, I have some questions.
As for the log file, things are looking completely normal. The kernel was started and had entered the ACCEPTED state on a given node...
[D 2019-04-26 10:23:18.510 EnterpriseGatewayApp] Launching kernel: Spark - Python (YARN Cluster Mode) with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/root/.local/share/jupyter/runtime/kernel-2ce0e70f-1f67-4d82-8f5c-64965d2ccce9.json', '--RemoteProcessProxy.response-address', '172.168.2.121:53670', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
...
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Connect Info received from the launcher is as follows '{'shell_port':}'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Host assigned to the Kernel is: 'Gondolin-Node-022' '172.168.2.122'
[D 2019-04-26 10:23:38.573 EnterpriseGatewayApp] Established gateway communication to: 172.168.2.122:49269 for KernelID
[D 2019-04-26 10:23:31.302 EnterpriseGatewayApp] 25: State: 'ACCEPTED', Host: 'Gondolin-Node-022', KernelID: '2ce0e70f-1f67-4d82-8f5c-64965d2ccce9', ApplicationID: 'application_1554949496955_0099'
All in all, that sequence took 20 seconds (and is within the default 30 seconds, since I don't see any indication that KERNEL_LAUNCH_TIMEOUT is set).
Then, at the 1 minute 5 second mark, Enterprise Gateway is interrupted...
[I 2019-04-26 10:24:23.140 EnterpriseGatewayApp] Interrupted...
which triggers the kill sequence on the kernel. What you see is all the fallback code to terminate a kernel quickly - like when shutting down the EG instance.
More questions. :smile:
Have you looked at the YARN logs for application_1554949496955_0099? I suspect they may indicate why the application never got started to the point where it could respond to the client. Thank you.
@kevin-bates To answer your concerns:
1. I do not encounter this issue when the jar file is not uploaded. (Without the upload, the timeout error only occurs when we try to start kernels simultaneously, which, as you explained before, is due to the current implementation. So overall the answer is: no, I do not.)
2. I found we have about 130 vcores and each notebook takes 2, so 10 notebooks should not be a problem, I think; and this error occurs even when we start the first notebook.
3. There are no other applications sharing the cluster, so I think we have enough resources.
4. I interrupted the kernel after the connection error was raised and the kernel error appeared in the notebook front end. (The gateway still shows many logs after the error, which is why I interrupted; I don't think this affects the error log.)
5. Not yet; I will follow up on it later.
6. About 20 seconds (which seems to be your default value?). I set those parameters via system environment variables, but that doesn't seem to work. I found something like os.getenv(...) in the nb2kg source code.
Thank you for the responses.
So it definitely sounds like this is purely related to the large file uploads (item 1). Let's not worry about the resources for the moment (items 2 and 3), especially since you see this on the first notebook (when large jars are uploaded).
Please do not interrupt the kernel, and especially do not interrupt EG. The interrupt of EG is what is terminating the kernel. Yes, it is running, but probably not entirely started.
The default launch timeout is 30 seconds, not 20 seconds, so there may be something else here, but it's not evident in the logs. Please set the envs KG_REQUEST_TIMEOUT=130 and KERNEL_LAUNCH_TIMEOUT=120 BEFORE launching Jupyter Notebook/NB2KG. The value is in seconds, so a value of 300000 (5000 minutes!) is a bit too long. If set correctly and propagated to the server, you should see it reflected in the BaseProcessProxy.launch_process() env entry of the EG log...
[D 2019-04-25 20:57:09.771 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/opt/anaconda2/envs/py3/bin:/opt/anaconda2/envs/py3/bin:/opt/anaconda2/condabin:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.8.0_201-amd64/bin:/usr/java/jdk1.8.0_201-amd64/jre/bin:/opt/anaconda2/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/gateway/.local/bin:/home/gateway/bin', 'KERNEL_LAUNCH_TIMEOUT': '40', 'KERNEL_USERNAME': 'alice', ...
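As a concrete sketch of the suggestion above (the values 130/120 are the ones proposed in this thread, not requirements), both variables would be exported in the shell before launching the Notebook/NB2KG client:

```shell
# Set both timeouts (in seconds) before launching the NB2KG client.
# KERNEL_LAUNCH_TIMEOUT is kept slightly below KG_REQUEST_TIMEOUT so the
# HTTP request doesn't expire before the kernel-launch monitoring does.
export KG_REQUEST_TIMEOUT=130
export KERNEL_LAUNCH_TIMEOUT=120
echo "KG_REQUEST_TIMEOUT=$KG_REQUEST_TIMEOUT"
echo "KERNEL_LAUNCH_TIMEOUT=$KERNEL_LAUNCH_TIMEOUT"
```

Note that, as discussed below, exporting in the shell is only sufficient when the Notebook server is launched directly from that shell; under JupyterHub the values must instead be placed into the spawned Notebook's environment.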
It is critical that you check the YARN logs for the appropriate application ID (which you can also obtain from the log file once the host has been assigned; from above: State: 'ACCEPTED', Host: 'Gondolin-Node-022'). They often contain useful information. You should check all four logs, especially in relation to the load of the large file, which I assume is: file:/tmp/spark-46ee3bd0-c174-4f86-8a9b-0ea594331508/__spark_conf__8974819232904740162.zip
Also, please provide the release of EG you're running. This will be posted in the startup message in the EG log - like:
[I 2019-04-23 15:00:51.828 EnterpriseGatewayApp] Jupyter Enterprise Gateway 2.1.0.dev2 is available at http://0.0.0.0:8888
Thanks.
It just dawned on me what is happening. The default request timeout, enforced by tornado http requests, is 20 seconds! As a result, your KG_REQUEST_TIMEOUT value is not getting picked up in your environment.
Here's more log analysis that reminded me this must be happening. Your POST call, which starts the kernel, is taking just over 20 seconds...
[I 190426 10:23:38 web:2246] 201 POST /api/kernels (127.0.0.1) 20231.43ms
However, nothing happens after that, because the request timeout was met. What you should see, following the POST entry, is a sequence like this (this startup took about 11.6 secs)...
[I 190423 15:03:58 web:2162] 201 POST /api/kernels (9.108.92.21) 11605.17ms
[I 190423 15:03:58 web:2162] 200 GET /api/kernels/05f83018-d046-412b-95ba-9172d25645b4 (9.108.92.21) 1.68ms
[D 2019-04-23 15:03:58.245 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/05f83018-d046-412b-95ba-9172d25645b4/channels
[W 2019-04-23 15:03:58.247 EnterpriseGatewayApp] No session ID specified
[D 2019-04-23 15:03:58.248 EnterpriseGatewayApp] Requesting kernel info from 05f83018-d046-412b-95ba-9172d25645b4
Notice the subsequent GET, followed by websocket initialization and, finally, a request for the kernel info.
However, on your system, the 20-second http request timeout had expired, so the client stopped moving forward with websocket initialization.
Since you're running in JupyterHub, setting env variables for your notebook sessions is different. See the code that I linked to in this comment. Notice, a bit further down at line 48, the extraEnv: stanza. Add both KG_REQUEST_TIMEOUT=130 and KERNEL_LAUNCH_TIMEOUT=120 into that stanza. I have no idea whether this requires a hub restart for subsequent notebook instances to pick up the changes.
Given the kernel startup is taking 20 seconds, I suspect values around 40 seconds would be more than sufficient, but let's go with 120 or so for now.
@kevin-bates It seems your linked comment is a Kubernetes configuration? If JupyterHub is just started from the terminal, would setting system environment variables work? If not, what does os.getenv(...) in the nb2kg source stand for?
(Currently we are trying to use a kernel connecting to a YARN cluster.)
Yes. For JupyterHub, the env for the spawned notebook needs to be set up via config files, since there's no way to access the shell prior to the start of the notebook server. This is irrespective of Kubernetes; you'll need to filter out the k8s-specific parts.
os.getenv() is used in programs to "get" env variables. The issue you need to resolve is the "put" (or "set") of the variable. Normally the "put" is done in the shell via 'export'.
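To illustrate the "get" side, here is a minimal sketch of the nb2kg-style pattern: os.getenv returns a default when the launcher never "put" the variable into the process environment. (The helper name and the 20.0 default are illustrative, not nb2kg's actual code.)

```python
import os

def resolve_timeout(name, default):
    """Return a timeout from the environment, or the default if unset."""
    return float(os.getenv(name, default))

# If the spawned Notebook process never received the variable, the
# default wins, which is exactly the symptom observed in this thread.
timeout = resolve_timeout('KG_REQUEST_TIMEOUT', 20.0)
```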
Fyi, I'm away from my computer today.
@kevin-bates Thanks for your help, I am just back from vacation.
Yes, as you said, I just set the env variables via 'export', but I am not sure that works. For example, export HADOOP_HOME
in the shell would not work for Spark configuration; we need to write it in spark-env.sh.
Could this be a similar issue with the Spark config? If so, how should this env variable be set?
Now that you're talking about Spark config, you should probably focus on the env: stanza of your kernelspec. Take a look at the kernelspecs EG provides - here's one for Python YARN clusters - in particular the SPARK_OPTS entry. If the variable needs to be present in the Spark driver itself, then you need to look at the spark.yarn.appMasterEnv option.
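As a sketch of that approach (the variable name and path are examples drawn from this thread, not a verified configuration), a kernelspec's SPARK_OPTS can carry a spark.yarn.appMasterEnv.* conf to surface an environment variable inside the YARN application master, which is the Spark driver in yarn-cluster mode:

```shell
# Illustrative SPARK_OPTS fragment: spark.yarn.appMasterEnv.<NAME> sets an
# environment variable inside the YARN application master. HADOOP_HOME and
# its path are placeholders taken from this discussion.
SPARK_OPTS="--master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.HADOOP_HOME=/opt/work/hadoop-2.7.2"
echo "$SPARK_OPTS"
```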
Based on your latest response, I can't determine if you're still experiencing timeouts or have moved forward since you're now talking about other variables.
I think you need to ensure you can get basic envs set via the jupyter hub configurations, then look into the kernelspec (kernel.json) files for system-specific tuning you need to do.
I'm not sure how much more I can help you without specific details or an online meeting.
cc: @lresende
@kevin-bates Firstly, I still have the timeout problem; what I mentioned above is that I have not found a way to set the env variables correctly.
So according to what you said above, the first thing I can do is set the env variables via Python code in jupyterhub_config.py
(if I understand your idea correctly). I will try this first.
@kevin-bates Sorry for the late response; here are the latest updates.
First, let me explain my earlier note, "I just set the env variables via 'export', I am not sure if it works": I had set the env variables in the terminal, and they could be seen via "echo" there, but I am not sure whether they still exist (or are valid) in the JupyterHub session. Judging by the result, this setting does not work: the kernel error still appears about 20 seconds after a new notebook starts, which means we failed to change the timeout.
Then I tried
import os
os.environ["KG_REQUEST_TIMEOUT"]="120"
os.environ["KERNEL_LAUNCH_TIMEOUT"]="120"
in both jupyterhub_config.py and jupyter_enterprise_gateway_config.py,
with seconds as the unit, as you said. But the old problem remains: the error is still raised after about 20 seconds. (I also added print(os.getenv("KG_REQUEST_TIMEOUT", "not-set")) in jupyter_enterprise_gateway_config.py to check whether the env variables are set, but there seems to be no output while the gateway starts; I am not sure whether the config file is run as a Python script when the gateway starts?)
In brief, my point is: how do I set this parameter? You have provided a way to set it through k8s, but there should also be a more fundamental way (setting it without going through any containers), shouldn't there? Otherwise Enterprise Gateway would not work without k8s, which would be odd.
@kevin-bates I think we have found a temporary solution: modify the installed nb2kg code (in site-packages) to
KG_CONNECT_TIMEOUT = float(os.getenv('KG_CONNECT_TIMEOUT', 60.0))
and we no longer encounter the timeout error. However, this cannot be a proper solution, so we are still waiting for your update.
First, thank you for your patience on this. Setting env variables is quite common; it's just that, since the Notebook instance in this case is not spawned from the shell, it's slightly more complicated.
Modifying the variable's default value in the code isn't sufficient (or necessary), although that experiment confirms the hypothesis that the runtime environment of Notebook is the issue. I'm a little surprised KG_CONNECT_TIMEOUT required adjustment rather than just KG_REQUEST_TIMEOUT. That said, the two are a pair, so let's focus on getting both set in your spawned Notebook instance.
Could you please provide the following:
I can't determine what exactly you've set and, unfortunately, I don't have a Hub installation to try this with. Perhaps someone can spot what needs to be done.
Looking around, I see this issue that may provide useful information: https://github.com/jupyterhub/jupyterhub/issues/330 It looks like you could set the KG timeout values in the env of JupyterHub prior to its launch, then use:
c.spawner.env_keep.append('KG_REQUEST_TIMEOUT')
c.spawner.env_keep.append('KG_CONNECT_TIMEOUT')
Here's the doc for env_keep. You could also specify "fresh" values (i.e., values that are set when the Notebook instance is launched and are not required to be in the Hub environment) using c.spawner.environment.
Another thing you can do is just post a simple question on the JupyterHub forum ... How do I set an environment variable for my spawned Notebook instance to use?
Thanks for still following this issue @kevin-bates. Actually, the 2 things you want:
But after all, you are talking about c.spawner.environment, which seems different from the system env variables (set via 'export' in the terminal). So is it that JupyterHub owns its own session and environment, and what nb2kg gets is the JupyterHub session's environment? Am I right?
I believe we're getting very close.
Yes, c.spawner.environment
is different. The reason for that is because Hub needs to create the environment for Notebook prior to Notebook's launch. If you use c.spawner.environment
, Hub will add those entries into the spawned Notebook's environment - thereby making KG_*_TIMEOUT
available when Notebook establishes a connection to Enterprise Gateway.
The other approach is to set the environment variables using export in the shell that invokes JupyterHub and then, in the hub configuration file, use c.spawner.env_keep. In this case, the kept environment variables are transferred to the Notebook's environment. In either case, use of the JupyterHub configuration file is required.
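A sketch of the two approaches in jupyterhub_config.py (the timeout values are the ones suggested in this thread; `c` is the configuration object JupyterHub injects when it loads this file):

```python
# jupyterhub_config.py (sketch; `c` is provided by JupyterHub at load time)

# Approach 1: fresh values, placed directly into each spawned Notebook's
# environment regardless of Hub's own environment.
c.Spawner.environment = {
    'KG_REQUEST_TIMEOUT': '120',
    'KG_CONNECT_TIMEOUT': '60',
}

# Approach 2: keep values that were exported in the shell before
# launching JupyterHub (e.g. `export KG_REQUEST_TIMEOUT=120`).
c.Spawner.env_keep.append('KG_REQUEST_TIMEOUT')
c.Spawner.env_keep.append('KG_CONNECT_TIMEOUT')
```

In practice you would pick one of the two approaches for a given variable.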
I'm curious how you're getting the KG_URL
value set for Notebook - since that's how it knows where to direct the kernel management requests to Enterprise Gateway. How is KG_URL
set?
I'm adding two Hub experts - perhaps they have information I have not provided or can confirm what I've stated.
cc: @minrk @consideRatio - tl;dr this is regarding the setting of environment variables into the spawned Notebook instance from a vanilla (non-k8s) Hub. Thank you.
Thanks, I think I almost understand.
In terms of KG_URL: I set it via export KG_URL=http://172.168.2.165:8888 (172.168.2.165 is the machine we are using, so I think 0.0.0.0:8888 is the same?).
But!!! I did this after I started EG. In other words, I started EG without setting KG_URL and the port was 8888, so currently I don't see any evidence that we must set this value (perhaps the default value is 0.0.0.0:8888?).
KG_URL is not used by EG. It is used by the Notebook server that is launched by JupyterHub; this is how the spawned Notebook server knows where to send its kernel management operations. To repeat, KG_URL is purely used by the EG client - which is a Notebook server configured with the NB2KG extension.
An ip of 0.0.0.0 says: use whatever the appropriate address is to listen for requests. In this case, that address is 172.168.2.165. It's essentially a meta-value - kind of like localhost.
Port 8888 is the default port number. It can be configured differently as well.
So, if you set KG_URL only using export, then I assume you set that value PRIOR to launching JupyterHub, and Hub must transfer its environment to the spawned Notebook servers - although that would contradict what c.spawner.env_keep is meant for. (Perhaps the env is not kept unless env_keep is configured - I don't know.) However, given your Notebook server is able to locate EG, you should set the KG_*_TIMEOUT values in the same manner.
Once you get things going that way, you should transition to using configuration files.
@kevin-bates Your concern makes sense; here is the order of configuration:
1. Start EG.
2. Install nb2kg using pip and jupyter serverextension enable --py nb2kg --sys-prefix.
3. Set env variables (this is the part worth examining):
export KG_URL=... (this works, according to what you said? And yes, it is set before launching JupyterHub)
export KG_REQUEST_TIMEOUT=... (this does not work, as we discussed before)
4. Launch jupyter notebook or JupyterHub.
And I do not think it makes sense that KG_URL works but KG_REQUEST_TIMEOUT does not. So I ran export KG_URL= to unset the variable, and EG and JupyterHub still work, which suggests KG_URL may be unnecessary?
Currently, the only connection I can find between notebook/JupyterHub and the kernel (KG or EG) is the --NotebookApp.kernel_manager... args, and that is just the connection with nb2kg. By the way, I have not yet found how nb2kg connects to EG.
After all, anyway, this vanilla Jupyter Notebook and JupyterHub solution finally works. Congratulations!
@Litchilitchy - I have some good news for you - if you're still interested. :smiley: (I apologize for the inconvenience this caused and appreciate your patience!)
I believe the reason KG_URL
appears to work (even when not set) is because its default value is http://127.0.0.1:8888
and I believe your EG is located on the same server as your jupyterhub.
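That default-resolution behavior can be sketched as follows. This is a simplified model of how an NB2KG-style client would build its gateway endpoint, not nb2kg's actual code; the default URL matches the one stated above. Treating an empty KG_URL as unset also explains why `export KG_URL=` still worked:

```python
import os

def gateway_endpoint(path='/api/kernels'):
    """Resolve the gateway base URL, falling back to localhost when KG_URL
    is unset or empty. This fallback is why co-locating EG with JupyterHub
    appears to 'just work' even without KG_URL."""
    base = os.getenv('KG_URL') or 'http://127.0.0.1:8888'
    return base.rstrip('/') + path
```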
I decided to install jupyterhub and get this working (I'm sorry I didn't do this sooner!). Here's what I did...
1. No KG_URL or KG_REQUEST_TIMEOUT values set using export.
2. pip install jupyterhub. Version 1.0.0 was just released, so you might want to make sure you're running that version.
3. jupyterhub --generate-config. This produces a file named jupyterhub_config.py. All options are commented out, but it's great to see what can be set.
4. Updated jupyterhub_config.py with the following. I added mine at the top of the file.
with the following. I added mine at the top of the file.from jupyterhub.spawner import LocalProcessSpawner
class NB2KGProcessSpawner(LocalProcessSpawner):
def user_env(self, env):
env['KERNEL_USERNAME'] = self.user.name
return env
c.JupyterHub.spawner_class = NB2KGProcessSpawner
c.Spawner.args = ['--NotebookApp.session_manager_class=nb2kg.managers.SessionManager', '--NotebookApp.kernel_manager_class=nb2kg.managers.RemoteKernelManager', '--NotebookApp.kernel_spec_manager_class=nb2kg.managers.RemoteKernelSpecManager']
c.Spawner.environment = {'KG_URL': 'http://my-gateway-server:8888', 'KG_CONNECT_TIMEOUT': '60', 'KG_REQUEST_TIMEOUT': '60'}
This uses a custom spawner class in order to set the KERNEL_USERNAME environment variable to the current hub user that is launching the Notebook instance.
c.Spawner.args holds the command-line values that you'd normally invoke Notebook with when the NB2KG extension is used.
c.Spawner.environment holds the other KG_ env values. You could also set these similarly to how KERNEL_USERNAME is set in the custom spawner, but I used both approaches for example purposes.
Once the configuration is set, start jupyterhub. You should then be able to hit your EG server for kernels.
I've also gone ahead and increased the KG timeout defaults to 60 seconds: https://github.com/jupyter/nb2kg/pull/35
@kevin-bates Thanks for your patience! Great - finally all of this makes sense: JupyterHub owns its own environment, and KG_URL has a default value.
I think this issue can be closed. Besides, may I ask just a few more questions?
1. As you said, we currently put JupyterHub and EG on the same server. (The most expensive things are the kernels and Spark jobs, which run on YARN slave nodes in yarn-cluster mode, so JupyterHub and EG should not cost much. We currently put them on the YARN master machine and hope this doesn't have significant drawbacks?)
2. In the current deployment, EG sits on the YARN master and submits jobs to YARN. I think it should also work if EG is not located on the YARN master?
3. Do kernels live on the YARN slaves (the driver node)? If yes, which part of EG makes that happen? And the SparkContext lives in the kernel, am I right?
Please feel free to ask questions any time - either here, at the Enterprise Gateway repo, gitter room, or the discourse forum!
For the Python and R launchers, the launcher creates the SparkContext on behalf of the kernel. Since they're the same process, the context is available during execution of each cell. For the Scala kernels, the embedded Apache Toree kernel already has code to create the SparkContext. The other job of the launchers is to listen for interrupt/shutdown operations.
I hope that helps.
As suggested, I'm going to close this issue.
I am using nb2kg via
pip install "git+https://github.com/jupyter-incubator/nb2kg.git#egg=nb2kg"
in a single-user JupyterHub (so I guess the behavior is the same as with Jupyter Notebook). I found there are several conditions under which I get a Timeout error, and I guess they are due to nb2kg:
1) Open multiple notebooks at the same time - for example, while the first one is being created via the "New" button (kernel not yet ready), open a second one. In this case, at most 1 kernel becomes ready and the others get this error.
2) Use
--jars
in the Spark config to upload a large file (in my case 350 MB). This time I created the notebooks one by one (not creating the next until the previous one was ready or had failed). I created 5 notebooks; only the second one works and the others get the error.
The error seems to be the same in both cases. So, any idea?
By the way, I forked the newest nb2kg and found that
client.fetch
actually does not call
future.result()
. So is the nb2kg on pip obsolete?