jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running? #441

Closed ridwanoabdulazeez closed 6 years ago

ridwanoabdulazeez commented 6 years ago

2018-09-11 13:46:46.200 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[E 2018-09-11 13:46:46.201 EnterpriseGatewayApp] Error occurred during launch of KernelID: efea0f35-350c-42ca-a05a-efd0c17600e8. Check Enterprise Gateway log for more information.
[E 180911 13:46:46 web:2106] 500 POST /api/kernels (127.0.0.1) 518.70ms
Check Enterprise Gateway log for more information
Error occurred during launch of KernelID: efea0f35-350c-42ca-a05a-efd0c17600e8. Check Enterprise Gateway log for more information.
Enterprise

NOTE: I have Jupyter Notebook and Jupyter Enterprise Gateway installed on the same server because I don't have access to the YarnMaster cluster. Also, my EG_ENDPOINT points to the resource manager (cluster), and my KG_URL points to the URL I got after launching EG.

Any clue on how to go about this? Thanks in advance.

kevin-bates commented 6 years ago

I would enable debug on your EG command line (--log-level=DEBUG). This should provide some more details. I would also use the actual hostname for your EG_YARN_ENDPOINT host.

Please provide the complete EG log after restarting and reattempting the kernel launch. This kind of thing is probably a timeout, and the underlying issue will likely be within the kernelspec configuration settings, unless you're using an HDP platform that matches the kernelspec examples we provide. As a result, it will likely be necessary for you to also include your kernelspec files: kernel.json (and bin/run.sh, if applicable).

ridwanoabdulazeez commented 6 years ago

This is my EG log:

Starting IPython kernel for Spark in Yarn Cluster mode on behalf of user root

Warning: Could not find the WD Fusion Client jars ~

ridwanoabdulazeez commented 6 years ago

kernel.json file:

{
  "language": "python",
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "process_proxy": {
    "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "PYSPARK_PYTHON": "/opt/anaconda/bin/python",
    "PYTHONPATH": "${HOME}/.local/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --queue P_NO_SLA --name ${KERNEL_ID:-ERRORNOKERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/yarn/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=${HOME}/.local/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda/bin:$PATH",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "{connection_file}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.port-range",
    "{port_range}",
    "--RemoteProcessProxy.spark-context-initialization-mode",
    "lazy"
  ]
}

ridwanoabdulazeez commented 6 years ago

#!/usr/bin/env bash

if [ "${EG_IMPERSONATION_ENABLED}" = "True" ]; then
  IMPERSONATION_OPTS="--proxy-user ${KERNEL_USERNAME:-UNSPECIFIED}"
  USER_CLAUSE="as user ${KERNEL_USERNAME:-UNSPECIFIED}"
else
  IMPERSONATION_OPTS=""
  USER_CLAUSE="on behalf of user ${KERNEL_USERNAME:-UNSPECIFIED}"
fi

echo
echo "Starting IPython kernel for Spark in Yarn Cluster mode ${USER_CLAUSE}"
echo

if [ -z "${SPARK_HOME}" ]; then
  echo "SPARK_HOME must be set to the location of a Spark distribution!"
  exit 1
fi

PROG_HOME="$(cd "`dirname "$0"`"/..; pwd)"

set -x
eval exec \
     "${SPARK_HOME}/bin/spark-submit" \
     "${SPARK_OPTS}" \
     "${IMPERSONATION_OPTS}" \
     "${PROG_HOME}/scripts/launch_ipykernel.py" \
     "${LAUNCH_OPTS}" \
     "$@"
set +x

ridwanoabdulazeez commented 6 years ago

@kevin-bates

ridwanoabdulazeez commented 6 years ago

[D 2018-09-11 18:52:52.072 EnterpriseGatewayApp] Yarn cluster kernel launched using YARN endpoint: http://localhost:8088/ws/v1/cluster, pid: 610, Kernel ID: 6540feac-ab35-43f5-bf97-2adf09dbcdf2, cmd: '[u'/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', u'/root/.local/share/jupyter/runtime/kernel-6540feac-ab35-43f5-bf97-2adf09dbcdf2.json', u'--RemoteProcessProxy.response-address', u'172.17.0.2:36012', u'--RemoteProcessProxy.port-range', u'0..0', u'--RemoteProcessProxy.spark-context-initialization-mode', u'lazy']'

NOTE: I entered a different EG_YARN_ENDPOINT. I don't know why it is pointing to this YARN endpoint: http://localhost:8088/ws/v1/cluster

ridwanoabdulazeez commented 6 years ago

I am using one server, and both nb2kg and Enterprise Gateway are installed on it. Here are the steps I took:

export KG_URL=http://<0.0.0.0>:8888 #ip generated during the launch of EG
jupyter serverextension enable --py nb2kg --sys-prefix

Server 2 - Jupyter Enterprise Gateway (jeg is the name of the conda environment)

export SPARK_HOME=SPARK_HOME:/usr/hdp/current/spark2-client
export EG_YARN_ENDPOINT=http:///ws/v1/cluster

Then I launched JEG using the following command:

jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG

After launching JEG, I launched the Jupyter Notebook server in another tab using:

jupyter notebook --no-browser --ip=* --NotebookApp.session_manager_class=nb2kg.managers.SessionManager --NotebookApp.kernel_manager_class=nb2kg.managers.RemoteKernelManager --NotebookApp.kernel_spec_manager_class=nb2kg.managers.RemoteKernelSpecManager --debug

ridwanoabdulazeez commented 6 years ago

Thanks in advance @kevin-bates

kevin-bates commented 6 years ago

If EG is not honoring the EG_YARN_ENDPOINT that you set (localhost:8088 is the default), then you should make sure it's exported in the environment prior to execution. We typically wrap the startup of EG in a script, since it (like most Jupyter apps) has a fairly rich set of possible arguments. The script would also redirect stdout and stderr to a file for easier inspection later.
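A wrapper of the kind described above might look like the following minimal sketch. The host name `rm-host.example.com`, the log path, and the function name `start_eg` are placeholders and not part of EG; the command-line flags are the ones from the launch command quoted in this thread.

```shell
#!/usr/bin/env bash
# Sketch of an EG startup wrapper; host name and log path are placeholders.

# Export the endpoint before EG starts so the gateway process inherits it.
export EG_YARN_ENDPOINT="${EG_YARN_ENDPOINT:-http://rm-host.example.com:8088/ws/v1/cluster}"
EG_LOG="${EG_LOG:-./enterprise_gateway.log}"

start_eg() {
  # stdout and stderr are appended to EG_LOG so the whole session
  # can be inspected (and attached to an issue) later.
  jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
    >> "${EG_LOG}" 2>&1 &
  EG_PID=$!
  echo "EG pid=${EG_PID} endpoint=${EG_YARN_ENDPOINT} log=${EG_LOG}"
}

# Start only when asked to, e.g.: ./start_eg.sh start
if [ "${1:-}" = "start" ]; then
  start_eg
fi
```

Because the `export` happens in the same script that launches the gateway, there is no question of which shell or tab the variable was set in.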

At any rate, it looks like your spark configuration may not be fully satisfied based on this entry...

/usr/hdp/current/spark2-client/bin/spark-class: line 87: /etc/alternatives/java_sdk_1.8.0/bin/java: No such file or directory

I also notice that your PATH is referencing a different java than spark-class: /app/anaconda2/bin:/app/anaconda2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/etc/alternatives/jre_1.8.0_openjdk/bin:/etc/alternatives/jre_1.8.0_openjdk/jre/bin

Lastly, your kernel.json's entries reference /opt/anaconda while your native PATH has /app/anaconda2 (twice). So there may be some anomalies stemming from different anaconda installations here as well (probably more down the road).

So first tackle the missing java stemming from the spark-submit invocation and capture the EG output in a file. I suggest getting spark-submit working outside of EG first to verify your spark installation before proceeding.
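Before running spark-submit by hand, a small pre-flight check can catch the kind of missing-binary problem shown above. This is only a sketch: `check_spark_prereqs` is a hypothetical helper, and it checks nothing beyond the presence of `spark-submit` and a `java` executable.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: verify Java and spark-submit before involving EG.

check_spark_prereqs() {
  # $1: Spark home directory (falls back to SPARK_HOME)
  spark_home="${1:-${SPARK_HOME:-}}"
  status=0
  if [ -z "${spark_home}" ]; then
    echo "SPARK_HOME is not set"
    return 1
  fi
  if [ ! -x "${spark_home}/bin/spark-submit" ]; then
    echo "missing or not executable: ${spark_home}/bin/spark-submit"
    status=1
  fi
  # Accept java either on PATH or under JAVA_HOME.
  if ! command -v java >/dev/null 2>&1 && [ ! -x "${JAVA_HOME:-/nonexistent}/bin/java" ]; then
    echo "no java found on PATH or under JAVA_HOME"
    status=1
  fi
  if [ "${status}" -eq 0 ]; then
    echo "prerequisites look OK"
  fi
  return "${status}"
}
```

Once `check_spark_prereqs /usr/hdp/current/spark2-client` passes, submitting any small known-good job with the same `--master yarn --deploy-mode cluster --queue ...` options as in kernel.json should confirm the Spark side independently of EG.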

Since it's clear that you're hitting the EG from the NB2KG-enabled notebook, that config is probably fine. I would, however, suggest you redirect the Notebook stdout and stderr to a separate file as well, just so we have a cleaner/clearer separation of the two.

Thanks.

ridwanoabdulazeez commented 6 years ago

Thanks @kevin-bates. I was able to fix the Java issue, and I am getting this error now after exporting EG_YARN_ENDPOINT.

Though I can see that an application ID and kernel name have been assigned.

[W 2018-09-12 17:22:39.182 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:39.684 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:40.186 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:40.688 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:41.190 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:41.192 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[E 180912 17:22:41 web:1621] Uncaught exception POST /api/kernels (127.0.0.1)
HTTPServerRequest(protocol='http', host='0.0.0.0:8891', method='POST', uri='/api/kernels', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/web.py", line 1543, in _execute
    result = yield result
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/kernel_gateway/services/kernels/handlers.py", line 71, in post
    yield super(MainKernelHandler, self).post()
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/notebook/services/kernels/handlers.py", line 47, in post
    kernel_id = yield gen.maybe_future(km.start_kernel(kernel_name=model['name']))
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 28, in start_kernel
    kernel_id = yield gen.maybe_future(super(RemoteMappingKernelManager, self).start_kernel(args, kwargs))
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/kernel_gateway/services/kernels/manager.py", line 81, in start_kernel
    kernel_id = yield gen.maybe_future(super(SeedingMappingKernelManager, self).start_kernel(args, kwargs))
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
    raise_exc_info(self._exc_info)
  File "/app/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 315, in wrapper
    yielded = next(result)
  File "/app/anaconda2/lib/python2.7/site-packages/notebook/services/kernels/kernelmanager.py", line 148, in start_kernel
    super(MappingKernelManager, self).start_kernel(kwargs)
  File "/app/anaconda2/lib/python2.7/site-packages/jupyter_client/multikernelmanager.py", line 110, in start_kernel
    km.start_kernel(kwargs)
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 102, in start_kernel
    return super(RemoteKernelManager, self).start_kernel(kw)
  File "/app/anaconda2/lib/python2.7/site-packages/jupyter_client/manager.py", line 259, in start_kernel
    kw)
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 131, in _launch_kernel
    return self.process_proxy.launch_process(kernel_cmd, kw)
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 53, in launch_process
    self.confirm_remote_startup(kernel_cmd, kw)
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 148, in confirm_remote_startup
    self.handle_timeout()
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 200, in handle_timeout
    self.kill()
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 115, in kill
    super(YarnClusterProcessProxy, self).kill()
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/processproxy.py", line 203, in kill
    result = self.terminate() # Send -15 signal first
  File "/app/anaconda2/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/processproxy.py", line 225, in terminate
    result = self.local_proc.terminate()
  File "/app/anaconda2/lib/python2.7/subprocess.py", line 1274, in terminate
    self.send_signal(signal.SIGTERM)
  File "/app/anaconda2/lib/python2.7/subprocess.py", line 1269, in send_signal
    os.kill(self.pid, sig)
OSError: [Errno 3] No such process
[E 180912 17:22:41 web:2106] 500 POST /api/kernels (127.0.0.1) 31169.43ms
[W 2018-09-12 17:22:21.611 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:22.113 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:22.615 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:23.118 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:23.620 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:24.123 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:24.624 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:25.126 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:25.628 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:26.130 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:26.631 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:27.133 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:27.635 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:28.137 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:28.638 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:29.140 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:29.642 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:30.144 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:30.646 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:31.148 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:31.650 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:32.152 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:32.654 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:33.156 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:33.658 EnterpriseGatewayApp] YARN end-point: 'http://localhost:8088/ws/v1/cluster' refused the connection. Is the resource manager running?
[W 2018-09-12 17:22:34.161 EnterpriseGatewayApp] YARN end-point: 'http

kevin-bates commented 6 years ago

Thanks for the update. So I assume that YARN endpoint is incorrect - right?

It's really strange that EG is not seeing the env variable, so there must be some subtle issue here. Perhaps you should try moving that value into the command-line option instead?

e.g., --EnterpriseGatewayApp.yarn_endpoint=http://foo:8088/ws/v1/cluster
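Put together with the launch flags used earlier in this thread, the full command line could be assembled as in this sketch. `eg_launch_command` is a hypothetical helper that only prints the command; `foo` is the same placeholder host as above and must be replaced with the real resource manager host.

```shell
#!/usr/bin/env bash
# Print (not run) an EG launch command that passes the YARN endpoint
# as a command-line option instead of relying on the environment.

eg_launch_command() {
  rm_host="$1"          # resource manager host, supplied by the caller
  rm_port="${2:-8088}"  # 8088 matches the port shown in the logs above
  printf '%s\n' "jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG --EnterpriseGatewayApp.yarn_endpoint=http://${rm_host}:${rm_port}/ws/v1/cluster"
}

eg_launch_command "foo"
```

Passing the value on the command line makes it visible in the process listing, so it is easy to confirm which endpoint the running gateway was actually given.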

ridwanoabdulazeez commented 6 years ago

Thanks @kevin-bates

I get the below error after adding --EnterpriseGatewayApp.yarn_endpoint=http://foo:8088/ws/v1/cluster.

[W 2018-09-12 22:39:38.663 EnterpriseGatewayApp] Query for kernel ID 'dd50d14b-9f0e-4529-a117-b6e968ddb020' failed with exception: <class 'yarn_api_client.errors.APIError'> - 'Response finished with status: 401. Details:

Error 401 Authentication required
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /ws/v1/cluster/apps. Reason:
<pre>    Authentication required</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/>                                                

</body>
</html>
'.  Continuing...

18/09/12 22:39:38 INFO impl.YarnClientImpl: Submitted application application_1536359064604_1161
18/09/12 22:39:38 INFO yarn.Client: Application report for application_1536359064604_1161 (state: ACCEPTED)
18/09/12 22:39:38 INFO yarn.Client:
     client token: Token { kind: YARN_CLIENT_TOKEN, service: }
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: P_NO_SLA
     start time: 1536791980515
     final status: UNDEFINED
     tracking URL: /proxy/application_1536359064604_1161/
     user:
18/09/12 22:39:38 INFO util.ShutdownHookManager: Shutdown hook called
18/09/12 22:39:38 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-4c2de855-1d6f-4a7f-b738-2888c90e013b
[W 2018-09-12 22:39:39.167 EnterpriseGatewayApp] Query for kernel ID 'dd50d14b-9f0e-4529-a117-b6e968ddb020' failed with exception: <class 'yarn_api_client.errors.APIError'> - 'Response finished with status: 401. Details:

Error 401 Authentication required
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /ws/v1/cluster/apps. Reason:
<pre>    Authentication required</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/>                                                

</body>
</html>
'.  Continuing...

[W 2018-09-12 22:39:39.673 EnterpriseGatewayApp] Query for kernel ID 'dd50d14b-9f0e-4529-a117-b6e968ddb020' failed with exception: <class 'yarn_api_client.errors.APIError'> - 'Response finished with status: 401. Details:

Error 401 Authentication required
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /ws/v1/cluster/apps. Reason:
<pre>    Authentication required</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/>                                                

</body>
</html>
'.  Continuing...

kevin-bates commented 6 years ago

OK - this implies you're getting further since it's a different issue - unless you literally used foo as your host name. :smile: You're going to need to figure out how to access your YARN REST API. I would use the following command:

curl -X GET http://foo:8088/ws/v1/cluster

You should get a response like the following:

{"clusterInfo":{"id":1532968067554,"startedOn":1532968067554,"state":"STARTED","haState":"ACTIVE","rmStateStoreName":"org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore","resourceManagerVersion":"2.7.3.2.6.1.0-129","resourceManagerBuildVersion":"2.7.3.2.6.1.0-129 from 45e64533cdee3edf67c7b88a0267c64c194f93e5 by jenkins source checksum 4719543db2c322b41224f9173c683e7","resourceManagerVersionBuiltOn":"2017-05-31T03:16Z","hadoopVersion":"2.7.3.2.6.1.0-129","hadoopBuildVersion":"2.7.3.2.6.1.0-129 from 45e64533cdee3edf67c7b88a0267c64c194f93e5 by jenkins source checksum deba7ab784606611731cd7c37443e1c","hadoopVersionBuiltOn":"2017-05-31T03:06Z","haZooKeeperConnectionState":"ResourceManager HA is not enabled."}}

Once you get the curl command working, use the endpoint as your parameter value.
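To tell the failure modes seen in this thread apart (connection refused versus 401), the curl check can be wrapped in a small probe. This is a sketch under assumptions: `probe_yarn` and `explain_status` are hypothetical helper names, and the interpretation of each status is a rule of thumb rather than anything EG itself reports.

```shell
#!/usr/bin/env bash
# Probe a YARN REST endpoint and explain the HTTP status that comes back.

explain_status() {
  case "$1" in
    200) echo "OK: endpoint reachable, use it as the EG YARN endpoint" ;;
    401) echo "Authentication required: the cluster is probably secured (e.g. SPNEGO)" ;;
    000) echo "No HTTP response: wrong host/port, or the resource manager is not running" ;;
    *)   echo "Unexpected HTTP status $1" ;;
  esac
}

probe_yarn() {
  # $1: full endpoint URL, e.g. http://foo:8088/ws/v1/cluster
  # curl prints 000 as the status code when no HTTP response was received.
  code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1")"
  explain_status "${code}"
}
```

For example, `probe_yarn http://foo:8088/ws/v1/cluster`. On a Kerberos-secured cluster, running `kinit` and then `curl --negotiate -u : <url>` is the usual way to test SPNEGO access by hand, provided the local curl build supports it.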

cc @lresende for other ideas/insights

ridwanoabdulazeez commented 6 years ago

Thanks @kevin-bates. I used the right parameter, yet I still get the same error. I noticed that as soon as I choose the YARN cluster mode kernel, it starts running and tries to connect to the cluster, and the kernel fails almost immediately, even without entering any code or clicking "run" in the notebook. Is this normal?

kevin-bates commented 6 years ago

It's normal while troubleshooting kernel startup. :smile: So your kernel is not fully starting. It's probably timing out. Most of the information will be in the EG log (if you've captured stdout and stderr to a file). Please attach that file. Also, you can find helpful information in the YARN logs (typically the stdout log). Please provide that as well. While you're at it, add the notebook log to the set. Tar or zip is fine.

If the issue is due to a timeout, set EG_KERNEL_LAUNCH_TIMEOUT to a value like 60 (seconds). Default is something like 30.
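As with the YARN endpoint, the timeout has to be exported in the environment that starts EG. A minimal sketch follows; `set_launch_timeout` is a hypothetical helper that only guards against a non-numeric value, and 60 is the example value from the comment above.

```shell
#!/usr/bin/env bash
# Raise the kernel launch timeout (in seconds) before starting EG.

set_launch_timeout() {
  case "$1" in
    ''|*[!0-9]*)
      echo "timeout must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  # Exported so that a gateway started from this shell inherits it.
  export EG_KERNEL_LAUNCH_TIMEOUT="$1"
}

set_launch_timeout 60
echo "EG_KERNEL_LAUNCH_TIMEOUT=${EG_KERNEL_LAUNCH_TIMEOUT}"
```

The gateway must be restarted from the same shell (or wrapper script) for the new value to take effect.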

lresende commented 6 years ago

OK, it seems that if your HDP cluster has security enabled, access to the YARN Resource Manager will be protected (see this). Let me set up a similar environment and make sure I provide you the necessary steps.

@kevin-bates It's probably timing out because it can't connect to YARN to get the kernel status?

lresende commented 6 years ago

@ridwanoabdulazeez Could you please describe what security options are enabled in your environment? I have security enabled + Kerberos and I don't get prompted for authorization when connecting to the YARN RM. Do you also have LDAP authentication enabled? Any other option? Are you going via Knox?

ridwanoabdulazeez commented 6 years ago

Thanks @kevin-bates and @lresende. I have security enabled. I authenticate with Kerberos, which uses my AD credentials. Also, the cluster has SPNEGO enabled.

ridwanoabdulazeez commented 6 years ago

Still waiting for your input, @kevin-bates and @lresende.

lresende commented 6 years ago

@ridwanoabdulazeez I am working on reproducing the same scenario you have. Most likely we will need to enable SPNEGO auth on the EG side.

ridwanoabdulazeez commented 6 years ago

Alright! Thanks @lresende. Looking forward to hearing from you soon.

lresende commented 6 years ago

@ridwanoabdulazeez I was able to get EG working in a SPNEGO environment. The required changes are mostly in the YARN Client API (see this PR). So I need to get that merged/released, and then update and release EG, which might take a few days to complete the release cycles. Thanks for your patience.

ridwanoabdulazeez commented 6 years ago

Thanks @lresende and @kevin-bates. I will wait patiently.

ridwanoabdulazeez commented 6 years ago

@lresende @kevin-bates I saw that it has been fixed. When will it be released?

lresende commented 6 years ago

@ridwanoabdulazeez I finally got everything ready, and have a pre-release available for you to test and provide us with some feedback:

Note that you should enable both user impersonation and YARN endpoint security as described in the User Impersonation doc section.

ridwanoabdulazeez commented 6 years ago

Thanks @lresende. The page for the pre-release is not available.

lresende commented 6 years ago

This is now available in the official EG 1.1.0 release. Please reopen if you continue to see issues on this release.