jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Getting 'HTTP 403: Failed to authenticate SSHClient with password-less SSH' in Enterprise Gateway logs for YARN Client mode but from where? #394

Closed sedatkestepe closed 6 years ago

sedatkestepe commented 6 years ago

Hello, I wanted to evaluate Jupyter Enterprise Gateway on our edge node, which has the Hadoop (ecosystem) client binaries. The trouble I am having is that when I open a new Spark - Python (YARN Client Mode) notebook, I receive a 500 in the client notebook logs from EG and a 403 in the EG logs, but the source of the 403 is ambiguous in the logging even though EG runs in debug mode.

Enterprise Gateway logs are below. What could be the source of the 403? Thanks in advance.

PS: I have an additional Spark installation (2.3.1, which I plan to use after an initial successful run) in PATH, but it doesn't seem to be the problem since it is not included in the appended CMD string.

[D 2018-07-23 19:40:11.403 EnterpriseGatewayApp] Found kernel spark_python_yarn_client in /opt/anaconda/anaconda2/share/jupyter/kernels
[D 2018-07-23 19:40:11.403 EnterpriseGatewayApp] Found kernel python2 in /opt/anaconda/anaconda2/share/jupyter/kernels
[I 180723 19:40:11 web:2106] 200 GET /api/kernelspecs (192.168.1.2) 2.92ms
[D 2018-07-23 19:40:11.576 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: spark_python_yarn_client
[D 2018-07-23 19:40:11.582 EnterpriseGatewayApp] Instantiating kernel 'Spark - Python (YARN Client Mode)' with process proxy: enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy
[D 2018-07-23 19:40:11.583 EnterpriseGatewayApp] Response socket launched on 192.168.1.3, port: 46938 using 5.0s timeout
[D 2018-07-23 19:40:11.583 EnterpriseGatewayApp] Starting kernel: [u'/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh', u'/run/user/0/jupyter/kernel-21b04bd9-12ea-4749-801c-3ca20ac2f0bc.json', u'--RemoteProcessProxy.response-address', u'192.168.1.3:46938', u'--RemoteProcessProxy.port-range', u'0..0', u'--RemoteProcessProxy.spark-context-initialization-mode', u'lazy', u'--EnterpriseGatewayApp.authorized_users', u"['guest','sedat']"]
[D 2018-07-23 19:40:11.584 EnterpriseGatewayApp] Launching kernel: Spark - Python (YARN Client Mode) with command: [u'/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh', u'/run/user/0/jupyter/kernel-21b04bd9-12ea-4749-801c-3ca20ac2f0bc.json', u'--RemoteProcessProxy.response-address', u'192.168.1.3:46938', u'--RemoteProcessProxy.port-range', u'0..0', u'--RemoteProcessProxy.spark-context-initialization-mode', u'lazy', u'--EnterpriseGatewayApp.authorized_users', u"['guest']"]
[D 2018-07-23 19:40:11.585 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'EG_MAX_PORT_RANGE_RETRIES': '5', u'SPARK_OPTS': u'--master yarn --deploy-mode client --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID}', u'SPARK_YARN_USER_ENV': u'PYTHONUSERBASE=/home/yarn/.local,PYTHONPATH=${HOME}/.local/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip,PATH=/opt/anaconda2/bin:$PATH', 'KERNEL_ID': u'21b04bd9-12ea-4749-801c-3ca20ac2f0bc', u'PYTHONPATH': u'${HOME}/.local/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip', 'KERNEL_GATEWAY': '1', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_IMPERSONATION_ENABLED': 'False', u'KERNEL_USERNAME': u'guest', u'SPARK_HOME': u'/usr/hdp/current/spark2-client', 'PATH': '/opt/anaconda/anaconda2/bin:/opt/anaconda/anaconda2/bin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/spark/spark-2.3.1-bin-hadoop2.7/bin', u'PYSPARK_PYTHON': u'/opt/anaconda2/bin/python', u'LAUNCH_OPTS': u'', u'KERNEL_PASSWORD': u'guest-password'}
[D 2018-07-23 19:40:11.585 EnterpriseGatewayApp] Invoking cmd: 'export KERNEL_USERNAME="guest";export KERNEL_ID="21b04bd9-12ea-4749-801c-3ca20ac2f0bc";export EG_IMPERSONATION_ENABLED="False";export SPARK_OPTS="--master yarn --deploy-mode client --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID}";export KERNEL_USERNAME="guest";export PYTHONPATH="${HOME}/.local/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip";export SPARK_YARN_USER_ENV="PYTHONUSERBASE=/home/yarn/.local,PYTHONPATH=${HOME}/.local/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip,PATH=/opt/anaconda2/bin:$PATH";export SPARK_HOME="/usr/hdp/current/spark2-client";export PYSPARK_PYTHON="/opt/anaconda2/bin/python";export LAUNCH_OPTS="";export KERNEL_PASSWORD="guest-password";nohup /usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh /run/user/0/jupyter/kernel-21b04bd9-12ea-4749-801c-3ca20ac2f0bc.json --RemoteProcessProxy.response-address 192.168.1.3:46938 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy --EnterpriseGatewayApp.authorized_users ['guest'] >> /tmp/kernel-21b04bd9-12ea-4749-801c-3ca20ac2f0bc.log 2>&1 & echo $!' on host: localhost
[E 2018-07-23 19:40:11.605 EnterpriseGatewayApp] Failed to authenticate SSHClient with password-less SSH
[E 2018-07-23 19:40:11.605 EnterpriseGatewayApp] Failure occurred starting remote kernel on '127.0.0.1'. Exception: 'HTTP 403: Failed to authenticate SSHClient with password-less SSH'.
[E 180723 19:40:11 web:2106] 500 POST /api/kernels (192.168.1.2) 32.33ms
kevin-bates commented 6 years ago

Hello @sedatkestepe, thank you for the issue. There are a number of observations here, so I'm just going to enumerate them.

  1. The general issue is that password-less ssh is not configured. Since your kernelspec is configured for the DistributedProcessProxy, Enterprise Gateway requires password-less ssh even when only localhost is being accessed, i.e., even for loopback operations (see the SSH setup sketch after this list).

  2. Since you're using the DistributedProcessProxy, you probably want to access other nodes of your cluster. These nodes must be specified via either the env EG_REMOTE_HOSTS or --EnterpriseGatewayApp.remote_hosts when starting Enterprise Gateway (see the launch sketch after this list). There are ways to configure the kernelspec itself to use its own remote hosts, but that's more of an advanced option. Please refer to the Enabling YARN Client Mode page in the docs. Password-less ssh must still be configured across whatever nodes are in use. Also, when configuring remote hosts with the DistributedProcessProxy, every potential host must contain the same kernelspec file hierarchy in the same location as on the Gateway server.

  3. The kernelspec files provided in the distribution are samples. Make sure all paths are correct for your configuration. This isn't the issue you're seeing, but I want to point it out.

  4. You might find YARN Cluster Mode easier to get working since it doesn't require password-less ssh or distribution of the kernelspec hierarchy. Just be sure to configure EG_YARN_ENDPOINT or --EnterpriseGatewayApp.yarn_endpoint (see the launch sketch after this list). The important item here is that the application name contain the kernel's id - which can be referenced via KERNEL_ID. The sample kernelspecs set the name to KERNEL_ID entirely, although the id just needs to be contained within the application name.

  5. A good first step may be to simply launch an out-of-the-box python kernel (python2 or python3), i.e., one not provided by our kernelspecs tar file. This will launch a python kernel local to Enterprise Gateway, essentially behaving like Jupyter Kernel Gateway (see the curl sketch after this list).
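For item 1, here's a minimal sketch of configuring password-less ssh for the loopback case, assuming a standard OpenSSH setup under the user that runs Enterprise Gateway (adjust key type and paths to your environment):

    # generate a key pair without a passphrase (skip if one already exists)
    ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
    # authorize the key for this same user; this covers ssh to localhost
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
    # verify: this should print the hostname without prompting for a password
    ssh localhost hostname

For actual remote hosts, the same public key can be distributed with ssh-copy-id user@node.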
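For items 2 and 4, a sketch of how these options might be supplied when starting Enterprise Gateway; the host names and ResourceManager address are placeholders for your environment:

    # DistributedProcessProxy (YARN client mode): hosts on which kernels may be launched
    export EG_REMOTE_HOSTS="node1,node2,node3"
    # YARN cluster mode: the YARN ResourceManager REST endpoint
    export EG_YARN_ENDPOINT="http://yarn-master:8088/ws/v1/cluster"
    jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0

Note that the SPARK_OPTS in your env dump above already embeds the kernel's id in the application name: --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID}.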
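And for item 5, a quick sanity check against the gateway's REST API, assuming EG is listening on its default port 8888:

    # start a vanilla python2 kernel local to Enterprise Gateway
    curl -X POST -H 'Content-Type: application/json' \
         -d '{"name": "python2"}' \
         http://localhost:8888/api/kernels
    # list running kernels
    curl http://localhost:8888/api/kernels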

I hope this helps.

sedatkestepe commented 6 years ago

Hello @kevin-bates ,

Thanks for your response.

Let me detail what I am trying to do: We have a Hortonworks (HDP) Hadoop cluster which includes Spark 2.2 in its stack. Requirements recently came up for new features in Spark 2.3.0 and for more cluster computing power, and I didn't want to get into a full HDP stack upgrade, so I placed a secondary Spark release (2.3.1) on our edge server. While searching for a configuration that would let me submit Jupyter notebooks to our existing resource scheduler, YARN (which I also want to keep as the only resource scheduler), I found Enterprise Gateway and NB2KG.

kevin-bates commented 6 years ago

@sedatkestepe - thank you for the information.

Yes, all worker (data) nodes require the same Spark installation files. That's true even if you want to use YARN client mode.

Once you have your Spark installation working, we recommend that Enterprise Gateway be installed on the YARN master node and Notebook (with the NB2KG extension) installed on your various clients (i.e., data scientists' PCs). You would then set the KG_URL env on those PCs to point at the YARN master node where EG resides. We also support co-located Notebook servers, but that's not the use case we're targeting since we'd prefer a bring-your-own-notebook model.
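As a sketch of that client-side setup, assuming NB2KG is installed via pip and the gateway host is named yarn-master (a placeholder), along the lines of the NB2KG README:

    pip install nb2kg
    jupyter serverextension enable --py nb2kg --sys-prefix
    # point the local notebook server at the Enterprise Gateway instance
    export KG_URL=http://yarn-master:8888
    jupyter notebook \
      --NotebookApp.session_manager_class=nb2kg.managers.SessionManager \
      --NotebookApp.kernel_manager_class=nb2kg.managers.RemoteKernelManager \
      --NotebookApp.kernel_spec_manager_class=nb2kg.managers.RemoteKernelSpecManager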

If you want to fall back to YARN client mode operation using the DistributedProcessProxy, then you'd need to configure password-less SSH across whatever nodes you wish to launch kernels on - typically these are the YARN nodes. These nodes/hosts are expressed via the EG_REMOTE_HOSTS env or the --EnterpriseGatewayApp.remote_hosts command-line option.

kevin-bates commented 6 years ago

Closing due to lack of activity, but I hope that's because the answer was sufficient. If it was not, please re-open the issue along with what else you need. Thank you.

amangarg96 commented 5 years ago

Hi,

I am using Jupyter Enterprise Gateway in YARN Cluster Mode and launching the Jupyter Lab server on a remote Linux box.

I am facing the same 'HTTP 403: Failed to authenticate SSHClient with password-less SSH' issue; here is the corresponding log.

[I 2018-12-01 19:45:59.421 EnterpriseGatewayApp] Kernel shutdown: 0836005e-2a6b-4c5a-a825-f49bcdef5f30
[I 2018-12-01 19:49:56.613 EnterpriseGatewayApp] KernelRestarter: restarting kernel (1/5), keep random ports
[W 2018-12-01 19:49:56.613 EnterpriseGatewayApp] Remote kernel (d90ff5df-2621-4a36-a0d3-1be2f8da1450) will not be automatically restarted since there are no clients connected at this time.
[W 2018-12-01 19:49:56.618 EnterpriseGatewayApp] Termination of application 'application_1542713397395_139615' failed with exception: 'Response finished with status: 500. Details: {"RemoteException":{"exception":"WebApplicationException","message":"com.sun.jersey.api.MessageException: A message body reader for Java class org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppState, and Java type class org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppState, and MIME media type application/octet-stream was not found.\nThe registered message body readers compatible with the MIME media type are:\napplication/octet-stream ->\n com.sun.jersey.core.impl.provider.entity.ByteArrayProvider\n com.sun.jersey.core.impl.provider.entity.FileProvider\n com.sun.jersey.core.impl.provider.entity.InputStreamProvider\n com.sun.jersey.core.impl.provider.entity.DataSourceProvider\n com.sun.jersey.core.impl.provider.entity.RenderedImageProvider\n/ ->\n com.sun.jersey.core.impl.provider.entity.FormProvider\n com.sun.jersey.json.impl.provider.entity.JSONJAXBElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONArrayProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONObjectProvider$General\n com.sun.jersey.core.impl.provider.entity.StringProvider\n com.sun.jersey.core.impl.provider.entity.ByteArrayProvider\n com.sun.jersey.core.impl.provider.entity.FileProvider\n com.sun.jersey.core.impl.provider.entity.InputStreamProvider\n com.sun.jersey.core.impl.provider.entity.DataSourceProvider\n com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$General\n com.sun.jersey.core.impl.provider.entity.ReaderProvider\n com.sun.jersey.core.impl.provider.entity.DocumentProvider\n com.sun.jersey.core.impl.provider.entity.SourceProvider$StreamSourceReader\n com.sun.jersey.core.impl.provider.entity.SourceProvider$SAXSourceReader\n com.sun.jersey.core.impl.provider.entity.SourceProvider$DOMSourceReader\n com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONListElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JacksonProviderProxy\n com.sun.jersey.core.impl.provider.entity.XMLRootElementProvider$General\n com.sun.jersey.core.impl.provider.entity.XMLListElementProvider$General\n com.sun.jersey.core.impl.provider.entity.XMLRootObjectProvider$General\n com.sun.jersey.core.impl.provider.entity.EntityHolderReader\n","javaClassName":"javax.ws.rs.WebApplicationException"}}'. Continuing...
[E 2018-12-01 19:49:56.673 EnterpriseGatewayApp] Failed to authenticate SSHClient with password-less SSH
[W 2018-12-01 19:49:56.673 EnterpriseGatewayApp] Remote signal(15) to '-32518' on host '' failed with exception 'HTTP 403: Failed to authenticate SSHClient with password-less SSH'.

I believe this has interfered with the kernel lifecycle, and hence I am getting 'orphan kernels'. Orphan kernels are ones for which the spark-submit was done and the job goes into the running state, but the notebook server does not interact with the kernel, so it does not get shut down when the notebook server is shut down.
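As an aside, one way to find and clean up such orphaned YARN applications manually is with the standard yarn CLI (the application id below is taken from the log above):

    # list running YARN applications; the kernel id appears in the application name
    yarn application -list -appStates RUNNING
    # kill the orphaned application
    yarn application -kill application_1542713397395_139615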