jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Run Enterprise Gateway on Yarn-Master inside Docker #873

Closed dnks23 closed 4 years ago

dnks23 commented 4 years ago

Description

Hi, I would like to create a basic docker-image that I can easily set up & replicate on any existing Yarn-Cluster. The idea would be to run the docker-container on the Yarn-Master, with the Gateway targeting that Yarn-Master, which is also the Docker-Host in this case. I've seen all the images that are provided in this repo and there are also some from the Jupyter stack. But those seem quite bloated, as they include their own installations of Hadoop/Spark/etc. in the image...

I wonder whether this is really necessary and what a minimal setup would look like that lets the gateway running in the container target the Hadoop/Spark already provided by the Yarn-Master/Docker-Host.

Environment

kevin-bates commented 4 years ago

Hi @dnks23. I believe the "bloated" image you're referring to is enterprise-gateway-demo, which is used for demonstration and integration tests and, yes, contains a complete Hadoop/Spark server. The image that would apply to this scenario is enterprise-gateway.

This is similar to #814 but with fewer restrictions. It's similar in that the launched kernels must communicate their connection information back to EG on the response address sent to them during launch. Today, each kernel launch responds to a different port hosted by EG. In EG 3.0 I hope to have a single response address to use so that it can be "published" in these kinds of environments.

I suspect you could circumvent this issue by setting up host-networking so that the response address port is accessible from the regular network. You may need to set the EG_PROHIBITED_LOCAL_IPS environment variable so that the internal docker IP (192.*) doesn't get produced in the response address calculation. We also have an EG_RESPONSE_IP env that could be used to pin the IP to which the remote kernels send their connection information.
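For illustration only, here's a rough sketch of what starting such a container might look like (the image name/tag and IP are placeholders, not a verified configuration):

# Sketch: run EG on the YARN master using host networking so the per-kernel
# response ports are reachable from the cluster network.
# <yarn-master-ip> is the host's (non-docker) IP -- adjust to your environment.
docker run -d --name enterprise-gateway \
  --network host \
  --env EG_RESPONSE_IP=<yarn-master-ip> \
  --env EG_PROHIBITED_LOCAL_IPS='192.*' \
  elyra/enterprise-gateway:<tag>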

Please give it a try and let us know how you progress. Thanks for your interest.

dnks23 commented 4 years ago

Thanks for the good reply @kevin-bates. So I spent some days trying to get this to work. Looks like it kind of does now by passing EG_RESPONSE_IP in the kernel.json. BTW: is this the right place to set this? Can it also be done on a global basis and not at the kernel level?

However, I am still facing one issue: I connect from a JupyterLab (on my local machine) to the Gateway using the --gateway-url option. I can select the kernel to start, the application gets scheduled in YARN and goes to the ACCEPTED state, and the logs look good, but then suddenly I get:

[W 2020-09-11 14:19:10.303 EnterpriseGatewayApp] Termination of application 'application_1599816470122_0023' failed with exception: 'Response finished with status: 405. Details: '.  Continuing...
[D 2020-09-11 14:19:15.358 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1599816470122_0023, kernel ID: 410c5e3b-60ae-4a29-90bb-abe4659cec4f, state: ACCEPTED, result: <coroutine object BaseProcessProxyABC.kill at 0x7ff3481fe200>
[D 2020-09-11 14:19:15.358 EnterpriseGatewayApp] response socket still open, close it
[E 2020-09-11 14:19:15.359 EnterpriseGatewayApp] KernelID: '410c5e3b-60ae-4a29-90bb-abe4659cec4f' launch timeout due to: YARN resources unavailable after 46.0 seconds for app application_1599816470122_0023, launch timeout: 40.0!  Check YARN configuration.

The notebook throws the error on kernel startup, but what I see in the YARN UI is that the application is RUNNING! I played a bit with EG_KERNEL_LAUNCH_TIMEOUT but couldn't get this to work. Any hints on this maybe?

Thanks in advance!

kevin-bates commented 4 years ago

Generally speaking, all envs prefixed with EG_ are read and processed by the Enterprise Gateway, not kernels. When EG_RESPONSE_IP is set, rather than use the local IP determined from within EG, it "blindly" uses that IP to formulate the value for the --RemoteProcessProxy.response_address value used during the launch of each remote kernel. Because your EG is running within a container (and not the local host) you will likely need to set it to the host's IP, since what is determined programmatically will likely be a docker IP. If your current response address is 192.* you'll likely need to add that wildcarded value to EG_PROHIBITED_LOCAL_IPS.
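In other words, it can be set globally in the environment from which EG itself is launched rather than in each kernel.json. A minimal sketch (the values are placeholders for your deployment, and the launch options shown are just typical ones):

# Set once in the environment that launches Enterprise Gateway.
export EG_RESPONSE_IP=<docker-host-ip>      # IP the remote kernels should respond to
export EG_PROHIBITED_LOCAL_IPS='192.*'      # ignore internal docker IPs when deriving the local IP
jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0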

What is likely happening is that the launched kernel is unable to respond back to EG with its connection information on this response address. Since the YARN application is essentially "on its own" at this point (due to other issues) and not terminated (as should have been the case) it will eventually go to a running state. However, since EG has lost "sight" of the kernel, you'll need to terminate the YARN application manually until you get things properly configured. At least this tells us the launch aspect of things is working, just not the required communication back to EG.

What should have happened, even in a failure state, is that EG should have noticed the faulty communication and issued a kill request against the YARN API. But that attempt failed here:

[W 2020-09-11 14:19:10.303 EnterpriseGatewayApp] Termination of application 'application_1599816470122_0023' failed with exception: 'Response finished with status: 405. Details: '.  Continuing...

and this coroutine portion doesn't look correct and implies version-related issues of some sort...

[D 2020-09-11 14:19:15.358 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1599816470122_0023, kernel ID: 410c5e3b-60ae-4a29-90bb-abe4659cec4f, state: ACCEPTED, result: <coroutine object BaseProcessProxyABC.kill at 0x7ff3481fe200>

leading me to the following questions:

  1. What versions of packages are you running? (Please provide the output of pip freeze)
  2. What version of Hadoop YARN are you running?
  3. Please provide the complete set of log messages from the point the kernel startup is initiated until its failure.
dnks23 commented 4 years ago

Thanks for the detailed explanations! Let me try to answer your questions:

  1. Output of pip freeze:

  2. YARN version: Thinking about this, my first guess is that here might lie the issue I am facing. As I am using a GCP Dataproc cluster for my tests (only Spark 2.4.6 & Hadoop >= 2.9 available), I simply set up the cluster with Hadoop YARN 2.10. However, in the Docker container I installed Spark 2.4.6 pre-built for Hadoop 2.7, as I wanted to bypass a custom build of Spark against Hadoop 2.10. I installed Hadoop 2.7.7 binaries in the Docker container and then just mount all necessary Hadoop/Spark config directories from the Docker host into the container to get access to HDFS and get Spark working from within the container using the underlying Dataproc configuration (a rough sketch of this mounting is shown after the log below). This seems to work, as I am able to access HDFS and Spark shells start up from within the container without errors. Might this version mismatch between the container's Hadoop/YARN and the Docker host's Hadoop/YARN cause this issue?

Hadoop/YARN version of Docker-Host: 2.10
Hadoop/YARN version of Container: 2.7.7
Spark version: 2.4.6

  3. Complete logs of kernel start-up: It might also be worth mentioning that I zipped a conda env for my kernel and distribute it to the workers via --conf spark.yarn.dist.archives=/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster.zip#spark_python_yarn_cluster.
[D 2020-09-14 07:52:06.430 EnterpriseGatewayApp] Starting kernel (async): ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '4798307d-dc70-452e-9d0c-0c4e87cb3356', '--RemoteProcessProxy.response-address', '10.132.0.25:39735', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2020-09-14 07:52:06.430 EnterpriseGatewayApp] Launching kernel: spark-python-yarn-cluster with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '4798307d-dc70-452e-9d0c-0c4e87cb3356', '--RemoteProcessProxy.response-address', '10.132.0.25:39735', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2020-09-14 07:52:06.430 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/usr/lib/jvm/java-1.8.0-openjdk-amd64//bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//jre:/opt/spark/bin:/opt/spark/sbin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'KERNEL_USERNAME': 'guest', 'KERNEL_LAUNCH_TIMEOUT': '40', 'KERNEL_WORKING_DIR': '/Users/user/path/work_tmp', 'EG_KERNEL_LAUNCH_TIMEOUT': '600', 'HADOOP_CONF_DIR': '/opt/hadoop/etc/hadoop/', 'SPARK_HOME': '/opt/spark', 'SPARK_CONF_DIR': '/opt/spark/conf', 'PYSPARK_PYTHON': '/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python', 'PYTHONPATH': '/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python', 'SPARK_OPTS': '--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python --conf spark.yarn.appMasterEnv.PATH=. spark_python_yarn_cluster/spark_python_yarn_cluster/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//jre:/opt/spark/bin:/opt/spark/sbin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ${KERNEL_EXTRA_SPARK_OPTS} --conf spark.yarn.dist.archives=/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster.zip#spark_python_yarn_cluster ', 'LAUNCH_OPTS': '', 'KERNEL_GATEWAY': '1', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '4798307d-dc70-452e-9d0c-0c4e87cb3356', 'KERNEL_LANGUAGE': 'python', 'EG_IMPERSONATION_ENABLED': 'False'}
[D 2020-09-14 07:52:06.438 EnterpriseGatewayApp] Yarn cluster kernel launched using YARN RM address: http://yarn-m:8088, pid: 11, Kernel ID: 4798307d-dc70-452e-9d0c-0c4e87cb3356, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '4798307d-dc70-452e-9d0c-0c4e87cb3356', '--RemoteProcessProxy.response-address', '10.132.0.25:39735', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'
Starting IPython kernel for Spark in Yarn Cluster mode on behalf of user guest
+ eval exec /opt/spark/bin/spark-submit '--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python --conf spark.yarn.appMasterEnv.PATH=.spark_python_yarn_cluster/spark_python_yarn_cluster/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//jre:/opt/spark/bin:/opt/spark/sbin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ${KERNEL_EXTRA_SPARK_OPTS} --conf spark.yarn.dist.archives=/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster.zip#spark_python_yarn_cluster ' '' /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py '' --RemoteProcessProxy.kernel-id 4798307d-dc70-452e-9d0c-0c4e87cb3356 --RemoteProcessProxy.response-address 10.132.0.25:39735 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
++ exec /opt/spark/bin/spark-submit --master yarn --deploy-mode cluster --name 4798307d-dc70-452e-9d0c-0c4e87cb3356 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python --conf spark.yarn.appMasterEnv.PATH=.spark_python_yarn_cluster/spark_python_yarn_cluster/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//jre:/opt/spark/bin:/opt/spark/sbin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin --conf spark.yarn.dist.archives=/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster.zip#spark_python_yarn_cluster /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py --RemoteProcessProxy.kernel-id 4798307d-dc70-452e-9d0c-0c4e87cb3356 --RemoteProcessProxy.response-address 10.132.0.25:39735 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
[D 2020-09-14 07:52:06.975 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:07.484 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:07.991 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:08.499 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:09.012 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:09.521 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:10.029 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:10.537 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:11.048 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:11.567 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:12.073 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[D 2020-09-14 07:52:12.580 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:13.088 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:13.597 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:14 INFO client.RMProxy: Connecting to ResourceManager at yarn-m/10.132.0.25:8032
[D 2020-09-14 07:52:14.117 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:14 INFO client.AHSProxy: Connecting to Application History server at yarn-m/10.132.0.25:10200
20/09/14 07:52:14 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
[D 2020-09-14 07:52:14.625 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:14 INFO conf.Configuration: resource-types.xml not found
20/09/14 07:52:14 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/09/14 07:52:14 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
20/09/14 07:52:14 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
20/09/14 07:52:14 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
20/09/14 07:52:14 INFO yarn.Client: Will allocate AM container, with 2304 MB memory including 384 MB overhead
20/09/14 07:52:14 INFO yarn.Client: Setting up container launch context for our AM
20/09/14 07:52:14 INFO yarn.Client: Setting up the launch environment for our AM container
20/09/14 07:52:14 INFO yarn.Client: Preparing resources for our AM container
20/09/14 07:52:14 INFO yarn.Client: Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster#spark_python_yarn_cluster -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/spark_python_yarn_cluster
[D 2020-09-14 07:52:15.136 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:15.642 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:16.151 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:16.659 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:17.165 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:17.672 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:18.204 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:18.720 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:19.229 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:19.738 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:20.245 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:20.755 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:21.261 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:21.772 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:22.283 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:22.791 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:22 INFO yarn.Client: Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/launch_ipykernel.py
20/09/14 07:52:22 INFO yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/pyspark.zip
20/09/14 07:52:22 INFO yarn.Client: Uploading resource file:/opt/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/py4j-0.10.7-src.zip
[D 2020-09-14 07:52:23.307 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:23.814 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:23 INFO yarn.Client: Uploading resource file:/hadoop/spark/tmp/spark-c39f9d45-c74c-45d8-8685-f03442796cd4/__spark_conf__5311454277149128232.zip -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/__spark_conf__.zip
20/09/14 07:52:23 INFO spark.SecurityManager: Changing view acls to: root
20/09/14 07:52:23 INFO spark.SecurityManager: Changing modify acls to: root
20/09/14 07:52:23 INFO spark.SecurityManager: Changing view acls groups to: 
20/09/14 07:52:23 INFO spark.SecurityManager: Changing modify acls groups to:
20/09/14 07:52:23 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
[D 2020-09-14 07:52:24.321 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:24.833 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:25.342 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:25.856 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:25 INFO yarn.Client: Submitting application application_1600064376909_0001 to ResourceManager
[I 2020-09-14 07:52:26.374 EnterpriseGatewayApp] ApplicationID: 'application_1600064376909_0001' assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', state: SUBMITTED, 20.0 seconds after starting.
[D 2020-09-14 07:52:26.379 EnterpriseGatewayApp] 39: State: 'SUBMITTED', Host: '', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
20/09/14 07:52:26 INFO impl.YarnClientImpl: Submitted application application_1600064376909_0001
20/09/14 07:52:26 INFO yarn.Client: Application report for application_1600064376909_0001 (state: ACCEPTED)
20/09/14 07:52:26 INFO yarn.Client: 
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1600069946170
         final status: UNDEFINED
         tracking URL: http://yarn-m:8088/proxy/application_1600064376909_0001/
         user: root
20/09/14 07:52:26 INFO util.ShutdownHookManager: Shutdown hook called
20/09/14 07:52:26 INFO util.ShutdownHookManager: Deleting directory /hadoop/spark/tmp/spark-c39f9d45-c74c-45d8-8685-f03442796cd4
20/09/14 07:52:26 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5d60fe82-44b5-43af-9b93-3f5528dc9421
[D 2020-09-14 07:52:26.978 EnterpriseGatewayApp] 40: State: 'ACCEPTED', Host: 'yarn-w-0.c.internal', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
[D 2020-09-14 07:52:31.984 EnterpriseGatewayApp] Waiting for KernelID '4798307d-dc70-452e-9d0c-0c4e87cb3356' to send connection info from host 'yarn-w-0.c.internal' - retrying...
[D 2020-09-14 07:52:32.493 EnterpriseGatewayApp] 41: State: 'ACCEPTED', Host: 'yarn-w-0.c.internal', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
[D 2020-09-14 07:52:37.497 EnterpriseGatewayApp] Waiting for KernelID '4798307d-dc70-452e-9d0c-0c4e87cb3356' to send connection info from host 'yarn-w-0.c.internal' - retrying...
[D 2020-09-14 07:52:38.009 EnterpriseGatewayApp] 42: State: 'ACCEPTED', Host: 'yarn-w-0.c.internal', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
[D 2020-09-14 07:52:43.012 EnterpriseGatewayApp] Waiting for KernelID '4798307d-dc70-452e-9d0c-0c4e87cb3356' to send connection info from host 'yarn-w-0.c.internal' - retrying...
[D 2020-09-14 07:52:43.522 EnterpriseGatewayApp] 43: State: 'ACCEPTED', Host: 'yarn-w-0.c.internal', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
[D 2020-09-14 07:52:48.527 EnterpriseGatewayApp] Waiting for KernelID '4798307d-dc70-452e-9d0c-0c4e87cb3356' to send connection info from host 'yarn-w-0.c.internal' - retrying...
[W 2020-09-14 07:52:49.056 EnterpriseGatewayApp] Termination of application 'application_1600064376909_0001' failed with exception: 'Response finished with status: 405. Details: '.  Continuing...
[D 2020-09-14 07:52:54.123 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1600064376909_0001, kernel ID: 4798307d-dc70-452e-9d0c-0c4e87cb3356, state: ACCEPTED, result: <coroutine object BaseProcessProxyABC.kill at 0x7f5aeeb863b0>
/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/yarn.py:329: RuntimeWarning: coroutine 'BaseProcessProxyABC.kill' was never awaited
  self.kill()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
[D 2020-09-14 07:52:54.202 EnterpriseGatewayApp] response socket still open, close it
[E 2020-09-14 07:52:54.203 EnterpriseGatewayApp] KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' launch timeout due to: YARN resources unavailable after 43.0 seconds for app application_1600064376909_0001, launch timeout: 40.0!  Check YARN configuration.
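(For completeness, the config mounting mentioned in point 2 above looks roughly like the following sketch. The host-side paths are my understanding of Dataproc defaults and may differ; the container-side paths match the HADOOP_CONF_DIR/SPARK_CONF_DIR values in the env above; the image name/tag is a placeholder.)

# Sketch: expose the Docker host's Hadoop/Spark configuration inside the EG container.
docker run -d --network host \
  -v /etc/hadoop/conf:/opt/hadoop/etc/hadoop:ro \
  -v /etc/spark/conf:/opt/spark/conf:ro \
  elyra/enterprise-gateway:<tag>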
kevin-bates commented 4 years ago

Hi @dnks23 - thanks for the information. Given that you see the kernel's YARN application eventually reach the RUNNING state, I think one of two things is going on:

  1. The kernel launcher is not being activated (RUNNING) within the launch timeout (default 40s)
  2. The kernel launcher is not able to reach EG with its connection information to 10.132.0.25:39735

When a remote kernel is launched via EG, EG will start listening for a response containing the kernel's connection information once it discovers (via the native resource manager's API) that a host has been assigned. In the output above, this occurs at 07:52:26.978. Here you see EG enter a 5-second wait cycle for that particular launch (the wait is asynchronous, so simultaneous kernel start requests will be interleaved here as well); after the overall 40-second period has expired, it gives up.

Increasing the timeout

If your YARN cluster is heavily loaded, this could be an issue and you should increase the kernel launch timeout. The kernel launch timeout is closely tied to the request timeout since kernel starts are essentially a POST request. If you're using Notebook 6.1+ both of these values are synchronized, so setting one appropriately adjusts the other. The kernel launch timeout can be configured on the client by setting the env KERNEL_LAUNCH_TIMEOUT to the number of seconds you want to wait. I would suggest increasing it to 120 for now. The request timeout can be set either via an env (JUPYTER_GATEWAY_REQUEST_TIMEOUT) or a config/command-line option (--GatewayClient.request_timeout). Once configured, restart the notebook server on the client.
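For example, the client-side setup might look something like this (a sketch; the gateway host/port are placeholders):

# Client-side sketch: raise the kernel launch timeout and keep the gateway
# request timeout in line with it (on Notebook 6.1+ setting one adjusts the
# other; on older versions set both explicitly).
export KERNEL_LAUNCH_TIMEOUT=120
export JUPYTER_GATEWAY_REQUEST_TIMEOUT=120
jupyter lab --gateway-url=http://<eg-host>:8888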

Troubleshooting the response

To determine if the response is the issue, you should probably take a look at the kernel launcher output, which should be available in the application's stdout log (check stderr as well, but I tend to find things in stdout). In this case, it would be associated with application_1600064376909_0001. If it cannot send its connection information back to EG, there should be some indication there. We print the payload being returned and its encrypted form.
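For example, something along these lines should surface the launcher's output, assuming the YARN CLI and log aggregation are available (the application id is taken from the excerpt above):

# Dump the aggregated container logs and look for the launcher's payload/encryption
# messages, or any errors connecting back to the response address.
yarn logs -applicationId application_1600064376909_0001 | less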

10.132.0.25 is an internal IP. Does the YARN cluster have access to that IP? Is this value being configured via EG_RESPONSE_IP or is this the IP that EG determined is a local IP address?

Unfortunately, we associate a different port with each kernel launch, so the port number will vary every time (we're currently working on a single-response-address capability that should be available in EG 3.0). Another issue might therefore be that this port isn't accessible from outside the EG container. You might want to look into using EG's port-range capabilities: configure EG_PORT_RANGE (or --EnterpriseGatewayApp.port_range) and then start the EG container with that range published via -p.
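As a sketch (the range, image name and tag are illustrative only):

# Without host networking: pin EG's kernel-facing ports to a known range and
# publish that same range from the container. The range is kept at 1000 ports
# to match the EG_MIN_PORT_RANGE_SIZE value visible in the env above.
docker run -d --name enterprise-gateway \
  --env EG_PORT_RANGE=40000..41000 \
  -p 8888:8888 \
  -p 40000-41000:40000-41000 \
  elyra/enterprise-gateway:<tag>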

Upon receipt of the kernel's connection information, the logs should show a kernel-info-request, followed by a kernel-info-reply and WebSocket establishment. Once that completes, you should be able to use your kernel from the notebook.

dnks23 commented 4 years ago

Hey @kevin-bates, thanks again for the detailed explanations and your pointers. I actually started with setting EG_KERNEL_LAUNCH_TIMEOUT=60 on the client side - and that already did the trick! Kernels start up as expected and are available from the client notebook now. The logs look fine now too!

Thanks for assisting in getting this basic EG-Docker-YARN setup to work - the single-response-socket issue seems to be bypassed by using the host network, although I guess (and as you already mentioned) a proper implementation in one of the next releases would be nice!

kevin-bates commented 4 years ago

Awesome - glad to hear you're moving forward!

Do you mind summarizing what additional items you needed to configure? E.g., EG_RESPONSE_IP, EG_PROHIBITED_LOCAL_IPS, etc?

This may prove useful to others - thank you.

dnks23 commented 4 years ago

I did not need to set any other configuration. Only increasing EG_KERNEL_LAUNCH_TIMEOUT and using the docker host network did the trick. But I assume that all the other items you mentioned may need attention and are worth inspecting in case someone else runs into issues with this setup.