jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

python-spark kernel using dev tag fails to start in k8s #520

Closed rayh closed 5 years ago

rayh commented 5 years ago

The error from enterprise-gateway is:

[D 2018-12-07 01:27:07.384 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: spark_python_kubernetes, kernel_username: jovyan
[D 2018-12-07 01:27:07.387 EnterpriseGatewayApp] Instantiating kernel 'Spark - Python (Kubernetes Mode)' with process proxy: enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy
[D 2018-12-07 01:27:07.387 EnterpriseGatewayApp] Response socket launched on 100.96.3.46, port: 50152 using 5.0s timeout
[D 2018-12-07 01:27:07.388 EnterpriseGatewayApp] Starting kernel: [u'/usr/local/share/jupyter/kernels/spark_python_kubernetes/bin/run.sh', u'/home/eg-svc/.local/share/jupyter/runtime/kernel-01812adb-0fa6-477a-9566-1c69858c3f4d.json', u'--RemoteProcessProxy.response-address', u'100.96.3.46:50152', u'--RemoteProcessProxy.spark-context-initialization-mode', u'lazy']
[D 2018-12-07 01:27:07.388 EnterpriseGatewayApp] Launching kernel: Spark - Python (Kubernetes Mode) with command: [u'/usr/local/share/jupyter/kernels/spark_python_kubernetes/bin/run.sh', u'/home/eg-svc/.local/share/jupyter/runtime/kernel-01812adb-0fa6-477a-9566-1c69858c3f4d.json', u'--RemoteProcessProxy.response-address', u'100.96.3.46:50152', u'--RemoteProcessProxy.spark-context-initialization-mode', u'lazy']
[W 2018-12-07 01:27:07.388 EnterpriseGatewayApp] Shared namespace has been configured.  All kernels will reside in EG namespace: eliiza-dsp
[D 2018-12-07 01:27:07.388 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'KUBE_LEGO_NGINX_PORT_8080_TCP_PROTO': 'tcp', 'PROXY_PUBLIC_SERVICE_PORT_HTTP': '80', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_PROTO': 'tcp', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_ADDR': '100.66.187.119', 'KUBE_LEGO_NGINX_PORT_8080_TCP': 'tcp://100.70.32.215:8080', 'PROXY_HTTP_SERVICE_PORT': '8000', 'PROXY_PUBLIC_PORT_443_TCP_ADDR': '100.65.119.179', 'KERNEL_EXECUTOR_IMAGE': u'elyra/kernel-spark-py:2.0.0.dev0', 'EG_CULL_INTERVAL': '30', 'HOME': '/home/eg-svc', 'PROXY_PUBLIC_SERVICE_PORT': '80', 'EG_SSH_PORT': '2122', 'HUB_SERVICE_PORT': '8081', 'LANG': 'C.UTF-8', 'PROXY_PUBLIC_PORT': 'tcp://100.65.119.179:80', 'PROXY_HTTP_PORT': 'tcp://100.66.62.200:8000', 'PROXY_HTTP_PORT_8000_TCP': 'tcp://100.66.62.200:8000', 'ENTERPRISE_GATEWAY_SERVICE_PORT_HTTP': '8888', 'KUBERNETES_SERVICE_HOST': '100.64.0.1', 'KERNEL_GATEWAY': '1', 'PROXY_PUBLIC_PORT_80_TCP': 'tcp://100.65.119.179:80', 'JAVA_HOME': '/usr/lib/jvm/java-1.8-openjdk', u'LAUNCH_OPTS': u'', 'EG_CULL_CONNECTED': 'True', 'KG_IP': '0.0.0.0', 'PROXY_API_SERVICE_HOST': '100.70.170.232', 'PROXY_HTTP_PORT_8000_TCP_ADDR': '100.66.62.200', 'KUBE_LEGO_NGINX_SERVICE_PORT': '8080', 'PYTHONPATH': '/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip', 'ENTERPRISE_GATEWAY_SERVICE_PORT': '8888', 'EG_NAMESPACE': 'eliiza-dsp', 'PROXY_API_PORT_8001_TCP_ADDR': '100.70.170.232', 'PROXY_PUBLIC_SERVICE_HOST': '100.65.119.179', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_PORT': '8888', 'PROXY_API_SERVICE_PORT': '8001', 'PROXY_API_PORT': 'tcp://100.70.170.232:8001', 'KUBERNETES_PORT': 'tcp://100.64.0.1:443', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_KERNEL_CLUSTER_ROLE': 'kernel-controller', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'KERNEL_SERVICE_ACCOUNT_NAME': 'default', 'ENTERPRISE_GATEWAY_PORT': 'tcp://100.66.187.119:8888', u'SPARK_OPTS': u'--master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT} --deploy-mode cluster --name 
${KERNEL_USERNAME}-${KERNEL_ID} --kubernetes-namespace ${KERNEL_NAMESPACE} --conf spark.kubernetes.driver.label.app=enterprise-gateway --conf spark.kubernetes.driver.label.kernel_id=${KERNEL_ID} --conf spark.kubernetes.driver.label.component=kernel --conf spark.kubernetes.executor.label.app=enterprise-gateway --conf spark.kubernetes.executor.label.kernel_id=${KERNEL_ID} --conf spark.kubernetes.executor.label.component=kernel --conf spark.kubernetes.driver.docker.image=${KERNEL_IMAGE} --conf spark.kubernetes.executor.docker.image=${KERNEL_EXECUTOR_IMAGE} --conf spark.kubernetes.authenticate.driver.serviceAccountName=${KERNEL_SERVICE_ACCOUNT_NAME} --conf spark.kubernetes.submission.waitAppCompletion=false', 'HUB_PORT_8081_TCP_PORT': '8081', 'KUBERNETES_PORT_443_TCP': 'tcp://100.64.0.1:443', 'PROXY_PUBLIC_PORT_80_TCP_ADDR': '100.65.119.179', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'EG_CULL_IDLE_TIMEOUT': '600', 'KG_PORT_RETRIES': '0', 'HOSTNAME': 'enterprise-gateway-668f6cffdb-zdkpt', 'PROXY_HTTP_PORT_8000_TCP_PORT': '8000', 'KERNEL_ID': u'01812adb-0fa6-477a-9566-1c69858c3f4d', 'SPARK_HOME': u'/opt/spark', 'EG_ENABLE_TUNNELING': 'False', 'KERNEL_IMAGE': u'elyra/kernel-spark-py:2.0.0.dev0', 'ENTERPRISE_GATEWAY_SERVICE_HOST': '100.66.187.119', 'PATH': '/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/jvm/java-1.8-openjdk/jre/bin:/usr/lib/jvm/java-1.8-openjdk/bin', 'PROXY_PUBLIC_PORT_443_TCP': 'tcp://100.65.119.179:443', 'KUBE_LEGO_NGINX_PORT_8080_TCP_PORT': '8080', 'KUBERNETES_SERVICE_PORT': '443', 'KUBE_LEGO_NGINX_PORT_8080_TCP_ADDR': '100.70.32.215', 'JAVA_ALPINE_VERSION': '8.181.13-r0', 'PROXY_HTTP_PORT_8000_TCP_PROTO': 'tcp', 'SHLVL': '3', 'EG_SHARED_NAMESPACE': 'True', 'EG_IMPERSONATION_ENABLED': 'False', 'KUBE_LEGO_NGINX_PORT': 'tcp://100.70.32.215:8080', 'PROXY_PUBLIC_PORT_443_TCP_PORT': '443', 'KG_PORT': '8888', 'EG_KERNEL_LAUNCH_TIMEOUT': '60', 'KERNEL_LANGUAGE': u'python', 'KERNEL_NAMESPACE': 'eliiza-dsp', 
'ENTERPRISE_GATEWAY_PORT_8888_TCP': 'tcp://100.66.187.119:8888', 'EG_LOG_LEVEL': 'DEBUG', 'PROXY_HTTP_SERVICE_HOST': '100.66.62.200', 'HUB_PORT_8081_TCP_PROTO': 'tcp', 'PROXY_API_PORT_8001_TCP_PROTO': 'tcp', u'KERNEL_USERNAME': u'jovyan', 'HUB_SERVICE_HOST': '100.69.114.161', 'PROXY_API_PORT_8001_TCP_PORT': '8001', 'KUBERNETES_PORT_443_TCP_ADDR': '100.64.0.1', 'EG_TUNNELING_ENABLED': 'False', 'PROXY_PUBLIC_SERVICE_PORT_HTTPS': '443', '_': '/usr/bin/jupyter', 'HUB_PORT_8081_TCP_ADDR': '100.69.114.161', 'HUB_PORT': 'tcp://100.69.114.161:8081', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'JAVA_VERSION': '8u181', 'HUB_PORT_8081_TCP': 'tcp://100.69.114.161:8081', 'PROXY_API_PORT_8001_TCP': 'tcp://100.70.170.232:8001', 'PROXY_PUBLIC_PORT_80_TCP_PROTO': 'tcp', 'KUBE_LEGO_NGINX_SERVICE_HOST': '100.70.32.215', 'PWD': '/usr/local/share/jupyter', 'PROXY_PUBLIC_PORT_80_TCP_PORT': '80', 'PROXY_PUBLIC_PORT_443_TCP_PROTO': 'tcp', 'EG_KERNEL_WHITELIST': "['r_kubernetes','python_kubernetes','python_tf_kubernetes','scala_kubernetes','spark_r_kubernetes','spark_python_kubernetes','spark_scala_kubernetes']"}
[I 2018-12-07 01:27:07.393 EnterpriseGatewayApp] KubernetesProcessProxy: kernel launched. Kernel image: elyra/kernel-spark-py:2.0.0.dev0, KernelID: 01812adb-0fa6-477a-9566-1c69858c3f4d, cmd: '[u'/usr/local/share/jupyter/kernels/spark_python_kubernetes/bin/run.sh', u'/home/eg-svc/.local/share/jupyter/runtime/kernel-01812adb-0fa6-477a-9566-1c69858c3f4d.json', u'--RemoteProcessProxy.response-address', u'100.96.3.46:50152', u'--RemoteProcessProxy.spark-context-initialization-mode', u'lazy']'

Starting IPython kernel for Spark in Kubernetes mode on behalf of user jovyan

+ eval exec /opt/spark/bin/spark-submit '--master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT} --deploy-mode cluster --name ${KERNEL_USERNAME}-${KERNEL_ID} --kubernetes-namespace ${KERNEL_NAMESPACE} --conf spark.kubernetes.driver.label.app=enterprise-gateway --conf spark.kubernetes.driver.label.kernel_id=${KERNEL_ID} --conf spark.kubernetes.driver.label.component=kernel --conf spark.kubernetes.executor.label.app=enterprise-gateway --conf spark.kubernetes.executor.label.kernel_id=${KERNEL_ID} --conf spark.kubernetes.executor.label.component=kernel --conf spark.kubernetes.driver.docker.image=${KERNEL_IMAGE} --conf spark.kubernetes.executor.docker.image=${KERNEL_EXECUTOR_IMAGE} --conf spark.kubernetes.authenticate.driver.serviceAccountName=${KERNEL_SERVICE_ACCOUNT_NAME} --conf spark.kubernetes.submission.waitAppCompletion=false' local:///usr/local/share/jupyter/kernels/spark_python_kubernetes/scripts/launch_ipykernel.py '' /home/eg-svc/.local/share/jupyter/runtime/kernel-01812adb-0fa6-477a-9566-1c69858c3f4d.json --RemoteProcessProxy.response-address 100.96.3.46:50152 --RemoteProcessProxy.spark-context-initialization-mode lazy
++ exec /opt/spark/bin/spark-submit --master k8s://https://100.64.0.1:443 --deploy-mode cluster --name jovyan-01812adb-0fa6-477a-9566-1c69858c3f4d --kubernetes-namespace eliiza-dsp --conf spark.kubernetes.driver.label.app=enterprise-gateway --conf spark.kubernetes.driver.label.kernel_id=01812adb-0fa6-477a-9566-1c69858c3f4d --conf spark.kubernetes.driver.label.component=kernel --conf spark.kubernetes.executor.label.app=enterprise-gateway --conf spark.kubernetes.executor.label.kernel_id=01812adb-0fa6-477a-9566-1c69858c3f4d --conf spark.kubernetes.executor.label.component=kernel --conf spark.kubernetes.driver.docker.image=elyra/kernel-spark-py:2.0.0.dev0 --conf spark.kubernetes.executor.docker.image=elyra/kernel-spark-py:2.0.0.dev0 --conf spark.kubernetes.authenticate.driver.serviceAccountName=default --conf spark.kubernetes.submission.waitAppCompletion=false local:///usr/local/share/jupyter/kernels/spark_python_kubernetes/scripts/launch_ipykernel.py /home/eg-svc/.local/share/jupyter/runtime/kernel-01812adb-0fa6-477a-9566-1c69858c3f4d.json --RemoteProcessProxy.response-address 100.96.3.46:50152 --RemoteProcessProxy.spark-context-initialization-mode lazy
Error: Unrecognized option: --kubernetes-namespace

[D 2018-12-07 01:27:07.998 EnterpriseGatewayApp] 1: Waiting to connect to k8s pod in namespace 'eliiza-dsp'. Name: '', Status: 'None', Pod IP: 'None', KernelID: '01812adb-0fa6-477a-9566-1c69858c3f4d'
[D 2018-12-07 01:27:08.601 EnterpriseGatewayApp] 2: Waiting to connect to k8s pod in namespace 'eliiza-dsp'. Name: '', Status: 'None', Pod IP: 'None', KernelID: '01812adb-0fa6-477a-9566-1c69858c3f4d'
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

[D 2018-12-07 01:27:09.204 EnterpriseGatewayApp] 3: Waiting to connect to k8s pod in namespace 'eliiza-dsp'. Name: '', Status: 'None', Pod IP: 'None', KernelID: '01812adb-0fa6-477a-9566-1c69858c3f4d'
[E 2018-12-07 01:27:09.204 EnterpriseGatewayApp] Error occurred during launch of KernelID: 01812adb-0fa6-477a-9566-1c69858c3f4d.  Check Enterprise Gateway log for more information.
rayh commented 5 years ago

According to the docs (https://spark.apache.org/docs/latest/running-on-kubernetes.html#namespaces), it should be configured by spark.kubernetes.namespace

The current kernelspec in master seems to be correct: https://github.com/jupyter/enterprise_gateway/blob/master/etc/kernelspecs/spark_python_kubernetes/kernel.json
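For reference, a sketch of what the fix looks like in the kernelspec's SPARK_OPTS: Spark 2.4's spark-submit no longer accepts the old --kubernetes-namespace flag, so the namespace moves into a --conf property (property name per the Spark docs linked above; the remaining options are abbreviated here and match the ones in the log):

```shell
# Old (rejected by Spark 2.4's spark-submit):
#   --kubernetes-namespace ${KERNEL_NAMESPACE}
# New (per the "Running on Kubernetes" docs):
SPARK_OPTS="--master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT} \
  --deploy-mode cluster \
  --name ${KERNEL_USERNAME}-${KERNEL_ID} \
  --conf spark.kubernetes.namespace=${KERNEL_NAMESPACE} \
  ..."
```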

kevin-bates commented 5 years ago

@rayh - thank you for the update. You are correct. Although Docker Hub shows tags with :dev, I don't believe we updated the hub images despite all the PRs we've merged. I just pulled the elyra/enterprise-gateway:dev image, and its kernelspecs (in its /usr/local/share/jupyter/kernels/ directory) reference images with :2.0.0.dev0 tags. In addition, the containers are running as user eg-svc, while the kernel images I checked have default users of eg-kernel. Tag references should be :dev and users should be jovyan.

We recently changed both of these areas, along with the move to Spark 2.4. The repo items you reference are correct; it's just that the images in Docker Hub are not.

If it helps, I'm running with good images. 😃

I will be pulling master, building EG, the kernelspecs, and all images, then pushing them to Docker Hub. Following the push, I will post back to this issue with the digest hashes for each of the pushed images.

I hope to have this done early tomorrow PST. Thank you for your patience.

rayh commented 5 years ago

Spark + R is also affected.

Thanks for this - I'm pretty excited about this approach. If I can get the holy grail of R/Python + Spark + k8s all running reliably for our team, I think they'll be pretty happy.

kevin-bates commented 5 years ago

@rayh - that's great!

Since the images were hosed, I went ahead and pushed tonight after visually inspecting a few. I'll deploy these tomorrow, but feel free to take them for a spin. Here are their digest values:

elyra/enterprise-gateway:dev - digest: sha256:682dad85ee0b328834426caf1cafef99e66e5a02b5d4a2fa6451a9ced29d1596 size: 4733
elyra/kernel-py:dev - digest: sha256:f445b660a594858ec68766f213b0b432dad036585b739967932fe7926abab47b size: 5965
elyra/kernel-spark-py:dev - digest: sha256:1f955eb135fe5708290c41def093a5f15602f862d1c04e1cebc51f8c50caeaab size: 3881
elyra/kernel-tf-py:dev - digest: sha256:ad2e519f950e30da940fe9b79206f71ccb4bacb31784c6840511a1df136188dc size: 4087
elyra/kernel-tf-gpu-py:dev - digest: sha256:62872920d088c33373f686dcb2deb4006d002548710970839382c4e442f7b45b size: 4931
elyra/kernel-r:dev - digest: sha256:6147eea4d62f04264c847ed97f11eefb1bd34189ed8831fb11964d15514deb8f size: 5544
elyra/kernel-spark-r:dev - digest: sha256:6337becd72098cca32731f8ddeb57772269606941ffbd320c0648dc30d1f0c18 size: 4095
elyra/kernel-scala:dev - digest: sha256:61dcaf3e3b921db54f22fb23db98bc4397d6e7ed0ce29f9226aa36ba79548694 size: 3256
elyra/nb2kg:dev - digest: sha256:a99e442a3ec4adab799de9a87740d46ff79d0d8f0964d30c4404a8c9f848f9f3 size: 5542
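If you want to be certain you're running exactly these builds (rather than whatever :dev resolves to later), you can pull by digest instead of by tag - standard Docker syntax, shown here with the enterprise-gateway digest from the list above:

```shell
docker pull elyra/enterprise-gateway@sha256:682dad85ee0b328834426caf1cafef99e66e5a02b5d4a2fa6451a9ced29d1596
```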

Looks like the enterprise-gateway.yaml file is good to go as well - in that it references the correct image tag - despite the comment.

kevin-bates commented 5 years ago

@rayh - I've confirmed the new images. Here's a snapshot from a notebook launching the Spark - Python (Kubernetes Mode) kernel...

[screenshot: notebook launching the Spark - Python (Kubernetes Mode) kernel, 2018-12-07 11:38 AM]

And here are the details of the docker images on one of my nodes - in case the image IDs help:

REPOSITORY                                     TAG                 IMAGE ID            CREATED             SIZE
docker.io/elyra/kernel-tf-gpu-py               dev                 979b8133f1d6        15 hours ago        3.28 GB
docker.io/elyra/kernel-tf-py                   dev                 0d7e09684127        15 hours ago        1.26 GB
docker.io/elyra/kernel-scala                   dev                 be05f0c1a26e        15 hours ago        450 MB
docker.io/elyra/kernel-spark-r                 dev                 05d89e4a90c0        15 hours ago        928 MB
docker.io/elyra/kernel-r                       dev                 5d72e40b035f        15 hours ago        3.57 GB
docker.io/elyra/kernel-spark-py                dev                 077ba6623a40        15 hours ago        662 MB
docker.io/elyra/kernel-py                      dev                 98421ccfdc55        15 hours ago        4.64 GB
docker.io/elyra/enterprise-gateway             dev                 5f0fa927be71        15 hours ago        1.22 GB

Once you confirm similar results, we'll close the issue.

Thanks, and sorry for the inconvenience.

rayh commented 5 years ago

Ok, I managed to update the images in the cluster: I added imagePullPolicy: Always to the enterprise-gateway.yaml, and also added the kernel images to JupyterHub's continuous puller with a policy of Always.
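For anyone following along, the change amounts to something like this in the container spec inside enterprise-gateway.yaml (a sketch - the exact placement and container name depend on your copy of the deployment file):

```yaml
spec:
  containers:
    - name: enterprise-gateway      # assumed container name
      image: elyra/enterprise-gateway:dev
      imagePullPolicy: Always
```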

However, I am now getting RBAC issues -

    2018-12-09 23:12:19 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://spark-1544397133506-driver-svc.eliiza-dsp.svc:4040
2018-12-09 23:12:19 INFO  SparkContext:54 - Added file file:///usr/local/share/jupyter/kernel-launchers/python/scripts/launch_ipykernel.py at spark://spark-1544397133506-driver-svc.eliiza-dsp.svc:7078/files/launch_ipykernel.py with timestamp 1544397139439
2018-12-09 23:12:19 INFO  Utils:54 - Copying /usr/local/share/jupyter/kernel-launchers/python/scripts/launch_ipykernel.py to /var/data/spark-ac98e62c-ba95-458d-bc16-7138d77e0bab/spark-fb3fc412-42f8-4604-a062-6ccb35cdb190/userFiles-5ba3ca97-8452-42a8-9170-c3cd1355e02a/launch_ipykernel.py
2018-12-09 23:12:20 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/eliiza-dsp/pods/jovyan-aed71ed6-1823-40df-9b70-72c717d61b3a-1544397133169-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "jovyan-aed71ed6-1823-40df-9b70-72c717d61b3a-1544397133169-driver" is forbidden: User "system:serviceaccount:eliiza-dsp:default" cannot get pods in the namespace "eliiza-dsp".
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:407)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
    ... 13 more
2018-12-09 23:12:20 INFO  AbstractConnector:318 - Stopped Spark@fec2cd3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-12-09 23:12:20 INFO  SparkUI:54 - Stopped Spark web UI at http://spark-1544397133506-driver-svc.eliiza-dsp.svc:4040
2018-12-09 23:12:20 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-12-09 23:12:20 INFO  MemoryStore:54 - MemoryStore cleared
2018-12-09 23:12:20 INFO  BlockManager:54 - BlockManager stopped
2018-12-09 23:12:20 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2018-12-09 23:12:20 WARN  MetricsSystem:66 - Stopping a MetricsSystem that is not running
2018-12-09 23:12:20 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-12-09 23:12:20 INFO  SparkContext:54 - Successfully stopped SparkContext

Should it be trying to use the "default" account?

rayh commented 5 years ago

Ok, I just had to create the RBAC role for spark (see below) and now I get:

[screenshot]

The extra cluster role & binding for spark:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  # Referenced by EG_KERNEL_CLUSTER_ROLE below
  name: spark-role
  labels:
    app: spark
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: spark-binding
  labels:
    app: spark
subjects:
  - kind: ServiceAccount
    name: default
    namespace: eliiza-dsp
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
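To verify the binding took effect, a quick check with kubectl against the cluster (spark-rbac.yaml is a hypothetical local file holding the ClusterRole and ClusterRoleBinding above; the namespace and service account match the ones in the error message):

```shell
kubectl apply -f spark-rbac.yaml
kubectl auth can-i get pods \
  --as=system:serviceaccount:eliiza-dsp:default \
  -n eliiza-dsp
# expect "yes" once the binding is in place
```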
kevin-bates commented 5 years ago

This is great news @rayh - thanks for digging deeper and making Enterprise Gateway better! I have a couple of comments and a question regarding your updates.

  1. We purposely didn't want an imagePullPolicy - at least for the kernel images - because pulling the kernel images counts against the initial start-up time for each "virgin" node - and will surely time out the kernel creation request. However, if this policy is for the enterprise-gateway image, then by all means it makes total sense, so please contribute that change back with a pull request, if you don't mind.
  2. The JupyterHub continuous-puller policy sounds interesting (and perfect) for the kernel images. Can you describe what you actually did and how it works (or point to the pertinent section of the docs)? This sounds like a useful tip for our docs as well.
  3. Your RBAC issues arise because you're using a bring-your-own-namespace configuration. Although I had made some code-based RBAC changes when we moved to Spark 2.4, I failed to check things out with BYO namespace - I'm sorry. I have just gone and reproduced the issue and confirmed your RBAC script (substituting my custom namespace). We should update our docs with this information.

I plan on making a doc sweep tomorrow or Tuesday and will add this information. Please let me know if there are items in the docs that are missing or too hard to find. In my pass yesterday, I felt like we need to come out and give useful commands early on.

rayh commented 5 years ago
  1. Yes, this is for the enterprise-gateway image.

  2. JupyterHub's continuous puller just runs as a DaemonSet and periodically pulls the images so that the latest are always available - see https://zero-to-jupyterhub.readthedocs.io/en/stable/optimization.html?highlight=prepuller

  3. No worries, luckily it didn't need much. I do need to review how we're using namespaces in general.
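For item 2, the relevant Helm values for the zero-to-jupyterhub chart look roughly like this - the shape below is approximate and varies by chart version, so consult the optimization docs linked above before copying it:

```yaml
prePuller:
  continuous:
    enabled: true        # DaemonSet that keeps pulling images on every node
  extraImages:           # kernel images to pre-pull (entry key is arbitrary)
    kernelSparkPy:
      name: elyra/kernel-spark-py
      tag: dev
```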

The main thing I did that deviated from your instructions was creating my own notebook image, as I was using a newer version of dockerhub (and so the hub integration didn't work). I also added a couple of other extensions and ssh:

FROM jupyter/minimal-notebook:latest

# Do the pip installs as the unprivileged notebook user
USER $NB_USER

ADD jupyter_notebook_config.py /etc/jupyter/jupyter_notebook_config.py

# Install NB2KG
# RUN pip install --upgrade nb2kg && \
RUN pip install "git+https://github.com/jupyter-incubator/nb2kg.git#egg=nb2kg" && \
    jupyter serverextension enable --py nb2kg --sys-prefix

# Git support: https://github.com/jupyterlab/jupyterlab-git
RUN jupyter labextension install @jupyterlab/git && \
  pip install jupyterlab-git && \
  jupyter serverextension enable --py jupyterlab_git

# HTML support: https://github.com/mflevine/jupyterlab_html
RUN jupyter labextension install @mflevine/jupyterlab_html

# Latex support: https://github.com/jupyterlab/jupyterlab-latex
RUN pip install jupyterlab_latex && \
  jupyter labextension install @jupyterlab/latex

USER root

RUN apt update && apt install -y ssh

USER $NB_USER

This exists as eliiza/kernel-gateway-notebook - but I'll also publish the Dockerfile to GitHub.

rayh commented 5 years ago

Pushed the notebook here: https://github.com/eliiza/kernel-gateway-notebook

kevin-bates commented 5 years ago

@rayh - are you planning on updating enterprise-gateway.yaml with the imagePullPolicy?

The reason I'm asking is that we may be cutting a beta release soon, and I'd like to have this change in for that release.

When you said above...

I was using a newer version of dockerhub (and so the hub integration didnt work)

Did you mean jupyter hub?

rayh commented 5 years ago

Ahem, yes, I meant JupyterHub (0.7) - the hub extension in the nb2kg image seemed to be too old (and perhaps I was using the wrong tag at the time)

I'll submit a PR for the imagePullPolicy

kevin-bates commented 5 years ago

I'm going to close this issue since PR #525 has been merged and the other items are being addressed/discussed in other issues.

Thank you.