Closed rayh closed 5 years ago
According to the docs (https://spark.apache.org/docs/latest/running-on-kubernetes.html#namespaces), it should be configured by spark.kubernetes.namespace
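For reference, here is a minimal sketch of how a kernelspec can pass the namespace through to Spark. The exact variable names and option set are illustrative; check the actual kernel.json in the repo for the authoritative form:

```json
{
  "language": "python",
  "display_name": "Spark - Python (Kubernetes Mode)",
  "env": {
    "SPARK_OPTS": "--master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT} --deploy-mode cluster --conf spark.kubernetes.namespace=${KERNEL_NAMESPACE} --conf spark.kubernetes.container.image=elyra/kernel-spark-py:dev"
  }
}
```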
The current kernelspec in master seems to be correct: https://github.com/jupyter/enterprise_gateway/blob/master/etc/kernelspecs/spark_python_kubernetes/kernel.json
@rayh - thank you for the update. You are correct. Although Docker Hub shows tags with `:dev`, I don't believe we updated the hub images despite all the PRs we've merged. I just pulled the `elyra/enterprise-gateway:dev` image, and its kernelspecs (in its `/usr/local/share/jupyter/kernels/` directory) reference images with `:2.0.0.dev0` tags. In addition, the containers are running as user `eg-svc`, while the checked kernel images show default users of `eg-kernel`. Tag references should be `:dev` and users should be `jovyan`.
We recently changed both of these areas, along with the changes for Spark 2.4. The repo items you reference are correct; it's just that the images in Docker Hub are not.
If it helps, I'm running with good images. 😃
I will be pulling master, building EG, kernelspecs, and all images, followed by their push into docker hub. Following the push, I will post back to this issue the digest hashes for each of the pushed images.
I hope to have this done early tomorrow PST. Thank you for your patience.
Spark + R is also affected.
Thanks for this - I'm pretty excited about this approach. If I can get the holy grail of R/Python + Spark + k8s all running reliably for our team, I think they will be pretty happy.
@rayh - that's great!
Since the images were hosed, I went ahead and pushed tonight after visually inspecting a few. I'll deploy these tomorrow, but feel free to take them for a spin. Here are their digest values:
elyra/enterprise-gateway:dev - digest: sha256:682dad85ee0b328834426caf1cafef99e66e5a02b5d4a2fa6451a9ced29d1596 size: 4733
elyra/kernel-py:dev - digest: sha256:f445b660a594858ec68766f213b0b432dad036585b739967932fe7926abab47b size: 5965
elyra/kernel-spark-py:dev - digest: sha256:1f955eb135fe5708290c41def093a5f15602f862d1c04e1cebc51f8c50caeaab size: 3881
elyra/kernel-tf-py:dev - digest: sha256:ad2e519f950e30da940fe9b79206f71ccb4bacb31784c6840511a1df136188dc size: 4087
elyra/kernel-tf-gpu-py:dev - digest: sha256:62872920d088c33373f686dcb2deb4006d002548710970839382c4e442f7b45b size: 4931
elyra/kernel-r:dev - digest: sha256:6147eea4d62f04264c847ed97f11eefb1bd34189ed8831fb11964d15514deb8f size: 5544
elyra/kernel-spark-r:dev - digest: sha256:6337becd72098cca32731f8ddeb57772269606941ffbd320c0648dc30d1f0c18 size: 4095
elyra/kernel-scala:dev - digest: sha256:61dcaf3e3b921db54f22fb23db98bc4397d6e7ed0ce29f9226aa36ba79548694 size: 3256
elyra/nb2kg:dev - digest: sha256:a99e442a3ec4adab799de9a87740d46ff79d0d8f0964d30c4404a8c9f848f9f3 size: 5542
Looks like the `enterprise-gateway.yaml` file is good to go as well, in that it references the correct image tag, despite the comment.
@rayh - I've confirmed the new images. Here's a snapshot from a notebook launching the `Spark - Python (Kubernetes Mode)` kernel...
And here are the details of `docker images` on one of my nodes - in case the image ID helps:
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/elyra/kernel-tf-gpu-py dev 979b8133f1d6 15 hours ago 3.28 GB
docker.io/elyra/kernel-tf-py dev 0d7e09684127 15 hours ago 1.26 GB
docker.io/elyra/kernel-scala dev be05f0c1a26e 15 hours ago 450 MB
docker.io/elyra/kernel-spark-r dev 05d89e4a90c0 15 hours ago 928 MB
docker.io/elyra/kernel-r dev 5d72e40b035f 15 hours ago 3.57 GB
docker.io/elyra/kernel-spark-py dev 077ba6623a40 15 hours ago 662 MB
docker.io/elyra/kernel-py dev 98421ccfdc55 15 hours ago 4.64 GB
docker.io/elyra/enterprise-gateway dev 5f0fa927be71 15 hours ago 1.22 GB
Once you confirm similar results, we'll close the issue.
Thanks, and sorry for the inconvenience.
Ok, I managed to update the images in the cluster (I added `imagePullPolicy: Always` to the enterprise-gateway.yaml, and also added the kernel images to JupyterHub's continuous puller with a policy of Always).
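For anyone following along, the change is a one-line addition to the container spec in enterprise-gateway.yaml (a sketch; surrounding fields abbreviated):

```yaml
spec:
  containers:
  - name: enterprise-gateway
    image: elyra/enterprise-gateway:dev
    # Re-pull the :dev tag on every pod start instead of using a cached copy
    imagePullPolicy: Always
```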
However, I am now getting RBAC issues -
2018-12-09 23:12:19 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://spark-1544397133506-driver-svc.eliiza-dsp.svc:4040
2018-12-09 23:12:19 INFO SparkContext:54 - Added file file:///usr/local/share/jupyter/kernel-launchers/python/scripts/launch_ipykernel.py at spark://spark-1544397133506-driver-svc.eliiza-dsp.svc:7078/files/launch_ipykernel.py with timestamp 1544397139439
2018-12-09 23:12:19 INFO Utils:54 - Copying /usr/local/share/jupyter/kernel-launchers/python/scripts/launch_ipykernel.py to /var/data/spark-ac98e62c-ba95-458d-bc16-7138d77e0bab/spark-fb3fc412-42f8-4604-a062-6ccb35cdb190/userFiles-5ba3ca97-8452-42a8-9170-c3cd1355e02a/launch_ipykernel.py
2018-12-09 23:12:20 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/eliiza-dsp/pods/jovyan-aed71ed6-1823-40df-9b70-72c717d61b3a-1544397133169-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "jovyan-aed71ed6-1823-40df-9b70-72c717d61b3a-1544397133169-driver" is forbidden: User "system:serviceaccount:eliiza-dsp:default" cannot get pods in the namespace "eliiza-dsp".
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:407)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
at scala.Option.map(Option.scala:146)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
... 13 more
2018-12-09 23:12:20 INFO AbstractConnector:318 - Stopped Spark@fec2cd3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-12-09 23:12:20 INFO SparkUI:54 - Stopped Spark web UI at http://spark-1544397133506-driver-svc.eliiza-dsp.svc:4040
2018-12-09 23:12:20 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-12-09 23:12:20 INFO MemoryStore:54 - MemoryStore cleared
2018-12-09 23:12:20 INFO BlockManager:54 - BlockManager stopped
2018-12-09 23:12:20 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-12-09 23:12:20 WARN MetricsSystem:66 - Stopping a MetricsSystem that is not running
2018-12-09 23:12:20 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-12-09 23:12:20 INFO SparkContext:54 - Successfully stopped SparkContext
Should it be trying to use the "default" account?
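(Aside: unless configured otherwise, Spark's driver pod does run under the namespace's `default` service account. It can be pointed at a dedicated account via spark-submit configuration; the account name `spark` below is illustrative:)

```
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
```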
Ok, I just had to create the RBAC role for spark (see below) and now I get:
The extra cluster role & binding for spark:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  # Referenced by EG_KERNEL_CLUSTER_ROLE below
  name: spark-role
  labels:
    app: spark
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: spark-binding
  labels:
    app: spark
subjects:
- kind: ServiceAccount
  name: default
  namespace: eliiza-dsp
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
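If cluster-wide permissions are broader than you want to grant, a namespace-scoped alternative with the same rules (assuming everything runs in eliiza-dsp) would look like this sketch:

```yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: spark-role
  namespace: eliiza-dsp
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: spark-binding
  namespace: eliiza-dsp
subjects:
- kind: ServiceAccount
  name: default
  namespace: eliiza-dsp
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```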
This is great news @rayh - thanks for digging deeper and making Enterprise Gateway better! I have a couple comments and a question regarding your last comments.
Be careful with `imagePullPolicy: Always` - at least for the kernel images - because pulling the kernel images counts against the initial start-up time on each "virgin" node, and will surely time out the kernel creation request. However, if this policy is for the enterprise-gateway image, then by all means that makes total sense, so please contribute that change back with a pull request, if you don't mind.
I plan on making a doc sweep tomorrow or Tuesday and will add this information. Please let me know if there are items in the docs that are missing or too hard to find. In my pass yesterday, I felt like we need to come out and give useful commands early on.
Yes, this is for the enterprise-gateway image. JupyterHub's continuous puller just exists as a daemonset and periodically pulls the image so the latest is always available - see https://zero-to-jupyterhub.readthedocs.io/en/stable/optimization.html?highlight=prepuller
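A minimal sketch of that prepuller pattern (all names here are illustrative, not what zero-to-jupyterhub actually deploys): a DaemonSet whose pods do nothing except force each node to hold the kernel image locally.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-image-puller
spec:
  selector:
    matchLabels:
      app: kernel-image-puller
  template:
    metadata:
      labels:
        app: kernel-image-puller
    spec:
      containers:
      - name: pull-kernel-spark-py
        image: elyra/kernel-spark-py:dev
        imagePullPolicy: Always
        # The container only needs to exist so the kubelet pulls the image;
        # sleeping keeps it alive without consuming meaningful resources.
        command: ["sleep", "infinity"]
```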
No worries, luckily it didn't need much. I do need to review how we're using namespaces in general.
The main thing I did that deviated from your instructions was creating my own notebook image, as I was using a newer version of dockerhub (and so the hub integration didn't work). I also added a couple of other extensions and ssh:
FROM jupyter/minimal-notebook:latest
# Do the pip installs as the unprivileged notebook user
USER $NB_USER
ADD jupyter_notebook_config.py /etc/jupyter/jupyter_notebook_config.py
# Install NB2KG
# RUN pip install --upgrade nb2kg && \
RUN pip install "git+https://github.com/jupyter-incubator/nb2kg.git#egg=nb2kg" && \
jupyter serverextension enable --py nb2kg --sys-prefix
# Git support: https://github.com/jupyterlab/jupyterlab-git
RUN jupyter labextension install @jupyterlab/git && \
pip install jupyterlab-git && \
jupyter serverextension enable --py jupyterlab_git
# HTML support: https://github.com/mflevine/jupyterlab_html
RUN jupyter labextension install @mflevine/jupyterlab_html
# Latex support: https://github.com/jupyterlab/jupyterlab-latex
RUN pip install jupyterlab_latex && \
jupyter labextension install @jupyterlab/latex
USER root
RUN apt update && apt install -y ssh
USER $NB_USER
This exists as eliiza/kernel-gateway-notebook - but I'll also publish the Dockerfile to GitHub.
Pushed the notebook here: https://github.com/eliiza/kernel-gateway-notebook
@rayh - are you planning on updating enterprise-gateway.yaml with the `imagePullPolicy`?
The reason I'm asking is that we may be cutting a beta release soon, and I'd like to have this in for that release.
When you said above...
I was using a newer version of dockerhub (and so the hub integration didnt work)
Did you mean JupyterHub?
Ahem, yes, I meant JupyterHub (0.7) - the hub extension in the nb2kg image seemed to be too old (and perhaps I was using the wrong tag at the time)
I'll submit a PR for the `imagePullPolicy`.
I'm going to close this issue since PR #525 has been merged and the other items are being addressed/discussed in other issues.
Thank you.
The error from enterprise-gateway is: