Open tamis-laan opened 1 year ago
We are running the spark k8s operator in order to process data using the yaml spec in production. This works great, but we also want to do exploratory data analyses using Jupyter notebooks. Is this possible using the spark k8s operator?
@tamis-laan you can do it with the https://github.com/jupyter-server/enterprise_gateway backend for Jupyter. Side note: enterprise-gateway does run on k8s as a backend for your JupyterHub, but it does not follow k8s best practices. The kernel pods do NOT start a service. I have not been able to get k8s service meshes (istio in my case) running yet. If that's not a requirement for you, you're good to go. If you use the istio service mesh and are able to fix it - I'm all ears. :)
Here is my enterprise-gateway istio issue for reference: https://github.com/jupyter-server/enterprise_gateway/issues/1168
I discovered Spark also allows for executing jobs directly on kubernetes: https://spark.apache.org/docs/latest/running-on-kubernetes.html
When you use spark-submit and literally point Spark at your kubernetes API server (k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>), it will start workers there as pods. Thus, when you run JupyterHub in your cluster, it should be possible to use pyspark to start jobs on your cluster directly.
So I'm not sure how the jupyter enterprise gateway differs from this setup? Also, it says it doesn't provide JupyterHub, so it's not possible to have multiple users with multiple notebooks. Which one is better/preferred?
JupyterHub extends Jupyter notebooks (see https://zero-to-jupyterhub.readthedocs.io/en/stable/_images/architecture.png).
Thus, JupyterHub starts a JupyterLab or plain Jupyter Notebook for each user within a pod and does some management around that. A Jupyter Notebook can have a remote backend (https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#using-a-gateway-server-for-kernel-management), and enterprise-gateway is an implementation of that. enterprise-gateway runs on various resource managers, including kubernetes. It acts as some sort of operator that spawns your kernels as pods in kubernetes (allowing for horizontal scaling of your kernel, e.g. the Python runtime).
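As a side note, pointing a notebook server at a remote kernel gateway such as enterprise-gateway is mostly a matter of client configuration. A minimal sketch, assuming the gateway URL below (a placeholder in-cluster address; replace it with wherever your gateway is actually exposed):

```python
# jupyter_notebook_config.py -- minimal sketch, not a drop-in config.
# The URL is an assumed in-cluster service address for enterprise-gateway;
# replace it with the address of your actual gateway deployment.
c.GatewayClient.url = "http://enterprise-gateway.enterprise-gateway.svc.cluster.local:8888"

# Optional: raise the request timeout, since spawning a kernel pod
# (or a Spark driver) can take noticeably longer than a local kernel.
c.GatewayClient.request_timeout = 120.0
```

The same URL can also be passed on the command line with --gateway-url when starting JupyterLab or the notebook server.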
> I discovered Spark also allows for executing jobs directly on kubernetes: https://spark.apache.org/docs/latest/running-on-kubernetes.html
Yes. Upstream Apache Spark can spawn a driver, which in turn spawns N executors. The executors run your code. However, they do not have an interactive mode. Jupyter Notebooks are interactive in that they spawn a long-lived runtime rather than running a one-off script.
> When you use spark-submit and you literally point spark at your kubernetes API server k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> it will start workers there as pods. Thus when you run jupyterhub in your cluster it should be possible using pyspark to start jobs on your cluster directly.
See my argument regarding interactivity. AFAIU it won't work.
However, if you want to run individual Spark jobs from your Jupyter Notebook (instead of running your Jupyter Notebook kernel inside Spark), check out: https://github.com/TIBCOSoftware/snappy-on-k8s/blob/master/charts/jupyter-with-spark/README.md
@tahesse Thanks for providing a starting point! I'm curious as to how the enterprise gateway kernel would work with the operator though? I understand that enterprise gateway would allow us to run the Jupyter kernel in a k8s cluster, but I'm curious as to how this would enable us to submit jobs using the operator? Would the kernel generate job manifests?
Following -- looking forward to this. How do you connect the operator when I want to run this at large scale with many users? Zero-to-jupyterhub seems like a good fit. Can I execute the code in Jupyter in cluster mode as well? Thanks.
> @tahesse Thanks for providing a starting point! I'm curious as to how the enterprise gateway kernel would work with the operator though? I understand that enterprise gateway would allow us to run the Jupyter kernel in a k8s cluster, but I'm curious as to how this would enable us to submit jobs using the operator? Would the kernel generate job manifests?
@Shrinjay WDYM with job manifests? AFAIU enterprise-gateway is an operator itself. Your JupyterHub or JupyterLab (frontend-wise) communicates with enterprise-gateway if configured properly. In the operator kernel-launchers (here: https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/operators/scripts/launch_custom_resource.py#L66-L68) it loads the declaration template (https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/operators/scripts/sparkoperator.k8s.io-v1beta2.yaml.j2), which is then submitted to kubernetes. The Spark job spawns a jupyter kernel (https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/python/scripts/launch_ipykernel.py), which keeps communicating with enterprise-gateway via a socket.
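To make that flow a bit more concrete, here is a minimal sketch of the idea (not enterprise-gateway's actual launcher code; the template variables, namespace, and image are placeholder assumptions): render such a template and submit it as a SparkApplication custom resource, which the Spark operator then reconciles into driver/executor pods hosting the kernel.

```python
# Minimal sketch of the flow described above (NOT enterprise-gateway's actual
# launcher code): render a SparkApplication template and submit it as a
# custom resource, which the Spark operator then reconciles into pods.
# Template variable names, namespace and image are placeholders/assumptions.
import yaml
from jinja2 import Template
from kubernetes import client, config

config.load_incluster_config()  # or load_kube_config() outside the cluster

with open("sparkoperator.k8s.io-v1beta2.yaml.j2") as f:
    rendered = Template(f.read()).render(
        kernel_id="example-kernel-id",          # placeholder values; the real
        kernel_namespace="enterprise-gateway",  # launcher fills these from its env
        spark_image="ghcr.io/apache/spark-docker/spark:3.5.0",
    )

spark_app = yaml.safe_load(rendered)

# Submit the SparkApplication custom resource to the API server.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="enterprise-gateway",
    plural="sparkapplications",
    body=spark_app,
)
```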
Note that they do not spawn a service with the jupyter kernel pod (I thus didn't manage to make it work with istio for that very reason... but I'm out of ideas right now).
I hope my explanation clears up some of the confusion.
Perhaps what you need is a PySpark SparkSession connected as a client to a Spark Connect server, available since Spark 3.4.0: https://spark.apache.org/docs/latest/spark-connect-overview.html
I've developed a module for deploying the latest 3.4.0 server-client mode on k8s, with support for configuring a PySpark session for direct connections. How about checking it out? Wh1isper/sparglim#spark-connect-server-on-k8s
Alternatively, a PySpark session can be deployed in client mode on k8s, also available in https://github.com/Wh1isper/sparglim#pyspark-app
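For reference, once a Spark Connect server is reachable (for example exposed as a k8s Service), connecting from a notebook is a one-liner in PySpark >= 3.4 with the pyspark[connect] extras installed. A minimal sketch, assuming the placeholder address below:

```python
# Minimal Spark Connect client sketch (PySpark >= 3.4).
# The sc:// address is an assumed placeholder for however the Spark Connect
# server is exposed in your cluster; 15002 is the default Connect port.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote(
    "sc://spark-connect.default.svc.cluster.local:15002"
).getOrCreate()

# The DataFrame API runs on the remote server, not in the notebook process.
spark.range(100).selectExpr("sum(id) AS total").show()
```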
I got this working quite simply, following this explainer about running Spark in client mode: https://medium.com/@sephinreji98/understanding-spark-cluster-modes-client-vs-cluster-vs-local-d3c41ea96073

Deploy the Jupyter Spark manifest. Include a headless service to run in client mode, and provide the spark service account to the deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  labels:
    app: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: jupyter/pyspark-notebook:spark-3.5.0
          resources:
            requests:
              memory: 4096Mi
            limits:
              memory: 4096Mi
          env:
            - name: JUPYTER_PORT
              value: "8888"
          ports:
            - containerPort: 8888
      serviceAccount: spark
      serviceAccountName: spark
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter
spec:
  type: ClusterIP
  selector:
    app: jupyter
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter-headless
spec:
  clusterIP: None
  selector:
    app: jupyter
```
You can port-forward the jupyter service on port 8888 and use the access token from the logs.
Connecting to Spark Operator
I got all configs from the documentation:
```python
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("JupyterApp")
    .master("k8s://https://kubernetes.default.svc.cluster.local:443")
    .config("spark.submit.deployMode", "client")
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "1G")
    .config("spark.driver.memory", "1G")
    .config("spark.executor.cores", "1")
    .config("spark.kubernetes.namespace", "default")
    .config(
        "spark.kubernetes.container.image", "ghcr.io/apache/spark-docker/spark:3.5.0"
    )
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.kubernetes.driver.pod.name", os.environ["HOSTNAME"])
    .config("spark.driver.bindAddress", "0.0.0.0")
    .config("spark.driver.host", "jupyter-headless.default.svc.cluster.local")
    .getOrCreate()
)
```
This will create the executor pod with jupyter as the client:
```
❯ kubectl get po -n default
NAME                                 READY   STATUS    RESTARTS   AGE
jupyter-7495cfdddc-864rd             1/1     Running   0          2m7s
jupyterapp-d3c7258ec363aa87-exec-1   1/1     Running   0          88s
```
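As a quick sanity check (using the spark session created above), you can trigger a small job from the notebook and confirm the work is dispatched to the executor pod rather than running locally:

```python
# Quick sanity check against the session created above: this triggers a real
# Spark job, so the sum is computed on the executor pod, not in the notebook.
df = spark.range(1_000_000)
df.selectExpr("sum(id) AS total").show()
```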
If you have any issues, questions and/or improvements, let me know!
@JWDobken It works, but doesn't communicate with the operator pod, is that right? I don't even need it running it seems.
@pivettamarcos

> but doesn't communicate with the operator pod, is that right?

If I understand correctly, it's using client mode, as said, which is a different mode from submitting tasks through the operator, so naturally there's no operator involved.
I've been using this feature since Spark 3.1.2. If you need to build services via Spark, this has lower latency than submitting tasks, but it is more costly to maintain (you may need to design the task's message queue). I built a simple gRPC data sampling service: https://github.com/Wh1isper/pyspark-sampling/ and provide an SDK for using Spark in client mode or deploying a Connect service in k8s: https://github.com/Wh1isper/sparglim
If you want to use a k8s-deployed Spark remotely in Jupyter (via client-server mode), I highly recommend you try https://github.com/Wh1isper/sparglim, given that I haven't seen any official documentation for this at this point (if there is, thanks to anyone who can point me to it!).
How do you delete the pod once it is in an Error state? @JWDobken
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.