kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Spark operator + jupyter notebook? #1652

Open tamis-laan opened 1 year ago

tamis-laan commented 1 year ago

We are running the spark k8s operator in order to process data using the yaml spec in production. This works great but we also want to do exploratory data analyses using Jupyter notebooks. Is this possible using the spark k8s operator?

tafaust commented 1 year ago

@tamis-laan you can do it with the https://github.com/jupyter-server/enterprise_gateway backend for Jupyter. Side note: enterprise-gateway does run on k8s as a backend for your JupyterHub, but it does not follow k8s best practices. The kernel pods do NOT start a Service. I have not been able to get k8s service meshes (Istio in my case) running yet. If that's not a requirement for you, you're good to go. If you use the Istio service mesh and are able to fix it, I'm all ears. :)

Here is my enterprise-gateway istio issue for reference: https://github.com/jupyter-server/enterprise_gateway/issues/1168
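
For reference, pointing a Jupyter server at an Enterprise Gateway deployment is mostly a matter of setting the gateway URL. A minimal sketch, assuming the gateway was installed as a service named enterprise-gateway in an enterprise-gateway namespace (adjust to your setup):

# jupyter_server_config.py (or jupyter_notebook_config.py) -- minimal sketch.
# The URL below assumes Enterprise Gateway is exposed as service
# "enterprise-gateway" in namespace "enterprise-gateway"; adjust as needed.
c = get_config()  # provided by the Jupyter config loader
c.GatewayClient.url = "http://enterprise-gateway.enterprise-gateway.svc.cluster.local:8888"
c.GatewayClient.request_timeout = 120.0  # remote kernel pods can take a while to start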

tamis-laan commented 1 year ago

> @tamis-laan you can do it with the https://github.com/jupyter-server/enterprise_gateway backend for Jupyter. Side note: enterprise-gateway does run on k8s as a backend for your JupyterHub, but it does not follow k8s best practices. The kernel pods do NOT start a Service. I have not been able to get k8s service meshes (Istio in my case) running yet. If that's not a requirement for you, you're good to go. If you use the Istio service mesh and are able to fix it, I'm all ears. :)
>
> Here is my enterprise-gateway istio issue for reference: jupyter-server/enterprise_gateway#1168

I discovered Spark also allows for executing jobs directly on kubernetes: https://spark.apache.org/docs/latest/running-on-kubernetes.html

When you use spark-submit and point Spark directly at your Kubernetes API server (k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>), it will start the workers there as pods. Thus, when you run JupyterHub in your cluster, it should be possible to start jobs on the cluster directly from PySpark.

So I'm not sure how the Jupyter Enterprise Gateway differs from this setup? Also, it says it doesn't provide JupyterHub, so it's not possible to have multiple users with multiple notebooks. Which one is better/preferred?

tafaust commented 1 year ago

Jupyterhub extends Jupyter notebooks (see https://zero-to-jupyterhub.readthedocs.io/en/stable/_images/architecture.png).

Thus, JupyterHub starts a JupyterLab or plain Jupyter Notebook for your user within a pod and does some management around that. A Jupyter Notebook can have a remote backend (https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#using-a-gateway-server-for-kernel-management), and enterprise-gateway is an implementation of that. enterprise-gateway runs on various resource managers, including Kubernetes. It acts as a sort of operator that spawns your kernels as pods in Kubernetes (allowing for horizontal scaling of your kernel, e.g. the Python runtime).

> I discovered Spark also allows for executing jobs directly on kubernetes: https://spark.apache.org/docs/latest/running-on-kubernetes.html

Yes. Upstream Apache Spark can spawn a driver, which in turn spawns N executors. The executors run your code. However, they do not have an interactive mode. Jupyter Notebooks are interactive in that they spawn a runtime rather than run a script.

> When you use spark-submit and point Spark directly at your Kubernetes API server (k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>), it will start the workers there as pods. Thus, when you run JupyterHub in your cluster, it should be possible to start jobs on the cluster directly from PySpark.

See my argument regarding interactivity. AFAIU it won't work.

However, if you want to run individual Spark jobs from your Jupyter Notebook (instead of running your Jupyter Notebook kernel in Spark), check out: https://github.com/TIBCOSoftware/snappy-on-k8s/blob/master/charts/jupyter-with-spark/README.md

Shrinjay commented 1 year ago

@tahesse Thanks for providing a starting point! I'm curious as to how the enterprise gateway kernel would work with the operator though? I understand that enterprise gateway would allow us to run the Jupyter kernel in a k8s cluster, but I'm curious as to how this would enable us to submit jobs using the operator? Would the kernel generate job manifests?

avishayse commented 1 year ago

Following -- looking forward to this. How do you connect the operator when I want to run this at large scale with many users? It seems that zero-to-jupyterhub is a good fit. Can I execute the code in Jupyter in cluster mode as well? Thanks.

tafaust commented 1 year ago

> @tahesse Thanks for providing a starting point! I'm curious as to how the enterprise gateway kernel would work with the operator though? I understand that enterprise gateway would allow us to run the Jupyter kernel in a k8s cluster, but I'm curious as to how this would enable us to submit jobs using the operator? Would the kernel generate job manifests?

@Shrinjay WDYM with job manifests? AFAIU, enterprise-gateway is an operator itself. Your JupyterHub or JupyterLab (frontend-wise) communicates with enterprise-gateway if configured properly. In its kernel launcher (here: https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/operators/scripts/launch_custom_resource.py#L66-L68) it loads the declaration template (https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/operators/scripts/sparkoperator.k8s.io-v1beta2.yaml.j2), which is then submitted to Kubernetes. The Spark job spawns a Jupyter kernel (https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/python/scripts/launch_ipykernel.py) which keeps communicating with enterprise-gateway via a socket.
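
A rough sketch of what that launch path looks like, assuming the jinja2 and kubernetes Python packages (this illustrates the mechanism, it is not the actual enterprise-gateway code; the template path and render variables are hypothetical):

import yaml
from jinja2 import Environment, FileSystemLoader
from kubernetes import client, config

# Simplified illustration of the launch path described above.
config.load_incluster_config()  # the launcher runs inside the cluster

env = Environment(loader=FileSystemLoader("/path/to/templates"))  # hypothetical path
template = env.get_template("sparkoperator.k8s.io-v1beta2.yaml.j2")
rendered = template.render(
    kernel_id="abc123",          # hypothetical values; enterprise-gateway fills
    kernel_namespace="default",  # these in from the kernel launch request
    spark_image="spark:3.5.0",
)

# Submit the rendered SparkApplication custom resource to Kubernetes,
# where the spark-operator picks it up.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="default",
    plural="sparkapplications",
    body=yaml.safe_load(rendered),
)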

Note that they do not spawn a Service with the Jupyter kernel pod (which is exactly why I didn't manage to make it work with Istio... but I'm out of ideas right now).

I hope my explanation clears up some of the confusion.

Wh1isper commented 1 year ago

Perhaps what you need is a PySpark SparkSession connected as a client to a Spark Connect server (new in Spark 3.4.0): https://spark.apache.org/docs/latest/spark-connect-overview.html

I've developed a module for deploying the 3.4.0 server-client mode on k8s, with support for configuring a PySpark session for direct connections. How about checking this out? Wh1isper/sparglim#spark-connect-server-on-k8s

Alternatively, a PySpark session can be deployed in client mode on k8s, also available in https://github.com/Wh1isper/sparglim#pyspark-app
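
For completeness, connecting from a notebook to a Spark Connect server only requires the remote URL. A minimal sketch, assuming pyspark>=3.4 with the connect extras installed and a server exposed at a hypothetical service name:

from pyspark.sql import SparkSession

# Sketch: attach this notebook to a Spark Connect server running in the cluster.
# "spark-connect.default.svc.cluster.local" is a placeholder service name;
# 15002 is the default Spark Connect port.
spark = SparkSession.builder.remote(
    "sc://spark-connect.default.svc.cluster.local:15002"
).getOrCreate()

spark.range(10).show()  # the query executes on the remote Spark Connect server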

JWDobken commented 4 months ago

I got this working quite simply.

Following this explainer about running Spark in client mode: https://medium.com/@sephinreji98/understanding-spark-cluster-modes-client-vs-cluster-vs-local-d3c41ea96073

Deploy the Jupyter Spark manifest

Include a headless service to run in client mode and provide the spark service account to the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  labels:
    app: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: jupyter/pyspark-notebook:spark-3.5.0
          resources:
            requests:
              memory: 4096Mi
            limits:
              memory: 4096Mi
          env:
            - name: JUPYTER_PORT
              value: "8888"
          ports:
            - containerPort: 8888
      # Service account with permissions to create and manage executor pods
      serviceAccount: spark
      serviceAccountName: spark
---
# Regular ClusterIP service for reaching the notebook UI (e.g. via port-forward)
kind: Service
apiVersion: v1
metadata:
  name: jupyter
spec:
  type: ClusterIP
  selector:
    app: jupyter
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
---
# Headless service: gives the notebook (driver) pod a stable DNS name so that
# executors can connect back to the driver in client mode
kind: Service
apiVersion: v1
metadata:
  name: jupyter-headless
spec:
  clusterIP: None
  selector:
    app: jupyter

You can port-forward the jupyter service on port 8888 and use the access token from the logs.

Connecting to Spark Operator

I got all configs from the documentation:

import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("JupyterApp")
    # Talk to the Kubernetes API server from inside the cluster
    .master("k8s://https://kubernetes.default.svc.cluster.local:443")
    # Client mode: the driver runs inside this notebook pod
    .config("spark.submit.deployMode", "client")
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "1G")
    .config("spark.driver.memory", "1G")
    .config("spark.executor.cores", "1")
    .config("spark.kubernetes.namespace", "default")
    .config(
        "spark.kubernetes.container.image", "ghcr.io/apache/spark-docker/spark:3.5.0"
    )
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    # The notebook pod itself acts as the driver pod
    .config("spark.kubernetes.driver.pod.name", os.environ["HOSTNAME"])
    .config("spark.driver.bindAddress", "0.0.0.0")
    # Executors reach the driver through the headless service
    .config("spark.driver.host", "jupyter-headless.default.svc.cluster.local")
    .getOrCreate()
)

This will create the executor pod, with the Jupyter pod acting as the driver in client mode:

❯ kubectl get po -n default
NAME                                 READY   STATUS      RESTARTS   AGE
jupyter-7495cfdddc-864rd             1/1     Running     0          2m7s
jupyterapp-d3c7258ec363aa87-exec-1   1/1     Running     0          88s
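
As a quick sanity check, you can run a small job from the notebook and confirm that the work lands on the executor pod (a minimal example using the session created above):

# Tiny job to confirm the session schedules work on the k8s executor(s).
df = spark.range(1_000_000)
print(df.selectExpr("sum(id)").first()[0])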

If you have any issues, questions and/or improvements, let me know!

pivettamarcos commented 4 months ago

@JWDobken It works, but it doesn't communicate with the operator pod, is that right? It seems I don't even need it running.

Wh1isper commented 4 months ago

@pivettamarcos

> but doesn't communicate with the operator pod, is that right?

If I understand correctly, it's using client mode, as said, which is a different mode from submitting jobs via the operator, so naturally there's no operator involved.

I've been using this feature since Spark 3.1.2. If you need to build services on top of Spark, this has lower latency than submitting jobs, but it is more costly to maintain (you may need to design the task's message queue). I built a simple gRPC data sampling service: https://github.com/Wh1isper/pyspark-sampling/ and provide an SDK for using Spark in client mode or deploying a Connect service in k8s: https://github.com/Wh1isper/sparglim

If you want to use a Spark deployment on a k8s cluster remotely from Jupyter (via client-server mode), I highly recommend you try https://github.com/Wh1isper/sparglim, given that I haven't seen any official documentation for this at this point (if there is, thanks to anyone who can point me to it!).

parthweprom commented 4 weeks ago

How do you delete the pod once it is in an Error state? @JWDobken