kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Integrate Spark operator with Jupyterhub #2180

Open InzamamAnwar opened 1 month ago

InzamamAnwar commented 1 month ago


How can JupyterHub be integrated with the Spark Operator? I tried installing JupyterHub and the Spark Operator together, but it isn't working.

JupyterHub and the Spark Operator are running in the same namespace. I attached the spark-operator service account to JupyterHub so that it can talk to the Kubernetes API. I create a SparkSession with the following code:

import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("JupyterApp")
    .master("k8s://https://kubernetes.default.svc:443")
    .config("spark.submit.deployMode", "client")
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "1G")
    .config("spark.driver.memory", "1G")
    .config("spark.executor.cores", "1")
    .config("spark.kubernetes.namespace", "spark-operator")
    .config(
        "spark.kubernetes.container.image", "spark:3.5.0"
    )
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-operator")
    .getOrCreate()
)
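One thing worth checking (an assumption on my part, not something confirmed in the issue): in client deploy mode on Kubernetes, the driver runs inside the notebook pod and the executors must be able to connect back to it, so the driver's address and ports usually need to be set explicitly. A minimal sketch of the extra settings, assuming a `MY_POD_IP` environment variable injected via the Kubernetes downward API (the variable name and port numbers here are illustrative, not required values):

```python
import os

# Hypothetical extra settings often needed for client mode on Kubernetes.
# MY_POD_IP is assumed to be injected via the downward API; adjust to your setup.
driver_conf = {
    # Address the executors use to reach the driver running in the notebook pod
    "spark.driver.host": os.environ.get("MY_POD_IP", "jupyter-driver.spark-operator.svc"),
    # Fixed ports, so a headless service or network policy can expose them
    "spark.driver.port": "29413",
    "spark.blockManager.port": "29414",
}

# These would be merged into the builder alongside the settings above, e.g.:
#   for k, v in driver_conf.items():
#       builder = builder.config(k, v)
for k, v in driver_conf.items():
    print(f"{k}={v}")
```

If the driver host is unreachable from the executor pods, the executors typically start, fail to register, and are killed, which matches the symptom described below.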

The executors are created and then killed immediately afterwards. I cannot see a driver pod anywhere. The error we get with the above code is shown below:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Spark context stopped while waiting for backend
    at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:1224)
    at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:246)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:694)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:833)

Can anyone help with this?


ha2hi commented 1 month ago

Hi,

Do the Python and Spark versions on the JupyterHub server match those in the `spark:3.5.0` image?

I recently tested the `spark:3.5.0` image with Jupyter Notebook. The write-up is in Korean, but I hope it helps:

https://github.com/ha2hi/spark-study/tree/main/spark-on-k8s/Jupyter-Notebook
https://github.com/ha2hi/spark-study/tree/main/spark-on-k8s/Jupyter-Hub
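To illustrate the version check being suggested: in client mode the notebook's PySpark acts as the driver, so its Spark major.minor version should match the executor image's. A small helper sketch (`versions_compatible` is hypothetical, not part of any library) comparing, say, `pyspark.__version__` on the notebook side against the image tag:

```python
def versions_compatible(driver_spark: str, image_spark: str) -> bool:
    """Driver and executors should share the same Spark major.minor version."""
    return driver_spark.split(".")[:2] == image_spark.split(".")[:2]

# e.g. pyspark.__version__ in the notebook kernel vs. the spark:3.5.0 image
print(versions_compatible("3.5.1", "3.5.0"))  # same 3.5.x line -> True
print(versions_compatible("3.4.2", "3.5.0"))  # mismatch -> False
```

The Python version inside the image should be compared the same way (e.g. via `kubectl exec` into a running executor), since a PySpark driver/executor Python mismatch also causes startup failures.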