kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0
2.78k stars · 1.38k forks

Could we reconsider including the Spark history server in this repo? #1295

Open nonpool opened 3 years ago

nonpool commented 3 years ago

background: issue #164

reason:

  1. The Spark history server no longer has a stable, maintained Helm chart, since the helm/charts repo was archived.
  2. As a user of spark-on-k8s-operator, you will very likely need the Spark history server, because the Spark driver's web UI becomes inaccessible once the application completes.

Of course, if you have a better Spark history visualization solution, you could provide that instead. I really could not find any such information in the documentation.

What do you think?
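For a history server to have anything to show, each application also needs event logging enabled; a minimal sketch of the relevant SparkApplication settings (the bucket path is illustrative, not from this thread):

```yaml
# Sketch: enable Spark event logging in a SparkApplication spec so a
# history server can replay completed applications. "my-bucket" is illustrative.
spec:
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://my-bucket/eventLogFolder"
```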

haolixu commented 3 years ago

There is an online plan: https://github.com/datamechanics/delight

nonpool commented 3 years ago

> There is an online plan: https://github.com/datamechanics/delight

Thanks for your suggestion. Delight is a great Spark history visualization solution (the visual style is very modern, and the configuration is simple and easy to use). But its limitations are also obvious: the visualization page can only be used online and cannot be self-hosted, so it is not suitable for our scenario.

indranilr commented 3 years ago

You could write your own deployment and run the history server using ./sbin/start-history-server.sh. I have also located an alternative hosting of the charts at https://artifacthub.io/packages/helm/spot/spark-history-server, which might help.
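As a sketch of the manual route: from an unpacked Spark distribution, the history server is configured through SPARK_HISTORY_OPTS and launched with the script above (the s3a:// path is illustrative; 18080 is the default UI port):

```shell
# Sketch: assemble history-server options, then launch from $SPARK_HOME.
LOG_DIR="s3a://my-bucket/eventLogFolder"   # illustrative; use your event-log dir
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=${LOG_DIR} -Dspark.history.ui.port=18080"
export SPARK_HISTORY_OPTS
echo "$SPARK_HISTORY_OPTS"
# "$SPARK_HOME/sbin/start-history-server.sh"   # uncomment to actually start it
```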

jdonnelly-apixio commented 3 years ago

I am writing Spark events to S3, so I built a new Docker image that adds a couple of jars and changes the entry point to run the Spark history server.

FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3

USER root

ADD https://xxx.com/artifactory/apixio-spark/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars/
ADD https://xxx.com/artifactory/apixio-spark/com/amazonaws/aws-java-sdk-bundle/1.7.4.2/aws-java-sdk-1.7.4.2.jar $SPARK_HOME/jars/

ENTRYPOINT bash /opt/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1

Then just a deployment yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spark-hs-custom
    version: 3.1.1
  name: spark-hs-custom
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-hs-custom
      version: 3.1.1
  template:
    metadata:
      labels:
        app: spark-hs-custom
        version: 3.1.1
    spec:
      containers:
        - env:
          # Note: spark-daemon.sh only checks whether SPARK_NO_DAEMONIZE is
          # set, not its value, so even "false" keeps the server in the foreground.
          - name: SPARK_NO_DAEMONIZE
            value: "false"
          - name: SPARK_HISTORY_OPTS
            value: -Dspark.history.fs.logDirectory=s3a://my-bucket-name/eventLogFolder
          image: xxx.dkr.ecr.us-west-2.amazonaws.com/xxx-spark-hs:v0.0.4
          imagePullPolicy: IfNotPresent
          name: spark-hs-custom
          ports:
            - containerPort: 18080
              name: http
              protocol: TCP
          resources:
            requests:
              cpu: "2"
              memory: 10Gi
            limits:
              cpu: "2"
              memory: 10Gi
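The Deployment alone is not reachable from outside its pod; a minimal Service sketch (reusing the labels and port from the Deployment above) could expose the UI inside the cluster:

```yaml
# Sketch: ClusterIP Service exposing the history server UI on its default port.
apiVersion: v1
kind: Service
metadata:
  name: spark-hs-custom
spec:
  selector:
    app: spark-hs-custom
  ports:
    - name: http
      port: 18080
      targetPort: 18080
```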

stephbat commented 3 years ago

@jdonnelly-apixio: I tried your solution (with a USER other than root) and I get this error in the Spark history server logs:

Exception in thread "main" java.io.IOException: failure to login
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:841)

I think a user must be created in the Docker image.

jdonnelly-apixio commented 3 years ago

@stephbat Yeah, I think I hit that issue as well when I was running as someone other than root. Google's base image uses user 185, but I wasn't able to get it to work with that one. Can you post what you did if you figure out a solution?

jdonnelly-apixio commented 3 years ago

Yeah, confirmed: I get that exception when I try USER 185:

21/07/13 22:00:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
        at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(Unknown Source)
        at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
        at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
        at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
        at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)

        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
        at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
        at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
        at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
        at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(Unknown Source)
        at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
        at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
        at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2476)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2476)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
        at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:333)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:294)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)

        at java.base/javax.security.auth.login.LoginContext.invoke(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext$4.run(Unknown Source)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
        at java.base/javax.security.auth.login.LoginContext.login(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
        ... 10 more

jdonnelly-apixio commented 3 years ago

This works for me:

FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3

USER root

ADD https://xxx.com/artifactory/apixio-spark/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars/
ADD https://xxx.com/artifactory/apixio-spark/com/amazonaws/aws-java-sdk-bundle/1.7.4.2/aws-java-sdk-1.7.4.2.jar $SPARK_HOME/jars/

RUN groupadd -g 185 spark && \
    useradd -u 185 -g 185 spark

USER 185
ENTRYPOINT bash /opt/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1

nonpool commented 3 years ago

@jdonnelly-apixio @indranilr Thanks for your solutions. In fact, I also use a Deployment to run the Spark history server, just like yours. But what I want to say is that deploying the Spark history server ourselves is completely different from having it included in this Helm repo.

I think we could simply extend values.yaml to make this work well.

jdonnelly-apixio commented 3 years ago

@nonpool Yep, kind of agreed. It would be useful if the spark-operator supported deploying common companion services like the Spark history server, a Hive metastore, a Prometheus server, etc.

Can a Helm chart install other Helm charts? If not, I'm not sure it would make sense to duplicate Helm install functionality for things like a Prometheus server inside the spark-operator chart; the official prometheus-community chart should probably be used instead. I'm not sure what best practice is from a Helm standpoint. Maybe some additional documentation that shows, or links to, how to install some common useful additional services would be a good start.
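On the Helm question: charts can in fact declare other charts as dependencies in Chart.yaml (Helm v3), which is the usual way to pull in something like the prometheus-community chart as an optional subchart. A sketch (chart name, version, and condition flag are illustrative):

```yaml
# Sketch: Chart.yaml declaring an optional subchart dependency, fetched
# with `helm dependency update` and toggled from values.yaml.
apiVersion: v2
name: spark-operator
version: 1.0.0
dependencies:
  - name: prometheus
    version: "25.x.x"
    repository: https://prometheus-community.github.io/helm-charts
    condition: prometheus.enabled
```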

chetkhatri commented 2 years ago

Hi @stephbat @indranilr @nonpool @jdonnelly-apixio @haolixu I am currently confused: how do I install the Spark history server on Kubernetes after a successful installation of the Spark operator?

oleksiilopasov commented 2 years ago

@chetkhatri Here is my personal experience:

  1. Build an image with the following Dockerfile:

     ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v3.1.1-hadoop3
     FROM ${SPARK_IMAGE}

     USER root

     # Setup dependencies for Google Cloud Storage access.
     RUN rm $SPARK_HOME/jars/guava-*.jar
     ADD https://repo1.maven.org/maven2/com/google/guava/guava/27.0-jre/guava-27.0-jre.jar $SPARK_HOME/jars
     RUN chmod 644 $SPARK_HOME/jars/guava-27.0-jre.jar

     # Add the connector jar needed to access Google Cloud Storage using the Hadoop FileSystem API.
     ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop3.jar $SPARK_HOME/jars
     RUN chmod 644 $SPARK_HOME/jars/gcs-connector-latest-hadoop3.jar

     USER ${spark_uid}

     ENTRYPOINT ${SPARK_HOME}/sbin/start-history-server.sh

     I've added some stuff related to GCP. Feel free to add your own for AWS/Azure, whatever...
  2. Create a new chart with `helm create`.
  3. Change the following in `templates/deployment.yaml`:

     - `spec.template.spec.containers[0].ports.containerPort` -> 18080
     - `spec.template.spec.containers[0].env` ->

     env:
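The env section of such a deployment typically carries SPARK_HISTORY_OPTS; a sketch assuming GCS event logs (the bucket name is illustrative, not from this comment):

```yaml
# Sketch only: point the history server at a GCS event-log directory.
env:
  - name: SPARK_HISTORY_OPTS
    value: "-Dspark.history.fs.logDirectory=gs://my-bucket/eventLogFolder"
```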

liangchen-datanerd commented 2 years ago

This works for me:

FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3

USER root

ADD https://xxx.com/artifactory/apixio-spark/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars/
ADD https://xxx.com/artifactory/apixio-spark/com/amazonaws/aws-java-sdk-bundle/1.7.4.2/aws-java-sdk-1.7.4.2.jar $SPARK_HOME/jars/

RUN groupadd -g 185 spark && \
    useradd -u 185 -g 185 spark

USER 185
ENTRYPOINT bash /opt/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1

Awesome, this works! But I have no idea how it works.
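On why it works: Hadoop's UnixLoginModule resolves the current UID to a user name via the passwd database, and a UID with no passwd entry yields a null name, which is exactly the "invalid null input: name" error above. The groupadd/useradd lines give UID 185 a real entry. A toy illustration of that lookup (file path and entries are made up for the demo):

```shell
# Simulate the UID -> name lookup that Hadoop's login module relies on.
lookup_name() {  # $1 = uid, $2 = passwd-style file
  awk -F: -v uid="$1" '$3 == uid { print $1 }' "$2"
}

printf 'root:x:0:0:root:/root:/bin/bash\n' > /tmp/passwd.demo
echo "name for 185: '$(lookup_name 185 /tmp/passwd.demo)'"   # empty -> login fails

# Adding an entry for UID 185 (what useradd does) makes the name resolvable.
printf 'spark:x:185:185::/home/spark:/bin/sh\n' >> /tmp/passwd.demo
echo "name for 185: '$(lookup_name 185 /tmp/passwd.demo)'"   # now 'spark'
```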

gFazzari commented 1 year ago

Hi everyone, I have one question: are you also able to collect the driver/executor stdout/stderr logs that you can see through kubectl logs?

vrd83 commented 9 months ago

Has anyone tried to do this more recently? I'm trying with the apache/spark:3.5.0 container image and have lost a day to dependency hell. A Spark history server Helm chart in this repo, or documentation on how to get this up and running, would be very welcome indeed.

Cian911 commented 4 months ago

What is the securityContext set to on your Deployment for the history server? In my case, I had the same issue as @jdonnelly-apixio here, but the real problem was that my securityContext was set to run as a different user.

The fix, in my case, was simply:

securityContext:
  runAsUser: 185

Previously it was set to run as user 1000. This fixed the issue for me.
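For context, in a Deployment this snippet sits at the pod (or container) level; a minimal sketch, reusing the container name from the earlier example:

```yaml
# Sketch: pod-level securityContext matching the UID baked into the image.
spec:
  template:
    spec:
      securityContext:
        runAsUser: 185   # must match a UID with a passwd entry in the image
      containers:
        - name: spark-hs-custom
```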

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tekumara commented 1 month ago

Bump to keep alive as this would be useful.