kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[QUESTION] My SparkApplication won't start after migrating to spark:3.5.0 #2088

Open networkingana opened 1 month ago

networkingana commented 1 month ago

I changed the base image from the gcr.io spark:3.1.1 image to spark:3.5.0 and built my Dockerfile from there; the image contains my DataAnalyticsReporting.jar.

This is the error when the spark-operator tries to start my application:

```
[root@master-node ~]# kubectl logs data-analytics-reporting-driver
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
Files local:///opt/spark-jars/DataAnalyticsReporting.jar from /opt/spark-jars/DataAnalyticsReporting.jar to /opt/spark-jars/DataAnalyticsReporting.jar
2024-07-19 06:24:51.143 WARN  [main            ] org.apache.spark.network.util.JavaUtils:112  - Attempt to delete using native Unix OS command failed for path = /opt/spark-jars/DataAnalyticsReporting.jar. Falling back to Java IO way
java.io.IOException: Failed to delete: /opt/spark-jars/DataAnalyticsReporting.jar
        at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingUnixNative(JavaUtils.java:173) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:109) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:90) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.util.SparkFileUtils.deleteRecursively(SparkFileUtils.scala:121) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.util.SparkFileUtils.deleteRecursively$(SparkFileUtils.scala:120) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1126) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437) ~[DataAnalyticsReporting.jar:?]
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) ~[DataAnalyticsReporting.jar:?]
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[DataAnalyticsReporting.jar:?]
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) ~[DataAnalyticsReporting.jar:?]
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ~[DataAnalyticsReporting.jar:?]
        at scala.collection.TraversableLike.map(TraversableLike.scala:286) ~[DataAnalyticsReporting.jar:?]
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279) ~[DataAnalyticsReporting.jar:?]
        at scala.collection.AbstractTraversable.map(Traversable.scala:108) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:429) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$17(SparkSubmit.scala:453) ~[DataAnalyticsReporting.jar:?]
        at scala.Option.map(Option.scala:230) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:453) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129) ~[DataAnalyticsReporting.jar:?]
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[DataAnalyticsReporting.jar:?]
Exception in thread "main" java.io.IOException: Failed to delete: /opt/spark-jars/DataAnalyticsReporting.jar
        at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:146)
        at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:117)
        at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:90)
        at org.apache.spark.util.SparkFileUtils.deleteRecursively(SparkFileUtils.scala:121)
        at org.apache.spark.util.SparkFileUtils.deleteRecursively$(SparkFileUtils.scala:120)
        at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1126)
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:429)
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$17(SparkSubmit.scala:453)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:453)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

```

This is my SparkApplication YAML:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: data-analytics-reporting
  namespace: default
spec:
  arguments:
  - --clientname
  - data-analytics-reporting-mk
  - --apploglevel
  - trace
  - --rootloglevel
  - trace
  - --kafka
  - my-cluster-kafka-bootstrap.kafka:9092
  - --reportlocation
  - /opt/utms/reports/
  - --reportreqcheckpoint
  - /opt/utms/checkpoints/report-request
  - --reportreqsg
  - spark-dar-mk
  - --dburls
  - k8ssandra-dc2-service.k8ssandra-operator
  - --dbport
  - "9042"
  - --dbuser
  - k8ssandra-superuser
  - --dbpass
  - 
  driver:
    coreLimit: 3000m
    cores: 2
    javaOptions: -Divy.cache.dir=/tmp -Divy.home=/tmp -Dhttps.protocols=TLSv1.2
    labels:
      version: 3.5.0
    memory: 2g
    podSecurityContext:
      fsGroup: 1000
      runAsUser: 1000
    securityContext:
      allowPrivilegeEscalation: false
      runAsUser: 2000
    serviceAccount: my-release-spark-operator
    volumeMounts:
    - mountPath: /opt/spark
      name: spark-conf-volume-driver
    - mountPath: /opt/utms
      name: checkpoints
  executor:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: spark-role
              operator: In
              values:
              - executor
            - key: sparkoperator.k8s.io/app-name
              operator: In
              values:
              - data-analytics-reporting
          topologyKey: kubernetes.io/hostname
    cores: 2
    instances: 2
    labels:
      version: 3.5.0
    memory: 3g
    podSecurityContext:
      fsGroup: 1000
      runAsUser: 1000
    securityContext:
      allowPrivilegeEscalation: false
      runAsUser: 2000
    serviceAccount: my-release-spark-operator
    volumeMounts:
    - mountPath: /opt/spark
      name: spark-conf-volume-driver
    - mountPath: /opt/utms
      name: checkpoints
  image: docker.payten.com/spark-application/data-analytics-reporting:1.1.10-newbase-12
  imagePullSecrets:
  - regcred
  mainApplicationFile: local:///opt/spark-jars/DataAnalyticsReporting.jar
  mainClass: com.payten.dar.Main
  mode: cluster
  restartPolicy:
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
    type: Always
  sparkConf:
    spark.driver.extraClassPath: /opt/spark-jars/*
    spark.driver.extraJavaOptions: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC
    spark.executor.extraClassPath: /opt/spark-jars/*
    spark.executor.extraJavaOptions: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC
    spark.executor.memoryOverhead: 884M
    spark.memory.offHeap.enabled: "true"
    spark.memory.offHeap.size: 500M
    spark.sql.ui.retainedExecutions: "300"
    spark.streaming.ui.retainedBatches: "300"
    spark.ui.retainedDeadExecutors: "50"
    spark.ui.retainedJobs: "300"
    spark.ui.retainedStages: "300"
    spark.ui.retainedTasks: "500"
    spark.worker.ui.retainedDrivers: "300"
    spark.worker.ui.retainedExecutors: "300"
  sparkVersion: 3.5.0
  type: Java
  volumes:
  - name: checkpoints
    persistentVolumeClaim:
      claimName: utms-1-prod
```

And this is my Dockerfile:

```dockerfile
FROM spark:3.5.0

USER root
WORKDIR /opt/spark-jars/
ADD build/libs/DataAnalyticsReporting*.jar ./DataAnalyticsReporting.jar
ADD log4j-core-2.19.0.jar ./log4j-core-2.19.0.jar
ADD log4j-api-2.19.0.jar ./log4j-api-2.19.0.jar
ADD src/main/resources/log4j2.xml ./log4j2.xml
ADD spark-3-rules.yaml ./spark-3-rules.yaml
ADD jmx_prometheus_javaagent-0.11.0.jar ./jmx_prometheus_javaagent-0.11.0.jar

# Change ownership and permissions to spark user
RUN chown spark:spark /opt/spark-jars/* \
    && chmod 777 /opt/spark-jars/*

USER spark

```

I tried playing with different users, securityContext, and podSecurityContext, but with no success. Please let me know how I should configure this.

milan-dutta commented 1 month ago

Is there any specific reason for setting the working directory to /opt/spark-jars/ while also keeping your .jar files in that same folder? Try changing either the working directory or the folder that holds the .jar files.
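
For what it's worth, the stack trace points at `downloadResourcesToCurrentDirectory` in `SparkSubmit`, which stages submitted resources into the driver's current working directory and first deletes any file already sitting there. With `WORKDIR /opt/spark-jars/` that file is the application jar itself, and because the Dockerfile above only `chown`s the files (`/opt/spark-jars/*`) and not the directory, the non-root driver user cannot delete anything inside it. A minimal sketch of the suggestion above, keeping the jars out of the working directory (paths and `COPY` sources are illustrative, not the author's exact fix):

```dockerfile
FROM spark:3.5.0

USER root
# Keep the application jars in their own directory, separate from the
# directory spark-submit will use as the working directory.
COPY build/libs/DataAnalyticsReporting*.jar /opt/spark-jars/DataAnalyticsReporting.jar
COPY src/main/resources/log4j2.xml /opt/spark-jars/log4j2.xml
# Own the directory itself as well as the files, so the runtime user
# can create and delete entries under it.
RUN chown -R spark:spark /opt/spark-jars

# Leave the working directory at the image default, which the spark
# user can already write to, so staged resources land there instead.
WORKDIR /opt/spark/work-dir
USER spark
```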

ChenYi015 commented 1 month ago

@networkingana Have you tried running the app with user 185 and group 185? The spark user's UID and GID are both 185:

```
$ kubectl run --rm --restart=Never -i -t spark --image=spark:3.5.0 -- bash
If you don't see a command prompt, try pressing enter.
spark@spark:/opt/spark/work-dir$ id
uid=185(spark) gid=185(spark) groups=185(spark)
```
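
If you'd rather express that on the SparkApplication side than in the image, a hedged sketch of the driver block (same fields as the YAML above; 185 is the uid/gid reported by `id` in the base image, and whether this alone fixes the delete error is untested):

```yaml
driver:
  podSecurityContext:
    # Match the spark user baked into the spark:3.5.0 image.
    runAsUser: 185
    runAsGroup: 185
    # Make mounted volumes group-accessible to that gid.
    fsGroup: 185
  securityContext:
    allowPrivilegeEscalation: false
    runAsUser: 185
```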
networkingana commented 1 month ago

I resolved this issue by changing my Dockerfile, correcting the permissions and ownership of my files. One strange thing is that my application doesn't print its logs:

    spark.driver.extraClassPath: local:///opt/spark-jars/*
    spark.driver.extraJavaOptions: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC
      -Dlog4j2.configurationFile=file:///opt/spark-jars/log4j2.xml --add-opens java.base/java.nio=ALL-UNNAMED  --add-opens
      java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/jdk.internal.misc=ALL-UNNAMED  --add-opens
      java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/sun.security.ssl=ALL-UNNAMED  --add-opens
      java.base/java.util=ALL-UNNAMED
    spark.executor.extraClassPath: local:///opt/spark-jars/*
    spark.executor.extraJavaOptions: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j2.configurationFile=file:///opt/spark-jars/log4j2.xml

I'm using the same log4j2.xml file as with older Spark versions.
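
Two hedged guesses on the missing logs, neither confirmed in this thread: `spark.driver.extraClassPath` / `spark.executor.extraClassPath` are passed to the JVM as plain classpath entries, so a `local://` URI there may simply not resolve to the Log4j jars, and Log4j 2 prints its own status output when `log4j2.debug` is set, which shows whether `/opt/spark-jars/log4j2.xml` is actually being picked up. A sketch of that diagnostic variant:

```yaml
sparkConf:
  # Plain filesystem paths (no local:// scheme) for JVM classpath entries.
  spark.driver.extraClassPath: /opt/spark-jars/*
  spark.executor.extraClassPath: /opt/spark-jars/*
  # -Dlog4j2.debug makes Log4j 2 report how it resolved its configuration.
  spark.driver.extraJavaOptions: >-
    -Duser.timezone=UTC
    -Dlog4j2.configurationFile=file:///opt/spark-jars/log4j2.xml
    -Dlog4j2.debug
```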

jpthompson23 commented 1 week ago

@networkingana How did you actually fix the problem? "changing my Dockerfile, correcting the permissions and ownership of my files" is too vague. I am encountering the same issue and I am trying to find a solution.

Faivrem commented 4 days ago

Same issue here. Do you have any updates on it?

networkingana commented 4 days ago

I ended up writing my Dockerfile like this:

```dockerfile
FROM spark:3.5.0

WORKDIR /opt/spark-jars/

ADD build/libs/DataAnalytics*.jar ./DataAnalytics.jar
ADD log4j-core-2.19.0.jar ./log4j-core-2.19.0.jar
ADD log4j-api-2.19.0.jar ./log4j-api-2.19.0.jar
ADD src/main/resources/log4j2.xml ./log4j2.xml
ADD spark-3-rules.yaml ./spark-3-rules.yaml
ADD jmx_prometheus_javaagent-0.11.0.jar ./jmx_prometheus_javaagent-0.11.0.jar

# Spark base image uses user and group 185
USER root
RUN chown -R 185:185 /opt/spark-jars && chmod 777 /opt/spark-jars/*

WORKDIR /opt/spark/work-dir
USER spark
```

I think the key was switching the WORKDIR back to /opt/spark/work-dir and having the jar files owned by the spark user that ships with the spark:3.5.0 base image.
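
A quick way to sanity-check an image built like this before deploying it (the image tag is a placeholder; expect `uid=185(spark)` and `185:185` ownership on both the directory and the jars):

```
$ docker run --rm --entrypoint /bin/bash your-image:tag \
    -c 'id && ls -lnd /opt/spark-jars && ls -ln /opt/spark-jars'
```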