kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[BUG] imagePullPolicy doesn't work on spark-operator 2.0.0 #2221

Closed missedone closed 1 month ago

missedone commented 1 month ago

Description

I'd like to set spec.imagePullPolicy: Always; however, it doesn't take effect, and the driver pod YAML shows the value is IfNotPresent.

Reproduction Code [Required]

Steps to reproduce the behavior: apply the sample spec

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: test-es-to-s3
spec:
  type: Python
  mode: cluster
  image: 'customized-bitnami-spark:dev'
# here is where I expected the image to always be pulled
  imagePullPolicy: Always
  sparkVersion: "3.4.3"
  mainApplicationFile: 'local:///opt/spark-apps/src/test_es_to_s3.py'
  sparkConf:
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    "spark.hadoop.hadoop.security.authentication": "simple"
    "spark.hadoop.hadoop.security.authorization": "false"
    "spark.eventLog.enabled": "true"
    "spark.executorEnv.LD_PRELOAD": "/opt/bitnami/common/lib/libnss_wrapper.so"
  hadoopConf:
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
  deps:
    packages:
      - org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
      - io.delta:delta-core_2.12:2.4.0
      - org.elasticsearch:elasticsearch-spark-30_2.12:8.10.2
      - org.apache.hadoop:hadoop-aws:3.3.1
      - com.amazonaws:aws-java-sdk-bundle:1.11.901

Checking the spark-operator log shows the following:

2024-10-05T23:20:30.887Z    INFO    sparkapplication/controller.go:716  Running spark-submit for SparkApplication   {"name": "test-es-to-s3", "namespace": "es-to-s3-test", "arguments": ["--master", "k8s://https://10.100.0.1:443", "--deploy-mode", "cluster", "--name", "test-es-to-s3", "--packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,io.delta:delta-core_2.12:2.4.0,org.elasticsearch:elasticsearch-spark-30_2.12:8.10.2,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901", "--conf", "spark.kubernetes.namespace=es-to-s3-test", "--conf", "spark.kubernetes.submission.waitAppCompletion=false", "--conf", "spark.eventLog.dir=s3a://es-to-s3-test/spark-logs/", "--conf", "spark.eventLog.enabled=true", "--conf", "spark.executorEnv.LD_PRELOAD=/opt/bitnami/common/lib/libnss_wrapper.so", "--conf", "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider", "--conf", "spark.hadoop.hadoop.security.authentication=simple", "--conf", "spark.hadoop.hadoop.security.authorization=false", "--conf", "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog", "--conf", "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension", "--conf", "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem", "--conf", "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider", "--conf", "spark.kubernetes.driver.pod.name=test-es-to-s3-driver", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=test-es-to-s3", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=9034ae88-656d-44e4-9985-d21fa37c85c5", "--conf", "spark.kubernetes.driver.container.image=customized-bitnami-spark:dev", "--conf", "spark.driver.cores=1", "--conf", "spark.driver.memory=1g", "--conf", 
"spark.kubernetes.authenticate.driver.serviceAccountName=es-to-s3-test-sa", "--conf", "spark.kubernetes.driver.label.version=3.4.3", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=test-es-to-s3", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=9034ae88-656d-44e4-9985-d21fa37c85c5", "--conf", "spark.executor.instances=1", "--conf", "spark.kubernetes.executor.container.image=customized-bitnami-spark:dev", "--conf", "spark.executor.cores=1", "--conf", "spark.executor.memory=2G", "--conf", "spark.kubernetes.authenticate.executor.serviceAccountName=es-to-s3-test-sa", "--conf", "spark.kubernetes.executor.deleteOnTermination=false", "local:///opt/spark-apps/src/test_es_to_s3.py"]}

We can see that the imagePullPolicy conf is missing from the spark-submit arguments list.

Expected behavior

The conf spark.kubernetes.container.image.pullPolicy should be set and passed to the driver and executor so that the latest image is always pulled.
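As a possible interim workaround (untested here, and an assumption on my part), the same Spark conf key could be set directly in sparkConf, since the operator passes those entries through to spark-submit as --conf arguments:

```yaml
spec:
  sparkConf:
    # Spark's own Kubernetes conf key for the container image pull policy;
    # bypasses the operator's spec.imagePullPolicy handling
    "spark.kubernetes.container.image.pullPolicy": "Always"
```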

Actual behavior

The imagePullPolicy conf is missing from the spark-submit arguments list.

Terminal Output Screenshot(s)

see logs above

Environment & Versions

Additional context

N/A

missedone commented 1 month ago

It looks like the return statement is incorrect: if, for example, imagePullSecrets is null, the function returns early and the pull policy is never emitted. See https://github.com/kubeflow/spark-operator/blob/release-2.0/internal/controller/sparkapplication/submission.go#L198-L218
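A minimal sketch of the bug pattern (simplified types and function names are my own, not the actual operator source): an options builder that returns early when the pull secrets list is empty also skips the pull-policy conf that follows it.

```go
package main

import "fmt"

// ImageSpec is a simplified stand-in for the relevant SparkApplication fields.
type ImageSpec struct {
	PullPolicy  string
	PullSecrets []string
}

// buggyImageOptions reproduces the faulty control flow: the early return
// when no pull secrets are set prevents the pull policy from being emitted.
func buggyImageOptions(spec ImageSpec) []string {
	var args []string
	if len(spec.PullSecrets) == 0 {
		return args // bug: skips the pullPolicy conf below
	}
	for _, s := range spec.PullSecrets {
		args = append(args, "--conf spark.kubernetes.container.image.pullSecrets="+s)
	}
	if spec.PullPolicy != "" {
		args = append(args, "--conf spark.kubernetes.container.image.pullPolicy="+spec.PullPolicy)
	}
	return args
}

// fixedImageOptions appends each option independently, so the pull policy
// is emitted regardless of whether pull secrets are present.
func fixedImageOptions(spec ImageSpec) []string {
	var args []string
	for _, s := range spec.PullSecrets {
		args = append(args, "--conf spark.kubernetes.container.image.pullSecrets="+s)
	}
	if spec.PullPolicy != "" {
		args = append(args, "--conf spark.kubernetes.container.image.pullPolicy="+spec.PullPolicy)
	}
	return args
}

func main() {
	spec := ImageSpec{PullPolicy: "Always"} // no pull secrets, as in the report
	fmt.Println(len(buggyImageOptions(spec))) // 0: pull policy silently dropped
	fmt.Println(len(fixedImageOptions(spec))) // 1: pull policy conf emitted
}
```

With no pull secrets configured, the buggy version drops the pull policy entirely, which matches the missing conf in the logged spark-submit arguments.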

vara-bonthu commented 1 month ago

I noticed an issue with the latest version of the Spark Operator. The deployment always defaults to imagePullPolicy: IfNotPresent, and it does not respect the user configuration.

FYI @ChenYi015 @jacobsalway

missedone commented 1 month ago

Yes, the issue exists on both the master and release-2.0 branches, i.e. in both the 1.4.x and 2.0.x versions.

PR #2222 fixes the issue and has been verified in our QA environment on AWS EKS.