kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Weird behavior: Volumes may not be attached to driver Pods after a while #912

Open zzvara opened 4 years ago

zzvara commented 4 years ago

Environment:

CoreOS latest stable (2345.3.0), Kubernetes 1.17.0 installed with Kubespray. Admission plugins: --enable-admission-plugins=NodeRestriction,MutatingAdmissionWebhook

Operator installed:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: spark-operator
  namespace: experimental
spec:
  chart:
    repository: http://storage.googleapis.com/kubernetes-charts-incubator
    name: sparkoperator
    version: 0.6.9
  releaseName: spark-operator
  values:
    enableWebhook: true
    logLevel: 4
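
Note that the operator injects the volumes and volumeMounts into driver and executor Pods through its mutating admission webhook, so enableWebhook: true only helps while the webhook registration and its endpoints stay healthy. A minimal sketch for checking this (the resource names are assumptions based on the release above):

# List the mutating webhook the operator registers; the exact name
# depends on the chart release, so the grep pattern is an assumption.
kubectl get mutatingwebhookconfigurations | grep -i spark

# An empty endpoints list means the API server skips patching the Pods
# without reporting any error on the SparkApplication itself.
kubectl -n experimental get endpoints | grep -i webhook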

The following job has a volume attached to it:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-svd-5
  namespace: experimental
spec:
  type: Scala
  mode: cluster
  nodeSelector:
    spark.sztaki.hu: allowed
  image: "zzvara/spark:3.0.0"
  imagePullPolicy: Always
  mainClass: redacted
  mainApplicationFile: redacted
  sparkVersion: "3.0.0"
  hadoopConfigMap: spark-default-configuration
  sparkConf:
    "spark.default.parallelism": "20"
    "spark.executor.extraJavaOptions": "-XX:ParallelGCThreads=10 -XX:ConcGCThreads=10 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=60"
    "spark.kubernetes.executor.deleteOnTermination": "true"
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
    "spark.kryoserializer.buffer.mb": "1024"
    "spark.driver.maxResultSize": "4g"
    # Required since the Spark Operator does not yet support Spark 3.0.0.
    "spark.kubernetes.executor.podTemplateFile": "/opt/spark/configuration/executor-template.yaml"
  restartPolicy:
    type: Never
  volumes:
    - name: executor-template
      configMap:
        name: spark-executor-template
  driver:
    cores: 4
    memory: "10288m"
    coreLimit: "6000m"
    labels:
      version: 3.0.0
    serviceAccount: spark-operator-sparkoperator
    volumeMounts:
      - name: executor-template
        mountPath: /opt/spark/configuration/executor-template.yaml
        subPath: executor-template.yaml
  executor:
    cores: 10
    coreLimit: "12000m"
    instances: 4
    memory: "50g"
    labels:
      version: 3.0.0
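
The spark-executor-template ConfigMap referenced above is not shown here. For reproduction, a minimal sketch of creating it; the template content below is a placeholder assumption, not the original file:

# Placeholder pod template; the reporter's actual executor-template.yaml
# was not included in the report.
cat > executor-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    spark.sztaki.hu: allowed
EOF
kubectl -n experimental create configmap spark-executor-template \
    --from-file=executor-template.yaml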

Examining the driver Pod spec after job start (submitting the above YAML), the operator sometimes attaches the executor-template volume to the driver Pod and sometimes does not. The behavior follows a pattern: after the Spark Operator is restarted (its Pod deleted), the operator attaches the volume correctly for roughly the next 5-10 submissions. After that, consecutive restarts of the SparkApplication (kubectl delete sparkapp && kubectl apply -f ...) produce driver Pods without the volume. There are no error logs in the Spark Operator; the Spark driver simply fails because executor-template.yaml is missing.
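
A quick way to tell on each run whether the webhook patched the driver Pod (the Pod name follows the usual <app-name>-driver convention):

# An empty result means the webhook did not inject the volume on
# this particular submission.
kubectl -n experimental get pod spark-svd-5-driver \
    -o jsonpath='{.spec.volumes[?(@.name=="executor-template")]}'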

bscaleb commented 7 months ago

I tried what you have here and the SparkApplication failed to submit because the file isn't present on the spark-operator Pod. Did you also have to mount the template file on the spark-operator Pod?
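
If spark-submit, which the operator runs inside its own Pod, resolves the executor template at submission time, the file would also have to exist on the operator Pod. One way to check that, assuming the operator Deployment is named spark-operator:

# "No such file or directory" here would explain the failed submission.
kubectl -n experimental exec deploy/spark-operator -- \
    ls /opt/spark/configuration/executor-template.yaml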

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.