kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Unable to attach or mount volumes #1103

Open laughingman7743 opened 3 years ago

laughingman7743 commented 3 years ago

I am using this operator on GKE. Sometimes the secret volume for the GCP service account specified in the secrets field fails to mount, neither the driver nor the executor is started, and the application does not run, remaining in a status of None. What is the cause? I would like to know how to solve it.

The event log for K8s is as follows.

Events from the 792481ef6ceb265143708d6cc3a3a7985fd7618d-driver Pod #automatic-restart:true ...
1 Scheduled: Successfully assigned namespace-prod/792481ef6ceb265143708d6cc3a3a7985fd7618d-driver to cluster-prod--general-e2-custo-72de20ed-7sq4
1 FailedMount: MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "792481ef6ceb265143708d6cc3a3a7985fd7618d-1607190868090-driver-conf-map" not found
1 FailedMount: MountVolume.SetUp failed for volume "spark-conf-volume" : object "namespace-prod"/"792481ef6ceb265143708d6cc3a3a7985fd7618d-1607190868090-driver-conf-map" not registered

New events emitted by the default-scheduler seen at 2020-12-05 17:54:30 +0000 UTC

Events from the 792481ef6ceb265143708d6cc3a3a7985fd7618d-driver Pod #automatic-restart:true ...
2 FailedMount: MountVolume.SetUp failed for volume "spark-conf-volume" : object "namespace-prod"/"792481ef6ceb265143708d6cc3a3a7985fd7618d-1607190868090-driver-conf-map" not registered

Events emitted by the kubelet seen at 2020-12-05 17:54:31 +0000 UTC

Events from the 792481ef6ceb265143708d6cc3a3a7985fd7618d-driver Pod #automatic-restart:true ...
1 FailedMount: Unable to attach or mount volumes: unmounted volumes=[service-account-volume spark-conf-volume spark-token-kd2p9 spark-local-dir-1], unattached volumes=[service-account-volume spark-conf-volume spark-token-kd2p9 spark-local-dir-1]: timed out waiting for the condition

New events emitted by the kubelet seen at 2020-12-05 17:56:32 +0000 UTC

The K8s manifest is as follows.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: 792481ef6ceb265143708d6cc3a3a7985fd7618d
  namespace: my-project-prod
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/my-project-prod/spark:v2.4.5"
  imagePullPolicy: IfNotPresent
  imagePullSecrets:
  - gcr-image-puller-service-account
  hadoopConf:
    "fs.gs.project.id": "my-project-prod"
    "fs.gs.system.bucket": "my-project-prod-spark"
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/service-account.json"
  sparkConf:
    "spark.ui.enabled": "false"
    "spark.executor.memoryOverhead": "2G"
    "spark.speculation": "true"
    "spark.speculation.multiplier": "3"
    "spark.speculation.quantile": "0.9"
    "spark.network.timeout": "300s"
    "spark.kubernetes.local.dirs.tmpfs": "true"
  mainApplicationFile: "gs://my-project-prod-spark/artifacts/myjar-assembly-0.0.1.jar"
  mainClass: xxx.yyy.xxx.Main
  sparkVersion: "2.4.5"
  arguments:
  - "--foo"
  - "bar"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 5
    onFailureRetryInterval: 60
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 60
  driver:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.5
    serviceAccount: spark
    secrets:
    - name: service-account
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: my-project-prod
  executor:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    cores: 3
    instances: 8
    memory: "8192m"
    labels:
      version: 2.4.5
    secrets:
    - name: service-account
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: my-project-prod
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
      port: 8090
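
For reference, a secrets entry with secretType GCPServiceAccount is expected to produce a secret volume, a mount, and a credentials environment variable on the driver pod roughly like the fragment below. This is only a sketch reconstructed from the FailedMount events above and the spark-submit arguments in the operator log later in this thread; Spark's Kubernetes backend names the volume "<secretName>-volume", which is the "service-account-volume" the events report as unmountable.

# Sketch of the expected driver pod spec fragment for the secrets entry above.
# Reconstructed from the events and operator log in this issue, not taken from a live pod.
spec:
  volumes:
  - name: service-account-volume            # "<secretName>-volume", as listed in the FailedMount events
    secret:
      secretName: service-account
  containers:
  - name: spark-kubernetes-driver           # default driver container name in Spark 2.4
    image: gcr.io/my-project-prod/spark:v2.4.5
    volumeMounts:
    - name: service-account-volume
      mountPath: /mnt/secrets
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS  # injected by the operator for secretType GCPServiceAccount
      value: /mnt/secrets/key.json          # matches spark.kubernetes.driverEnv in the operator log
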
liyinan926 commented 3 years ago

So the mounting only failed sometimes, right? Have you checked the operator logs?

laughingman7743 commented 3 years ago

Yes, it happens sometimes; I do not know under what conditions it reproduces. The operator log is as follows.

I1205 17:54:30.631496       9 controller.go:265] Ending processing key: "my-project-prod/792481ef6ceb265143708d6cc3a3a7985fd7618d"
W1205 17:54:30.631426       9 submission.go:75] trying to resubmit an already submitted SparkApplication my-project-prod/792481ef6ceb265143708d6cc3a3a7985fd7618d
I1205 17:54:30.482344       9 spark_pod_eventhandler.go:77] Pod 792481ef6ceb265143708d6cc3a3a7985fd7618d-driver deleted in namespace my-project-prod.
I1205 17:54:30.478732       9 spark_pod_eventhandler.go:58] Pod 792481ef6ceb265143708d6cc3a3a7985fd7618d-driver updated in namespace my-project-prod.
I1205 17:54:29.888663       9 spark_pod_eventhandler.go:58] Pod 792481ef6ceb265143708d6cc3a3a7985fd7618d-driver updated in namespace my-project-prod.
I1205 17:54:29.864749       9 spark_pod_eventhandler.go:58] Pod 792481ef6ceb265143708d6cc3a3a7985fd7618d-driver updated in namespace my-project-prod.
I1205 17:54:29.849847       9 spark_pod_eventhandler.go:47] Pod 792481ef6ceb265143708d6cc3a3a7985fd7618d-driver added in namespace my-project-prod.
I1205 17:54:26.077485       9 submission.go:65] spark-submit arguments: [/opt/spark/bin/spark-submit --class com.kouzoh.data.deequ.DataQualityChecker --master k8s://https://10.32.128.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=my-project-prod --conf spark.app.name=792481ef6ceb265143708d6cc3a3a7985fd7618d --conf spark.kubernetes.driver.pod.name=792481ef6ceb265143708d6cc3a3a7985fd7618d-driver --conf spark.kubernetes.container.image=gcr.io/my-project-prod/spark:v2.4.5 --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent --conf spark.kubernetes.container.image.pullSecrets=gcr-image-puller-service-account --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.speculation.quantile=0.9 --conf spark.ui.enabled=false --conf spark.speculation=true --conf spark.speculation.multiplier=3 --conf spark.metrics.namespace=my-project-prod.792481ef6ceb265143708d6cc3a3a7985fd7618d --conf spark.executor.memoryOverhead=2G --conf spark.kubernetes.local.dirs.tmpfs=true --conf spark.network.timeout=300s --conf spark.metrics.conf=/etc/metrics/conf/metrics.properties --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem --conf spark.hadoop.fs.gs.project.id=my-project-prod --conf spark.hadoop.fs.gs.system.bucket=my-project-prod-spark --conf spark.hadoop.google.cloud.auth.service.account.enable=true --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/mnt/secrets/service-account.json --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=792481ef6ceb265143708d6cc3a3a7985fd7618d --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=63d63808-f001-42e6-8dac-352a58b620c0 --conf spark.driver.cores=1 --conf spark.kubernetes.driver.limit.cores=1200m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driver.label.version=2.4.5 --conf spark.kubernetes.driver.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict=false --conf spark.kubernetes.driver.annotation.prometheus.io/scrape=true --conf spark.kubernetes.driver.annotation.prometheus.io/port=8090 --conf spark.kubernetes.driver.annotation.prometheus.io/path=/metrics --conf spark.driver.extraJavaOptions=-javaagent:/prometheus/jmx_prometheus_javaagent-0.11.0.jar=8090:/etc/metrics/conf/prometheus.yaml --conf spark.kubernetes.driver.secrets.service-account=/mnt/secrets --conf spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS=/mnt/secrets/key.json --conf spark.kubernetes.driverEnv.GCS_PROJECT_ID=my-project-prod --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=792481ef6ceb265143708d6cc3a3a7985fd7618d --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=63d63808-f001-42e6-8dac-352a58b620c0 --conf spark.executor.instances=8 --conf spark.executor.cores=3 --conf spark.executor.memory=8192m --conf spark.kubernetes.executor.label.version=2.4.5 --conf spark.kubernetes.executor.annotation.prometheus.io/scrape=true --conf spark.kubernetes.executor.annotation.prometheus.io/port=8090 --conf spark.kubernetes.executor.annotation.prometheus.io/path=/metrics --conf spark.kubernetes.executor.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict=false --conf 
spark.executor.extraJavaOptions=-javaagent:/prometheus/jmx_prometheus_javaagent-0.11.0.jar=8090:/etc/metrics/conf/prometheus.yaml --conf spark.kubernetes.executor.secrets.service-account=/mnt/secrets --conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/mnt/secrets/key.json --conf spark.executorEnv.GCS_PROJECT_ID=my-project-prod gs://my-project-prod-spark/artifacts/myjar-assembly-0.0.1.jar --foo bar]
I1205 17:54:26.058452       9 event.go:274] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"my-project-prod", Name:"792481ef6ceb265143708d6cc3a3a7985fd7618d", UID:"818da2e8-bf7d-4d7b-ae56-444515629d59", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"1186417284", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationAdded' SparkApplication 792481ef6ceb265143708d6cc3a3a7985fd7618d was added, enqueuing it for submission
I1205 17:54:26.058211       9 controller.go:258] Starting processing key: "my-project-prod/792481ef6ceb265143708d6cc3a3a7985fd7618d"
I1205 17:54:26.058149       9 controller.go:179] SparkApplication my-project-prod/792481ef6ceb265143708d6cc3a3a7985fd7618d was added, enqueueing it for submission
eosantigen commented 3 years ago

Hi all.

Instead of opening an identical ticket, I will state here that I also have the same problem.

Webhook enabled: True
Operator version: v1beta2-1.1.1-2.4.5
Environment: AKS

The manifest contains:

  volumes:
    - name: stream-checkpoint
      persistentVolumeClaim:
        claimName: stream-checkpoint-datalake
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "2000m"
    labels:
      version: 3.0.0
      app: spark-streams-datalake-tmpbackfill
    serviceAccount: dull-crocodile-sparkoperator
    volumeMounts:
      - name: stream-checkpoint
        mountPath: /stream_checkpoint_dir

It is applied and the app starts as normal. However, on a describe of the driver pod, I get the following:

    Mounts:
      /opt/spark/conf from spark-conf-volume (rw)
      /var/data/spark-bc961e80-ca45-43a2-a2eb-efd11e038126 from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from dull-crocodile-sparkoperator-token-dzgmw (ro)
Volumes:
  spark-local-dir-1:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  spark-conf-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-streams-datalake-tmpbackfill-1618821902230-driver-conf-map
    Optional:  false
  dull-crocodile-sparkoperator-token-dzgmw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  dull-crocodile-sparkoperator-token-dzgmw
    Optional:    false

My custom PVC mount is nowhere to be found. The operator logs show absolutely nothing regarding this mount.
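
For context on why a mount can silently go missing: the volumes, volumeMounts, and secrets fields of a SparkApplication are applied to the driver and executor pods by the operator's mutating admission webhook. If the webhook is disabled, unreachable, or failed to start (see the port issue mentioned further down this thread), the pods come up without the user-specified mounts and the operator logs nothing about it. Below is a minimal sketch of the webhook-related Helm value to double-check; the key names are an assumption that varies between chart versions, so verify against "helm show values" for the installed chart.

# Hypothetical values.yaml override for the spark-operator Helm chart.
# Key names differ between chart versions (older charts used a flat enableWebhook key),
# so treat this as a sketch and confirm with "helm show values".
webhook:
  enable: true   # required for volumes/volumeMounts/secrets to be injected into the pods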

kevinchcn commented 3 years ago

Hi @eosantigen, have you solved your problem? I have the same problem on GKE. Can you share the solution? Thanks!

eosantigen commented 3 years ago

Hi @eosantigen, have you solved your problem? I have the same problem on GKE. Can you share the solution? Thanks!

Hi, no, it has not been solved, but I am planning to upgrade to the latest version. Thanks.

kevinchcn commented 3 years ago

Thanks @eosantigen for the reply. Have you tried the new version yet?

kevinchcn commented 3 years ago

My issue has been solved. The admission webhook failed to bind to port 443; switching to port 8080 worked.
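
For anyone making the same change, the webhook port is normally set through the operator's Helm values; a hedged sketch follows (again, the key names are an assumption and differ between chart versions, so confirm with "helm show values"). A possible reason 443 failed is that binding a port below 1024 requires extra privileges inside a non-root container.

# Hypothetical values.yaml override for the spark-operator Helm chart;
# older chart versions exposed this as a flat webhookPort key instead.
webhook:
  enable: true
  port: 8080   # unprivileged port; binding 443 can fail in a non-root container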

laughingman7743 commented 3 years ago

@kevinchcn I have tried updating my operator to the latest version but that has not solved the problem. If you don't mind, could you please share the configuration that solved this problem?