kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Config Maps and Volumes are not getting mounted #1619

Open infa-madhanb opened 2 years ago

infa-madhanb commented 2 years ago

Our GKE cluster is running on Kubernetes version v1.21.14. Pods were running fine until yesterday; now ConfigMaps and Volumes are not getting mounted.

Deployment Mode: Helm Chart

Helm Chart Version: 1.1.0

Image: v1beta2-1.2.3-3.1.1

Kubernetes Version: 1.21.14

Helm command to install: helm install spark-operator --namespace *** --set image.tag=v1beta2-1.2.3-3.1.1 --set webhook.enable=true -f values.yaml

The Spark operator pod starts successfully after the webhook-init pod completes.

But my application pod managed by the Spark Operator is unable to come up due to the error below:

Events:
  Type     Reason       Age                From               Message
  ----     ------       ---                ----               -------
  Normal   Scheduled    49m                default-scheduler  Successfully assigned pod/re-driver to gke w--taints-6656b326-49of
  Warning  FailedMount  49m                kubelet            MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-8c0f12839ca69805-conf-map" not found
  Warning  FailedMount  27m (x3 over 40m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[re-checkpoint], unattached volumes=[spark-conf-volume-driver kube-api-access-lsflz re-checkpoint app-conf-vol cert-secret-volume spark-local-dir-1]: timed out waiting for the condition
  Warning  FailedMount  20m (x2 over 45m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[re-checkpoint], unattached volumes=[kube-api-access-lsflz re-checkpoint app-conf-vol cert-secret-volume spark-local-dir-1 spark-conf-volume-driver]: timed out waiting for the condition

jiamin13579 commented 2 years ago

Is there a solution?

Fiorellaps commented 2 years ago

Any update?

Elsayed91 commented 1 year ago

MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-a4a28f849e410e3b-conf-map" not found (FailedMount)

This happens with default settings.

jnkroeker commented 1 year ago

Seeing the exact error mentioned by @Liftingthedata as well, even after attaching the cluster-admin ClusterRole (via a new ClusterRoleBinding) to the ServiceAccount created by the chart installation and specifying that ServiceAccount in examples/spark-pi.yaml.

Executing kubectl describe sparkapplication spark-pi --namespace <your ns> reveals that it is the spark-pi-driver that is failing. Inspecting the spark-pi-driver pod with kubectl describe pod spark-pi-driver --namespace <your ns> shows the kubelet MountVolume.SetUp failure message directly after pulling image "gcr.io/spark-operator/spark:v3.1.1". Is this perhaps an error within the image being pulled?

Please help!

pradithya commented 1 year ago

I encountered a similar issue. The problem frequently happens when the Spark operator is under-provisioned or under high load.

jnkroeker commented 1 year ago

@pradithya could you please share your node configuration?

sunnysmane commented 1 year ago

Is there any solution for this? I am also facing a similar issue. I am not sure, but I think the driver pod is trying to mount the ConfigMap before it has been created, which is why the ConfigMap is reported as not found.

Warning FailedMount 66s kubelet MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-f42632859b918eee-conf-map" not found

RSKriegs commented 1 year ago

I began to encounter it constantly once I introduced environment variables to my Scala script (and modified the K8s manifest accordingly). I haven't found out how to solve it yet.

EDIT: quite a late update, but this was mainly memory leaks (insufficient resources) on my side :)

ericklcl commented 1 year ago

Hi guys, any solution for this issue?

GaryLouisStewart commented 1 year ago

Hi All,

Experienced this when we migrated to the Helm chart installation of the Spark operator: our volumes were defined correctly via ConfigMaps, but Kubernetes was erroring out.

Make sure you have the following settings enabled in the Helm chart:

webhook:
  # -- Enable webhook server
  enable: true
  namespaceSelector: "spark-webhook-enabled=true"

Then label the spark namespace (or target namespace for your spark jobs) with:

spark-webhook-enabled=true
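
For example, assuming the Spark jobs run in a namespace called spark (adjust to your own namespace), the label can be applied with:

kubectl label namespace spark spark-webhook-enabled=true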

We found that was enough to get it working.

Application side

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: example-test
  namespace: spark
spec:
  schedule: "31 12 * * *"
  concurrencyPolicy: Allow
  template:
    timeToLiveSeconds: 1200
    type: Python
    arguments:
      - --config-file-name=/opt/spark/work-dir/config/config.ini
    sparkConf:
      spark.kubernetes.decommission.script: "/opt/decom.sh"
      .  . .
    hadoopConf:
      fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
      . . .
    mode: cluster
    imagePullPolicy: Always
    mainApplicationFile: local:///opt/spark/work-dir/run.py
    sparkVersion: "3.2.1"
    restartPolicy:
        type: Never
    driver:
      cores: 1
      coreLimit: "500m"
      memory: "500m"
      labels:
        version: 3.2.1
      serviceAccount: job-tole
      volumeMounts:
        - name: "config"
          mountPath: "/opt/spark/work-dir/config"
    executor:
      cores: 1
      instances: 1
      memory: "500m"
      labels:
        version: 3.2.1
      volumeMounts:
        - name: "config"
          mountPath: "/opt/spark/work-dir/config"
    volumes:
      - name: "config"
        configMap:
          name: "config"
          items:
            - key: "config.ini"
              path: "config.ini" 

Please note: if you don't use the Helm chart you still need to enable the webhook; otherwise the spark-operator won't be able to create the right ConfigMaps and volume mounts on the driver and executor pods when they are spawned.

JunaidChaudry commented 1 year ago

I have enabled the webhook and am using the namespaceSelector with the correct selectors, but I am still having the issue. Any ideas? I am using the latest version. I have also tried enabling the webhook on all namespaces, but I still face the same issue.

I am unable to use tolerations either.

balkrishan333 commented 1 year ago

It does not work for me either. I followed the steps mentioned above but still get the same error.

balkrishan333 commented 1 year ago

I have enabled the webhook and am using the namespaceSelector with the correct selectors, but I am still having the issue. Any ideas? I am using the latest version. I have also tried enabling the webhook on all namespaces, but I still face the same issue.

I am unable to use tolerations either.

Did you manage to solve this issue?

percymehta commented 1 year ago

The spark-pi job just hangs, either before the driver is initialized or after the driver starts running. I do see the config-map mount error in the driver's events, but the ConfigMap does get created afterwards. Is this a resource problem? I'm running this on minikube with 4 CPUs and 8GB memory!

JunaidChaudry commented 1 year ago

My issue was the webhook port: for some reason it no longer runs on the default port, so I had to update the port to 443 based on the docs here, even though I'm on EKS instead of GKE.
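
For reference, with the Helm chart this corresponds to the chart's webhook.enable and webhook.port values; a sketch of setting them on the command line (the release name, chart reference, and namespace here are assumptions, adjust to your install):

helm upgrade spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --set webhook.enable=true \
  --set webhook.port=443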

balkrishan333 commented 1 year ago

My issue was the webhook port: for some reason it no longer runs on the default port, so I had to update the port to 443 based on the docs here, even though I'm on EKS instead of GKE.

Thank you. I am using AKS and it worked for me as well.

davidmirror-ops commented 11 months ago

I've done everything mentioned here with no success.

jalkjaer commented 9 months ago

The webhook only adds volumes if the driver/executor has a volumeMount for them: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/pkg/webhook/patch.go#L138-L143. The same goes for configMaps: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/pkg/webhook/patch.go#L335C1-L339

The code doesn't check whether a driver/executor initContainer or sidecar mounts the volumes. As a workaround, you just have to add the volumeMounts directly to the driver/executor spec as well, as in the sketch below.
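
A minimal sketch of that workaround, assuming a ConfigMap-backed volume named config-vol that is otherwise only mounted by a sidecar (all names here are illustrative, not from the original reports):

spec:
  volumes:
    - name: config-vol
      configMap:
        name: my-config              # hypothetical ConfigMap
  driver:
    sidecars:
      - name: log-shipper            # the container that actually needs the volume
        image: busybox
        volumeMounts:
          - name: config-vol
            mountPath: /opt/config
    volumeMounts:                    # duplicated on the driver itself so the
      - name: config-vol             # webhook considers the volume "used"
        mountPath: /opt/config
  executor:
    volumeMounts:
      - name: config-vol
        mountPath: /opt/config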

BCantos17 commented 6 months ago

Bump, is there any other possible solution? I have tried everything above with no success. Sharing my configuration in case it helps. Using helm chart 1.1.27 and v1beta2-1.3.8-3.1.1.

values.yaml

# https://github.com/kubeflow/spark-operator/tree/master/charts/spark-operator-chart
nameOverride: spark-operator
fullnameOverride: spark-operator

image:
  # -- Image repository
  repository: ghcr.io/googlecloudplatform/spark-operator
  # -- Image pull policy
  pullPolicy: IfNotPresent
  # -- if set, override the image tag whose default is the chart appVersion.
  tag: "v1beta2-1.3.8-3.1.1"

imagePullSecrets: 
  - name: regcred

sparkJobNamespace: spark-operator

resources:
  limits:
    cpu: 1
    memory: 512Mi
  requests:
    cpu: 1
    memory: 512Mi

webhook:
  enable: true
  port: 443
  namespaceSelector: "spark-webhook-enabled=true"

SparkApplication manifest

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi2
  namespace: spark-operator
spec:
  type: Scala
  mode: cluster
  image: "apache/spark:3.4.2"
  imagePullPolicy: IfNotPresent
  imagePullSecrets:
    - regcred
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.4.2.jar"
  sparkVersion: "3.4.2"
  timeToLiveSeconds: 600
  restartPolicy:
    type: Never
  volumes:
    - name: config-vol
      configMap:
        name: cm-spark-extra
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.4.2
    serviceAccount: airflow-next
    volumeMounts:
      - name: config-vol
        mountPath: /mnt/cm-spark-extra
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.4.2
    volumeMounts:
      - name: config-vol
        mountPath: /mnt/cm-spark-extra

Here is the container and volume spec of the pod being spun up

spec:
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
        defaultMode: 420
    - name: spark-local-dir-1
      emptyDir: {}
    - name: spark-conf-volume-driver
      configMap:
        name: spark-drv-b14d8f8f2b497a58-conf-map
        items:
          - key: spark.properties
            path: spark.properties
            mode: 420
        defaultMode: 420
    - name: kube-api-access-tf4pb
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: spark-kubernetes-driver
      image: apache/spark:3.4.2
      args:
        - driver
        - '--properties-file'
        - /opt/spark/conf/spark.properties
        - '--class'
        - org.apache.spark.examples.SparkPi
        - local:///opt/spark/examples/jars/spark-examples_2.12-3.4.2.jar
      ports:
        - name: driver-rpc-port
          containerPort: 7078
          protocol: TCP
        - name: blockmanager
          containerPort: 7079
          protocol: TCP
        - name: spark-ui
          containerPort: 4040
          protocol: TCP
      env:
        - name: SPARK_USER
          value: root
        - name: SPARK_APPLICATION_ID
          value: spark-2d80cebdab33400b83cbfe61fd09faee
        - name: SPARK_DRIVER_BIND_ADDRESS
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: SPARK_LOCAL_DIRS
          value: /var/data/spark-1505b0f4-c95a-4bba-aa76-9451b34af9ea
        - name: SPARK_CONF_DIR
          value: /opt/spark/conf
        - name: AWS_STS_REGIONAL_ENDPOINTS
          value: regional
        - name: AWS_DEFAULT_REGION
          value: us-east-1
        - name: AWS_REGION
          value: us-east-1
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::123456:role/my-sa
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      resources:
        limits:
          cpu: 1200m
          memory: 896Mi
        requests:
          cpu: '1'
          memory: 896Mi
      volumeMounts:
        - name: spark-local-dir-1
          mountPath: /var/data/spark-1505b0f4-c95a-4bba-aa76-9451b34af9ea
        - name: spark-conf-volume-driver
          mountPath: /opt/spark/conf
        - name: kube-api-access-tf4pb
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        - name: aws-iam-token
          readOnly: true
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent

I am at a loss here. spark-conf-volume-driver is set up from spark-drv-b14d8f8f2b497a58-conf-map, and I can see that ConfigMap in my cluster. I am on EKS, have set webhook.enable to true and the port to 443, and applied the workaround of configuring a volumeMount under both the driver and executor, although I do not see it on my pod; I am not sure whether that matters. As far as I know, everything is configured correctly. Can somebody help?
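
One way to check whether the mutating webhook is registered and whether it actually patched the driver pod (a diagnostic sketch; the pod and namespace names are assumptions based on the manifest above):

kubectl get mutatingwebhookconfigurations | grep -i spark
kubectl -n spark-operator get pod spark-pi2-driver -o yaml | grep -A2 config-vol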

luis-fnogueira commented 6 months ago

I'm having the same issue on OCI as @BCantos17 describes above, and I've followed the same steps.

thof commented 5 months ago

I also noticed this problem with a very limited CPU (resources.limits.cpu: "100m") and two concurrent Spark apps. It's very consistent, and in that case the ConfigMap for the driver was created for only one of the apps. After updating the resources (requests to 1 CPU, and no limit) this odd behavior disappeared.

dannyeuu commented 3 months ago

This still happens; with a high number of spark-submit submissions it sometimes occurs.

pradithya commented 3 months ago

@dannyeuu Try increasing the Spark operator's CPU request/limit. I encountered this issue when the operator was experiencing high utilization/throttling.
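
For example, with the Helm chart the operator's resources can be raised via values.yaml; a sketch, with numbers that are assumptions to be tuned for your load (some users above also report better behavior with no CPU limit at all):

resources:
  requests:
    cpu: 2
    memory: 1Gi
  limits:
    memory: 1Gi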

jacobsalway commented 3 months ago

Curious if this shares a root cause with another issue I saw. Does anyone see client-side throttling logs for the operator? They should look something like this:

Waited for ... due to client-side throttling, not priority and fairness ...
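
One way to look for those messages (a sketch; the deployment name and namespace are assumptions based on the chart defaults used elsewhere in this thread):

kubectl -n spark-operator logs deploy/spark-operator | grep "client-side throttling"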

cometta commented 2 months ago

I faced this issue when using an initContainer; if I don't use the initContainer, there is no error.

J1635 commented 2 months ago

In my case the ConfigMap was created, but the pod was crashing with MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-###############-conf-map" not found.

I viewed the logs of the pod before it failed and saw that it was actually crashing for another, unrelated reason.

After fixing that issue, the configmap error went away.
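
For anyone else checking this, logs from the previous (crashed) container instance can be retrieved with something like the following; the pod name and namespace are placeholders:

kubectl logs <driver-pod-name> --previous -n <namespace>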

karanalang commented 1 month ago

I'm facing a similar issue: the ConfigMap is created, but I still get the error ->

-poc--bbe7876b-cbcu 3m8s Warning FailedMount pod/structured-streaming-313-1727908779-driver MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-7b6887924f646fe0-conf-map" not found

I'm on version 1.1.27 (Spark 3.1.1). Any solutions for this?

MardanovTimur commented 1 week ago

Any update ?

MardanovTimur commented 6 hours ago

What I did: we launch all our tasks via the Airflow SparkKubernetesOperator. I created a pool for all spark-kuber tasks with 2-3 slots and added an additional sleep (40 seconds) after the "sparkapplication" K8s resource is created. That helped in my case; still no error after 2 weeks.